Experiment: To gain deeper insight into the dynamics of satellite phages in natural environments, we used the wealth of available metagenomic sequences to categorize the extremely diverse and ubiquitous satellite phage families while deepening our understanding of satellite genetic structure. We searched metagenomic samples for satellite phage protein clusters using SaPhARI, assessing its ability to find and tag those clusters.
Results: Identified Protein Clusters
Using SaPhARI at varied protein parameter stringencies, we searched 7,659 metagenomic contigs, each shorter than 20,000 bp, representing 37 metagenomic samples. We identified 87 putative satellite phage protein clusters representing 5 of the 6 developed satellite family models:
- PICMI - 41
- cfPICI - 21
- PICI - 2
- P4-like - 2
- Phagelet - 1
The only family not identified was the PLEs, which have so far been characterized only in Vibrio cholerae.
For each satellite model, the parameters are defined in the “proteins” variable, a Python list of the proteins that the respective satellite family is expected to contain. For example, the PICI model shown above has 5 protein parameters: primase, integrase, decor, either “major head” or “major capsid”, and either 'alpa' or 'icd-like'. To set stringency in our experiment, we changed each model's “minnumber” variable, which defines the minimum number of these proteins that must be present for SaPhARI to tag a cluster as that satellite family. With the PICI model shown above, a protein cluster is identified as a PICI if it contains 3 of the 5 protein groups defined in the model. A stringency score of 0 sets minnumber to the full number of protein parameters (5 for PICI), and each step down lowers it by one, so a score of -1 sets it to 4.
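The tagging rule described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the nested lists used to express "either/or" groups, and the function names are assumptions made for the example, not SaPhARI's actual model format or code.

```python
# Hypothetical model layout; SaPhARI's real model definition may differ.
PICI_MODEL = {
    "proteins": [
        ["primase"],
        ["integrase"],
        ["decor"],
        ["major head", "major capsid"],  # either keyword satisfies this group
        ["alpa", "icd-like"],            # either keyword satisfies this group
    ],
    "minnumber": 3,
}


def count_matched_groups(model, annotations):
    """Count protein groups with at least one keyword found in any annotation."""
    lowered = [text.lower() for text in annotations]
    return sum(
        any(keyword in text for keyword in group for text in lowered)
        for group in model["proteins"]
    )


def is_tagged(model, annotations):
    """A cluster is tagged when enough protein groups are present."""
    return count_matched_groups(model, annotations) >= model["minnumber"]


def minnumber_for_score(model, score):
    """Score 0 requires every group; each step down drops one requirement."""
    return len(model["proteins"]) + score
```

For instance, a cluster annotated with a primase, an integrase, and a major capsid protein matches three groups and would be tagged at a minnumber of 3, but not at the stricter values of 4 or 5.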
Starting from a stringency score of 0, we relaxed each model in up to five steps, lowering the minimum number of required proteins by one per step. The stringency scores were therefore 0, -1, -2, -3, -4, and -5. The more lenient scores were not run for every model, because no model's minimum was allowed to fall below half of its protein parameters. For example, the most lenient score used for the PICI model was -2, which requires 3 of its 5 proteins to be present.
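As a rough illustration of this schedule, the sketch below lists the scores that would be run for a model of a given size. The function name is hypothetical, and it assumes that the lower bound is half of the protein parameters rounded up and that a minimum exactly equal to half is still allowed.

```python
import math


def stringency_schedule(total_proteins, most_lenient_score=-5):
    """Return (score, minnumber) pairs for one model.

    Assumption: relaxation stops before minnumber would fall below
    half of the model's protein parameters (rounded up).
    """
    lower_bound = math.ceil(total_proteins / 2)
    schedule = []
    for score in range(0, most_lenient_score - 1, -1):
        minnumber = total_proteins + score
        if minnumber < lower_bound:
            break
        schedule.append((score, minnumber))
    return schedule
```

Under these assumptions, a five-protein model such as PICI stops at a score of -2 (3 of 5), and a three-protein model such as PICMI stops at -1 (2 of 3).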
PICMI had the most lenient model, with three potential protein parameters (primase, integrase, and either 'alpa' or 'icd-like'), all located within 15,000 bp of each other. Four putative PICMIs met the requirement of all three protein parameters, while the remaining 37 were found at a stringency score of -1, which requires only two of the parameters.
Two putative PICIs were detected at a stringency score of -2, which requires three of the five possible protein parameters to be present within 15,000 bp of each other, making PICI the second least stringent model. The P4-like model was the strictest, with twelve possible protein parameters within 12,000 bp, and needed the greatest relaxation for detection: both putative P4-like satellites were found at a stringency score of -5.
A single cfPICI was found at a stringency score of -3, which requires seven of the ten protein parameters to be present within 15,000 bp of each other, making cfPICI the second most stringent model. The remaining 20 cfPICIs were found at a stringency score of -4. Lastly, the sole Mycobacterium phage satellite, "phagelet," was found at a stringency score of -3, which requires three of the six protein parameters to be present within 12,000 bp of each other.
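The distance constraint used by each model (15,000 bp or 12,000 bp) can be pictured as a sliding window over the protein hits on a contig. The sketch below is one way such a check could be written. It measures distance between hit start positions, which is an assumption, and the function name and input format are hypothetical and do not describe SaPhARI's implementation.

```python
def max_groups_in_window(hits, window_bp):
    """Largest number of distinct protein groups whose hits fit in one window.

    hits: list of (start_bp, group_name) tuples for one contig.
    Assumption: distance is measured between hit start positions.
    """
    ordered = sorted(hits)
    best = 0
    left = 0
    for right in range(len(ordered)):
        # Shrink the window until its span fits within window_bp.
        while ordered[right][0] - ordered[left][0] > window_bp:
            left += 1
        groups = {group for _, group in ordered[left:right + 1]}
        best = max(best, len(groups))
    return best
```

A cluster would then pass a model when the returned count is at least the model's minnumber for the stringency score being run.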
Data Analysis:
To assess whether the putative satellites could be full phages, we analyzed every contig containing an identified satellite with PHASTEST to infer prophage status and with BLAST to assess similarity to known species. The results were compiled into a CSV file accessible below:
Click here to download CSV
Within the 82 contigs analyzed with PHASTEST, 44 prophage/phage regions were identified. Of these, 20 were classified as "complete" by the PHASTEST algorithm. Notably, 12 of the 20 "complete" prophage regions were identified as cfPICIs. Additionally, 14 regions were categorized as "questionable," spanning various satellite families—5 were cfPICIs, 8 were PICMIs, and 1 was a “phagelet”. All 9 regions classified as "incomplete" were identified as PICMIs.
Of the 43 satellites not detected by PHASTEST, 37 were PICMIs. This was expected, as 29 of the 41 PICMIs had a query coverage of less than 10% for their top hit in a nucleotide BLAST, highlighting how poorly characterized these phage satellite sequences are. Furthermore, only 18 of the 82 metagenomic contigs analyzed by BLAST had both query coverage and identity greater than 90%, further demonstrating how little of satellite phage diversity has been characterized.
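The two BLAST thresholds used here (top-hit query coverage below 10%, and both coverage and identity above 90%) amount to simple filters over the per-contig results. A minimal sketch follows. The field names and the function names are assumptions made for illustration and are not taken from the project's CSV file.

```python
def well_characterized(rows, threshold=90.0):
    """Contigs whose top hit exceeds the threshold for coverage and identity."""
    return [
        row["contig"]
        for row in rows
        if row["query_coverage"] > threshold and row["identity"] > threshold
    ]


def poorly_covered(rows, threshold=10.0):
    """Contigs whose top hit covers less than the threshold of the query."""
    return [row["contig"] for row in rows if row["query_coverage"] < threshold]
```

Each row is assumed to be a dictionary holding a contig name and the top hit's query coverage and identity as percentages.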
Reliance solely on nucleotide similarity can yield limited results and overlook numerous satellite phages hidden within bacterial genomes. SaPhARI addresses the urgent need for novel bioinformatic approaches that move beyond nucleotide similarity.
All contig FASTA files, original SaPhARI outputs, contig BLAST results, and PHASTEST results are available in our GitLab in the folder titled “METAGENOMIC_SATELLITES”.
Interpretation:
The discovery of these 87 putative satellite phage clusters only scratches the surface of what SaPhARI can offer for identifying satellite protein clusters in metagenomic data. Our team continues to analyze more metagenomic data daily, seeking to characterize more satellite phage systems for future applications in synthetic biology.