Software

Summary of Software Tool

SaPhARI: Satellite Phage Algorithmic Recognition and Interpretation


SaPhARI (Satellite Phage Algorithmic Recognition and Interpretation) is a versatile bioinformatics tool designed to detect and classify bacteriophage satellites with precision and flexibility. By identifying clusters of key proteins within genomic data, SaPhARI enhances the discovery process, enabling efficient analysis of satellite phages and expanding their potential applications as parts in scientific research.

Unlike SatelliteFinder (de Sousa et al., 2023), the only comparable tool, SaPhARI can handle a broad range of input formats, including nucleotide FASTA files, GenBank flat files, and metagenomic assemblies. By eliminating the need for cumbersome preprocessing, SaPhARI streamlines analysis, making it ideal for both genomic and metagenomic research. Its user-friendly design allows customization of various parameters, enabling users to tailor their analyses to specific datasets and research goals.

At the core of SaPhARI’s functionality is its ability to identify families of satellite phages by detecting clusters of key proteins. Using customizable parameters, users can define criteria such as inter-protein distances, the exclusion of specific (forbidden) proteins from clusters, and minimum cluster sizes. SaPhARI’s flexible approach allows it to accurately group related proteins and discover novel satellite systems. The tool’s customizable clustering options also make it particularly suited for identifying evolving satellite families. Researchers can fine-tune the tool’s parameters to either focus on highly specific, narrowly defined families or explore more relaxed, broadly inclusive groupings. This empowers users to tailor their analyses to meet precise research goals, enabling the discovery and definition of new satellite families at any desired level of detail.
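
The clustering idea described above can be illustrated with a minimal Python sketch. This is not the Satellite class itself: the `Protein` record, the `find_clusters` function, and all parameter names are hypothetical, and the rule that a forbidden protein closes the running cluster is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    name: str   # assigned function, e.g. "integrase"
    start: int  # genomic start coordinate (bp)
    end: int    # genomic end coordinate (bp)


def find_clusters(proteins, key_names, forbidden, max_gap, min_size):
    """Return lists of key proteins that sit close together on one contig."""
    clusters, current = [], []

    def flush():
        # Keep the running cluster only if it is large enough.
        if len(current) >= min_size:
            clusters.append(list(current))
        current.clear()

    for p in sorted(proteins, key=lambda p: p.start):
        if p.name in forbidden:
            flush()  # assumption: a forbidden protein ends the cluster
        elif p.name in key_names:
            if current and p.start - current[-1].end > max_gap:
                flush()  # gap too large: start a new cluster
            current.append(p)
    flush()
    return clusters


# Illustrative coordinates only, not real annotation data.
demo = [
    Protein("integrase", 100, 1300),
    Protein("AlpA", 1500, 1700),
    Protein("primase", 2000, 4000),
    Protein("major capsid protein", 4200, 5200),
    Protein("integrase", 90000, 91200),
]
hits = find_clusters(demo, {"integrase", "AlpA", "primase"},
                     {"major capsid protein"}, max_gap=2000, min_size=3)
print([[p.name for p in c] for c in hits])
```

Loosening `max_gap` or lowering `min_size` gives the broader, more inclusive groupings mentioned above, while tightening them narrows the family definition.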

SaPhARI’s modular Nextflow-based pipeline integrates advanced bioinformatics tools, such as Prodigal for open reading frame (ORF) prediction and BLAST for protein function annotation (Buchfink et al., 2021; Camacho et al., 2009; Di Tommaso et al., 2017; Hyatt et al., 2010). Users can adjust settings like Prodigal’s ORF prediction criteria and BLAST’s e-value thresholds or coverage cutoffs directly within the Python execution file. This flexibility makes SaPhARI accessible to both novice and experienced bioinformaticians, whether running it on local machines or HPC clusters.

Supported by an up-to-date database of satellite phages, SaPhARI provides transparent reference data for hypothesis testing and wet lab experiments. Additionally, SaPhARI allows users to customize reference protein BLAST and DIAMOND databases, adapting to evolving research needs and making it a powerful tool for a wide range of genomic and metagenomic studies.

In practice, SaPhARI has been invaluable to our team in supporting the discovery of novel satellite phages in metagenomic data. We assembled, annotated, and analyzed 7,659 metagenomic contigs from 37 unique metagenomes derived from the NCBI Sequence Read Archive (SRA). Through this analysis, we identified 87 putative satellite protein clusters: 61 from the PICMI model, 21 from the cfPICI model, 2 from the P4like model, 2 from the PICI model, and 1 from the Phagelet model.

SaPhARI’s versatility extends beyond satellite discovery. It can be adapted for various bioinformatic applications, such as CRISPR system identification or other projects requiring protein cluster analysis. Whether using the comprehensive pipeline or the standalone Satellite Python class, SaPhARI offers broad utility for diverse scientific inquiries.

In summary, SaPhARI combines flexibility, scalability, and ease of use, offering researchers a powerful tool for uncovering novel satellite phages by clustering key proteins and revealing their context within genomic data. Its broad functionality and adaptability make it a valuable resource in satellite discovery and beyond.

Criteria for Software Tool Award

For usage and our complete documentation, please visit our GitLab repository!


William and Mary Software Tool


Compatibility with Synthetic Biology Standards


SaPhARI utilizes standard synthetic biology data formats such as nucleotide FASTA and GenBank flat files. It supports the annotation of nucleotide sequences and can also leverage GenBank CDS data to identify clusters of interest. Furthermore, SaPhARI enables the reanalysis of CDS translations, providing the ability to reassess protein functions with greater accuracy. 


Rigorous Validation through Experimental Work


SaPhARI has undergone extensive validation through experimental testing, proving its accuracy and reliability. Our Design-Build-Test section demonstrates how we identified protein clusters within genomic data. Moreover, we are actively experimenting with its capabilities by testing the software on metagenomic data, where it has already identified multiple putative satellite elements. 


Broad Utility Across Research Projects


SaPhARI's customizable framework can be adapted for a range of applications, enabling researchers to search for any specified protein clusters, such as CRISPR or Cas genes. Users can modify the reference database and adjust search parameters to tailor the tool for their specific needs. Detailed instructions on data formatting and tool adaptation can be found on our GitLab Software page, including how to convert a FASTA file into a custom reference database for more specialized searches.


Integration with External Tools and Pipelines


SaPhARI's Python and Nextflow-based architecture is built for flexibility. The Satellite Python class can be customized to search for clusters of any proteins, as long as the properly formatted input provides strand orientation, genomic location, and the full titles of proteins.

Built on Nextflow, a command-line workflow platform, SaPhARI allows easy customization of specific workflows and processes. For example, if additional genome annotation tools are required, they can be integrated into the extra annotations workflow by modifying the main.nf Nextflow file and creating a new process (Di Tommaso et al., 2017).

This flexible design enables SaPhARI to fit into existing research pipelines, allowing researchers to utilize the clustering algorithm for their specific needs. Our GitLab repository provides comprehensive guidance on input formatting for the Satellite class, as well as documentation on each Nextflow workflow and process, ensuring that the software can be easily incorporated into various projects.


User-Friendliness and Accessibility


The software is deployed in a single Python file, allowing users to easily adjust parameters, paths, and workflow configurations and to define satellite families. For those seeking a quick setup, default settings are provided to minimize the learning curve and maximize efficiency. Installation and operation are straightforward, supported by clear and well-structured documentation that guides users through every step. For more details, please visit our GitLab repository.


Documentation and Future Contributions


SaPhARI is developed with a long-term vision, providing extensive documentation for both users and future contributors. Alongside comprehensive code documentation and inline comments, we include a big-O time analysis of the key method in the Satellite Python class in section 7 of our Design-Build-Test documentation, which shows the worst-case time complexity of this method to be O(n²). Our commitment to open-source collaboration encourages future contributions, ensuring that SaPhARI continues to evolve and adapt to the needs of the scientific community.

Purpose


SatelliteFinder: The Sole Tool


Prior to the development of SaPhARI, SatelliteFinder was the sole software tool designed to identify satellites within bacterial genomes. It follows a two-step pipeline, utilizing the open-source program MacSyFinder for identifying clusters of satellite-associated proteins, followed by a custom post-processing step to ensure accurate detection of satellite elements. Users can run SatelliteFinder through Galaxy or with Docker (de Sousa et al., 2023).


Step 1: MacSyFinder Integration


MacSyFinder is used to search for gene clusters representing satellite elements. Genome data in multi-protein FASTA format is scanned against predefined models developed by the Institut Pasteur, which users cannot modify. These models specify gene presence, forbidden genes (like structural phage proteins), and the maximum distance between genes to identify satellites.

MacSyFinder uses HMMER to match input sequences to Hidden Markov Model profiles of satellite-associated genes. It then forms gene clusters based on proximity and evaluates them based on quorum requirements to select the most relevant satellite systems (Abby et al., 2014).

Step 2: Post-Processing Script


The post-processing script fixes two issues:
  1. Rescuing discarded clusters: Clusters excluded due to the presence of forbidden genes are reevaluated and rescued if they contain only one forbidden protein.
  2. Splitting merged satellites: Satellites that are too close together may be mistakenly merged. The script uses integrases as markers to split these clusters correctly.
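
The two rules can be illustrated with a short Python sketch. This is not SatelliteFinder's actual code: the function names, the representation of a cluster as an ordered list of gene names, and the choice to start a new piece at every integrase are assumptions made purely for illustration.

```python
def rescue(cluster, forbidden):
    """Return True if a discarded cluster holds exactly one forbidden protein."""
    return sum(name in forbidden for name in cluster) == 1


def split_at_integrases(cluster, marker="integrase"):
    """Start a new piece at every marker gene in an ordered gene list."""
    pieces, current = [], []
    for name in cluster:
        if name == marker and current:
            pieces.append(current)
            current = []
        current.append(name)
    if current:
        pieces.append(current)
    return pieces


# Illustrative gene orders only.
merged = ["integrase", "alpA", "primase", "integrase", "alpA", "terS"]
print(split_at_integrases(merged))
print(rescue(["integrase", "alpA", "capsid"], forbidden={"capsid", "tail"}))
```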

Limitations of SatelliteFinder


SatelliteFinder has identified many satellite-like elements within bacterial genomes and demonstrated its effectiveness in detecting known satellite systems. However, the tool has several limitations that affect its broader applicability:

  1. Limited Input File Support: SatelliteFinder is restricted to processing amino acid translations in protein FASTA format and does not support nucleotide FASTA or GenBank flat files. This limitation hinders its application in metagenomic searches and requires users to reformat their data to be compatible, creating additional challenges.

  2. Lack of Transparency in Model Creation and Training Data: The literature accompanying the tool does not provide clear documentation regarding the curated annotations used in the development of the satellite family models. While the supplemental material includes a list of known satellites the software successfully operates on, this list is relatively small and lacks sufficient context for users to fully understand the underlying training data.

  3. Dependence on Predefined Protein Families: SatelliteFinder relies on MacSyFinder, which employs researcher-defined profile Hidden Markov Models (HMMs) for protein families. This dependence is built into a rigid framework for the existing satellite families, preventing users from creating new models. Even if an advanced bioinformatician attempts to modify the software to accommodate new families, identifying novel satellite systems remains challenging due to the requirement for prior knowledge of specific protein family signatures. Crucially, SatelliteFinder is not easily equipped to identify new satellite families as they are discovered, which is vital for a tool of this nature. This lack of flexibility is compounded by the integrated post-processing script, which makes it harder for the tool to meet diverse and evolving research needs.

  4. Usability Challenges: The reliance on Docker and Galaxy as primary platforms for running SatelliteFinder poses challenges for users. Galaxy’s web environment prevents any model customization, while Docker can be particularly difficult for beginners to navigate. Additionally, the intricate output generated by SatelliteFinder necessitates a thorough understanding of the underlying literature for effective interpretation, which can overwhelm novice users. Consequently, the tool is better suited for intermediate and advanced bioinformaticians.

In summary, while SatelliteFinder is a valuable tool for analyzing pre-annotated datasets and known satellite systems, its limitations reduce its effectiveness for discovering novel satellites in real-world or metagenomic contexts. The restricted input formats, lack of transparency in the training data, and reliance on predefined protein families make it less ideal for large-scale exploratory research. SatelliteFinder is best suited for researchers with substantial expertise in satellite systems and a solid bioinformatics background, rather than for those aiming to conduct large-scale discovery of new satellites in diverse environments.


SaPhARI: The Next-Generation Tool


To address the limitations of SatelliteFinder, SaPhARI was developed as a next-generation tool, offering significant improvements in flexibility, usability, and functionality for satellite discovery.

  1. Expanded Input File Support: Unlike SatelliteFinder, which is restricted to processing protein FASTA files, SaPhARI supports a wide range of input formats. Researchers can analyze nucleotide FASTA files, identify open reading frames (ORFs) within contigs or whole genomes, and process GenBank flat files, including CDS product annotations and translations. This broader input compatibility eliminates the need for users to reformat their data, making SaPhARI more suitable for metagenomic and exploratory research.

  2. Transparent and Evolving Training Data: SaPhARI includes a comprehensive, continuously updated database of bacteriophage satellites, curated from the latest literature. This evolving database not only addresses the transparency issues found with SatelliteFinder by clearly documenting the data used to model satellite families, but also provides more detailed information that enhances users' understanding of satellite genomes and their potential applications. By giving researchers deeper insights into the training data, SaPhARI fosters collaboration and further research, allowing users to better understand the family models and contribute to the ongoing refinement of satellite discovery.

  3. Customization and Flexibility in Model Creation: One of SaPhARI’s key strengths is its flexibility in allowing users to define new satellite families. Unlike SatelliteFinder, which is constrained by predefined Hidden Markov Model profiles (HMMs) for protein families, SaPhARI replaces HMMER with BLAST, an approach that is more widely used and accepted by synthetic biologists. This transition enables the assignment of protein functions through a user-friendly, customizable database. SaPhARI allows users to customize protein cluster searches, define the number of proteins required to elicit hits, exclude specific proteins, and set parameters for inter-protein distances and the overall length of satellite families. This level of customization empowers researchers to explore novel satellite systems without the need for extensive prior knowledge of specific protein family signatures, making the tool more adaptable to diverse research needs.

  4. Improved Usability: SaPhARI addresses usability challenges by providing multiple execution options. Compatible with Unix-like operating systems, it can be run locally for small datasets or through shell scripts on high-performance computing (HPC) systems for larger analyses. Unlike SatelliteFinder’s reliance on Docker and Galaxy, which limits customization and presents hurdles for beginners, SaPhARI offers a more user-friendly interface for its many parameters. Additionally, every parameter is clearly documented in the code for easy understanding, and users can modify the tool to suit their specific research objectives.

Overall, SaPhARI not only builds on the strengths of SatelliteFinder but also overcomes its limitations. With expanded input support, greater flexibility in family model creation, and a more transparent, up-to-date training database, SaPhARI is a powerful tool for both satellite experts and researchers in synthetic biology and metagenomics looking to discover novel satellite systems and more.

Engineering of SaPhARI

Novel Satellite Phage Database Development:


Our research initially focused on understanding the genomic structure of satellite phages. Through extensive analysis of discovery tools like SatelliteFinder and a thorough review of the literature, we identified a significant gap: the absence of a centralized, publicly accessible database compiling satellite phages and their related families. Despite their abundance and potential in bioengineering, no such resource existed. This realization motivated us to develop the first-ever satellite phage database, meticulously curated and continuously updated with data sourced from the scientific literature.


Database Scope:


Our database encompasses six satellite phage families:
  • PICI (Phage Inducible Chromosomal Islands)
  • cf-PICI (Capsid-Forming Phage Inducible Chromosomal Islands)
  • PICMI (Phage Inducible Chromosomal Minimalist Islands)
  • PLE (PICI-like Elements)
  • P4-Like
  • Phagelets/Novel (Newly identified M. aichiense satellite family unique to William & Mary)

For a clearer understanding of the satellite families, click here!

For each family, the database compiles comprehensive information, including:
  • Satellite name and aliases
  • Host/target bacterial strain
  • Host or satellite accession number
  • Verification level
  • Claimed length
  • Start and stop locations
  • att site core and att gene
  • Accessory, virulence, and phage interference genes
  • Inducing/helper phages
  • Additional notes on each satellite

To streamline the data collection process, we developed an automated scraper for the National Center for Biotechnology Information (NCBI) that searches the PubMed database for papers containing relevant keywords. Using the Google Sheets API, the tool automatically integrates the retrieved data into a Google Sheet. This significantly simplifies satellite phage research by retrieving key metadata from PubMed, including a paper’s:

  • Title
  • PubMed ID (PMID)
  • Language
  • Journal
  • Year and Month of publication
  • PubMed Central ID (PMCID)

Once retrieved, the metadata is automatically uploaded to a designated Google Sheet, significantly reducing the need for manual data entry and ensuring researchers have access to up-to-date information. The Google Sheet is preformatted to work seamlessly with the provided code, automatically generating PubMed links and, when available, direct links to PubMed Central PDFs from the extracted data (Googleworkspace, 2023; Parnell et al., 2011).
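
As a rough illustration of the link-generation step, the sketch below turns one metadata record into a spreadsheet row with PubMed and PubMed Central links. It is written in Python rather than as a Google Sheets formula, and the `pubmed_row` function, the record keys, and the PMC PDF URL pattern are assumptions, not the team's actual code. The record values are placeholders, not a real paper.

```python
def pubmed_row(record):
    """Build one sheet row: the six metadata fields plus two generated links."""
    pmid = record["pmid"]
    pmcid = record.get("pmcid")  # not every paper is in PubMed Central
    return [
        record["title"], pmid, record["language"], record["journal"],
        record["year"], record["month"], pmcid or "",
        f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/",
        # Assumed URL pattern for the PMC PDF; left empty when no PMCID exists.
        f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/" if pmcid else "",
    ]


# Placeholder record for illustration only.
demo = {
    "title": "Example title", "pmid": "00000000", "language": "eng",
    "journal": "Example Journal", "year": "2024", "month": "Jan",
    "pmcid": None,
}
row = pubmed_row(demo)
print(row)
```

A row built this way could then be appended to the sheet through the Google Sheets API; that call is omitted here because it requires credentials.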


Database Format and Accessibility:


The database is hosted as a Google Sheet and is available in both .xlsx and .tsv formats for each family on our GitLab. This ensures accessibility and ease of use for all users. The provided .xlsx format allows the public to view the data effortlessly, while the .tsv files are particularly suitable for parsing and integration with Python and other analytical tools. Additionally, we are including satellite genomes in nucleotide FASTA format as reference data.
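
A minimal sketch of reading one family's .tsv file in Python is shown below. The column names (`name`, `host`, `claimed_length`) and the `load_family` helper are hypothetical stand-ins for the real headers, and the rows are placeholders, not database entries.

```python
import csv
import io

# Placeholder content standing in for a family .tsv file.
demo_tsv = (
    "name\thost\tclaimed_length\n"
    "satellite_A\thost_strain_1\t11000\n"
    "satellite_B\thost_strain_2\t\n"
)


def load_family(handle):
    """Read a tab-separated family table into a list of dictionaries."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        length = row["claimed_length"]
        # Empty cells become None instead of failing the integer conversion.
        row["claimed_length"] = int(length) if length else None
        rows.append(row)
    return rows


rows = load_family(io.StringIO(demo_tsv))
print(rows)
```

With a real file, `load_family(open(path, newline=""))` would take the place of the in-memory string.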




Supplementary Database Materials:


To streamline the use of our database, we offer additional tools and detailed instructions. Check out our Software GitLab for more information!


Relevance:


The Satellite Phage Database is a significant advancement in satellite phage research, providing a centralized and curated resource that greatly improves access to critical data. This enhanced accessibility streamlines the research process, fosters collaboration, and encourages innovation in bioengineering and synthetic biology.

The comprehensive dataset compiled within the database serves as the foundation for SaPhARI, our specialized tool for satellite phage detection and analysis. By leveraging this rich resource, SaPhARI enhances researchers' ability to analyze satellite phages with greater efficiency and understanding, ultimately driving progress in satellite phage discovery and its broader applications in scientific research.

Research on Existing Tools: 


Before developing a tool for satellite phage discovery, it is essential to understand the key tools used for prophage detection. We explored two leading tools, PHASTEST and DEPhT, which use different methods to identify bacteriophages.



Phage Discovery with PHASTEST


PHASTEST (PHAge Search Tool with Enhanced Sequence Translation) is designed for the rapid identification of prophages within bacterial genomes, plasmids, and metagenomes. It combines gene-finding algorithms with sequence similarity searches to locate prophage sequences efficiently.

Identification and Scoring Methods:

  • Prodigal and FragGeneScan: These tools detect protein-coding genes in bacterial genomes and in metagenomic datasets.
  • BLAST+ and DIAMOND BLAST: These tools identify sequence similarities between the analyzed genome and known bacterial or phage sequences. By comparing the genome against curated phage and bacterial protein databases, they quickly detect proteins likely to be part of a prophage.
  • tRNAscan-SE, Aragorn, and Barrnap: These tools detect tRNA, tmRNA, and rRNA genes in bacterial genomes. Identifying these regions can help localize prophages and provide genomic context.

Once potential prophage regions are detected, PHASTEST assigns a completeness score up to 150. These scores are calculated based on the proportion of known phage genes in the region, the region's size, and the presence of key "cornerstone" genes, which include phage structural proteins and genes involved in DNA regulation and lysis (Wishart et al., 2023; Zhou et al., 2011).


Strengths:

  • Speed: PHASTEST is 31% faster than its predecessor, completing whole-genome analysis in 1.3 minutes with pre-annotated files, thanks to its strong server infrastructure.
  • Comprehensive Genome Annotation: PHASTEST identifies prophages and provides complete genome annotations, including coding and non-coding regions like tRNAs, rRNAs, and tmRNAs, with functional assignments where possible.
  • Interactive Visualizations: The tool features an intuitive interface with interactive genomic maps, allowing researchers to easily explore prophage locations and their genomic context.



Phage Discovery with DEPhT


DEPhT (Discovery and Extraction of Phages Tool) focuses on precise prophage detection, particularly in bacteria with complex genomes like Mycobacterium. DEPhT offers three modes: screening, extraction, and annotation (Gauthier et al., 2022).


Identification and Scoring Methods:

  • Genomic Architectural Features: DEPhT distinguishes bacterial from phage sequences by analyzing genome architecture, identifying sharp transitions in gene content that mark the boundaries of prophage regions.
  • Homology Searches: DEPhT uses MMseqs2 and HHsuite3 to search for homology between the genome being analyzed and known phage genes, ensuring accurate detection.
  • Attachment Site Prediction: DEPhT predicts prophage attachment sites using BLAST searches, helping identify complete prophage sequences and allowing for more detailed downstream analysis.

Strengths:

  • High Precision: DEPhT is tailored for detecting prophages with high accuracy, especially in difficult-to-analyze genomes like Mycobacterium.
  • Versatility: With its three modes (screening, extraction, and annotation), DEPhT can adapt to different stages of analysis, offering flexibility for a variety of research needs.


Summary


PHASTEST and DEPhT are both powerful tools for identifying prophages, but PHASTEST stands out for its speed, ease of use, and flexible framework. These qualities make it particularly appealing to the William & Mary iGEM team, as it is well suited for large-scale, broad analyses across a variety of bacterial species and metagenomes. Its rapid annotation and phage identification, paired with interactive visualizations, create an accessible and efficient experience for researchers (Wishart et al., 2023).


We concluded that a satellite discovery tool could, like PHASTEST, benefit from:

  • Rapid screening of large datasets using tools like Prodigal to quickly detect coding regions in both genomic and metagenomic data, followed by BLAST-based searches to identify satellite-specific proteins.
  • Comprehensive genome annotation including tRNA, tmRNA, and rRNA, to provide additional genomic context.
  • Identification of satellite regions based on protein cluster completeness, aiding in satellite classification.

By building on PHASTEST's strengths, a similar yet more flexible tool for satellite phage discovery would streamline the process, offering researchers a fast, intuitive, and comprehensive platform for identifying these unique elements within bacterial ecosystems.


Design-Build-Test:


Final SaPhARI Pipeline: 

Summary

The SaPhARI pipeline was developed to provide a comprehensive, automated solution for the identification and classification of satellite families in bacteriophages. Leveraging a range of powerful bioinformatics tools, including Prodigal, DIAMOND BLAST, Aragorn, Barrnap, tRNAscan-SE, and custom Python and shell scripts, the pipeline offers researchers a high degree of flexibility and precision in annotating and clustering protein sequences. The final version allows users to seamlessly customize database searches, apply functional filters, and group proteins based on their target families, streamlining the discovery and characterization of satellite prophages.


Early Development and Tool Integration:


The initial phase of SaPhARI's development focused on integrating key tools for accurate protein annotation and functional grouping. We aimed to build a robust pipeline comparable to PHASTEST, capable of clustering proteins and offering detailed functional insights. The workflow was initially designed to process nucleotide sequences, filter out prophage regions, and pass the data through a sequence of specialized tools. The ultimate goal was to deliver high-quality gene annotations alongside functional protein clustering to facilitate downstream analysis.


The core tools integrated in this early version included:


  • Prodigal: A gene prediction tool used for identifying protein-coding regions from nucleotide sequences.
  • DEPhT: Initially incorporated for prophage detection but later removed from the pipeline to streamline processing.
  • DIAMOND BLAST: A high-speed protein alignment tool used for comparing sequences against large databases.
  • BLASTn: Facilitates nucleotide sequence alignments for DNA region comparisons.
  • Aragorn, tRNAscan-SE, and Barrnap: These specialized tools were incorporated for the detection of non-coding RNA elements such as tmRNA, tRNA, and rRNA genes, ensuring thorough gene annotation.

To manage the command-line interface (CLI) tools and optimize workflow efficiency, we utilized Nextflow as the core pipeline orchestration tool, enabling seamless integration and scalability (Buchfink et al., 2021; Camacho et al., 2009; Chan & Lowe, 2019; Di Tommaso et al., 2017; Gauthier et al., 2022; Hyatt et al., 2010; Laslett, 2004; Pruesse et al., 2012; Seemann, 2018).

Design-Build-Test Cycle: From Initial to Final Pipeline


The development of the SaPhARI pipeline was guided by an iterative Design-Build-Test cycle. Each component of the pipeline was tested and improved to achieve a robust system for bacteriophage satellite discovery. The following sections outline the design rationale, construction, and iterative testing of the pipeline's key tools and processes.


Design: Prodigal was chosen for its accuracy in predicting open reading frames (ORFs), making it a critical first step in identifying potential satellite prophage regions. The objective was to predict ORFs from nucleotide FASTA files of experimentally verified prophages, such as EcCIEDL933 and EcCICFT073 in our software experiments, while capturing essential annotations needed for downstream analysis.

Build: Prodigal was configured to accept nucleotide sequences and output either nucleotide or amino acid ORFs with comprehensive notations, including ORF number, strand direction, start and stop positions, and GC content. Additionally, metadata such as partial gene status and start codon types were retained to provide full context for each ORF.

Test: Early tests revealed that Prodigal’s raw output could not be directly utilized by downstream tools like BLAST+ due to inconsistent formatting in the headers. The outputs lacked essential notations, making it impossible to trace ORFs back to their genomic locations.

Refinement: We introduced a Nextflow process (formatHeaders) to clean Prodigal’s output, specifically reformatting headers to preserve ORF annotations. This ensured that all downstream tools could correctly parse ORF data and maintain accurate linkages between ORF predictions and genomic coordinates.
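
The kind of header clean-up this step performs can be sketched in Python. The input below follows Prodigal's standard FASTA header layout (ID, start, end, strand, then semicolon-separated attributes, all joined by " # "). The compact output format and both function names are assumptions for illustration, not the actual formatHeaders implementation.

```python
def parse_prodigal_header(header):
    """Split a Prodigal FASTA header into its coordinate and attribute fields."""
    orf_id, start, end, strand, attrs = header.lstrip(">").split(" # ")
    info = dict(item.split("=", 1) for item in attrs.strip().split(";") if item)
    return {
        "orf_id": orf_id,
        "start": int(start),
        "end": int(end),
        "strand": "+" if strand.strip() == "1" else "-",
        "partial": info.get("partial"),
        "start_type": info.get("start_type"),
        "gc_cont": float(info["gc_cont"]),
    }


def format_header(header):
    """Rewrite the header as one space-free token that keeps the coordinates."""
    f = parse_prodigal_header(header)
    return f">{f['orf_id']}|{f['start']}|{f['end']}|{f['strand']}|gc={f['gc_cont']}"


raw = (">contig_1_1 # 337 # 2799 # 1 # ID=1_1;partial=00;start_type=ATG;"
       "rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.531")
print(format_header(raw))
```

Keeping the whole annotation in a single token matters because alignment tools typically report only the text before the first space as the query ID.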


Design: BLASTn and DIAMOND BLAST were selected for their high-speed sequence alignment capabilities against curated databases (e.g., PHASTEST). The goal was to enable comprehensive functional annotation of predicted ORFs by identifying homologous proteins across viral, archaeal, and bacterial datasets, critical for satellite phage analysis.

Build: We configured BLAST+ with a customizable output format (outfmt 6), presenting key alignment details in a tabular structure. This format was selected to streamline downstream analysis, providing essential information such as alignment length, bit scores, e-values, and taxonomic classification in each line.

Test: During initial runs, BLAST produced large, complex outputs that were difficult to parse for biologically meaningful insights. The volume of data slowed downstream processes and overwhelmed manual interpretation efforts, particularly when dealing with large datasets.

Refinement: To optimize the output, we introduced user-configurable filters so that previously static parameters (e-value, percent identity, percent coverage, and the number of matches) can be adjusted. This enhancement allowed users to focus on the top hits and tailor the alignment depth, ensuring that only the most relevant homologs were included in the results. This reduced processing time and made the outputs more manageable and meaningful for further analysis.
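
A minimal sketch of such a filter over tabular output is given below, assuming the default twelve columns of outfmt 6. The `filter_hits` function and its thresholds are hypothetical. Percent coverage is left out because the default columns do not include it, and the alignment rows are invented for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-5, min_identity=30.0, max_hits=5):
    """Keep up to max_hits passing alignments per query, in file order."""
    kept = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        if float(row["evalue"]) > max_evalue or float(row["pident"]) < min_identity:
            continue
        hits = kept.setdefault(row["qseqid"], [])
        if len(hits) < max_hits:  # assumes hits are listed best-first per query
            hits.append(row)
    return kept


# Invented alignment rows for illustration.
demo = (
    "orf_1\tprotA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "orf_1\tprotB\t25.0\t180\t135\t2\t5\t184\t3\t182\t1e-8\t60\n"
    "orf_2\tprotC\t40.0\t150\t90\t1\t1\t150\t1\t150\t0.5\t30\n"
)
kept = filter_hits(io.StringIO(demo))
print({query: [h["sseqid"] for h in hits] for query, hits in kept.items()})
```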


Design: Functional annotation of proteins is a critical step in satellite prophage identification. Our objective was to automate the assignment of protein functions based on BLAST results while minimizing the inclusion of hypothetical proteins, which contribute limited biological insight in this context.

Build: A custom Python script (extract.py) was developed to parse BLAST outputs and assign functions to each ORF. The script works by selecting the majority function from the top BLAST hits, excluding hypothetical proteins where possible. If a consensus function is found, it is assigned to the ORF. If all matches are hypothetical, the ORF is flagged accordingly.

Test: Initial tests revealed that the script struggled with ambiguous BLAST results, especially when low-quality or irrelevant matches dominated the output. In cases where no clear majority function could be determined, the script either failed to assign a function or assigned inconsistent annotations.

Refinement: To improve accuracy, we introduced a scoring system that ranks protein functions based on the order of top BLAST hits. If no clear majority function exists, the highest-scoring match is selected. In cases where all hits are hypothetical, the function is explicitly labeled as hypothetical, distinguishing these proteins from poorly characterized ones. This refinement significantly improved the consistency and completeness of functional annotations, making the results more reliable for downstream analysis.
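
The majority-then-ranking logic can be sketched as follows. This is an illustration, not extract.py itself: the `assign_function` name, the strict-majority test, and the rank weighting (earlier hits earn more points) are assumptions about details the text above does not specify.

```python
from collections import Counter


def assign_function(hit_titles):
    """Pick one function from the top BLAST hit titles, listed best hit first."""
    if not hit_titles:
        return None  # no hits at all
    informative = [t for t in hit_titles if "hypothetical" not in t.lower()]
    if not informative:
        return "hypothetical protein"  # every hit was hypothetical
    counts = Counter(informative)
    best, n = counts.most_common(1)[0]
    if n > len(informative) / 2:
        return best  # clear majority among informative hits
    # Fallback (assumed weighting): earlier hits contribute more points.
    scores = Counter()
    for rank, title in enumerate(hit_titles):
        if title in counts:
            scores[title] += len(hit_titles) - rank
    return max(scores, key=scores.get)


print(assign_function(["integrase", "hypothetical protein", "integrase"]))
print(assign_function(["AlpA", "integrase", "primase"]))
print(assign_function(["hypothetical protein", "hypothetical protein"]))
```

The three calls exercise the majority case, the ranking fallback, and the all-hypothetical case in turn.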


Design: In designing our pipeline, we aimed to leverage existing public databases for protein detection, expecting that PHASTEST’s focus on viral and bacterial proteins would efficiently identify relevant homologs, while NR’s comprehensive scope would ensure broader protein coverage. Our goal was to enable accurate detection of satellite prophage proteins without prematurely building a custom database.

Build: We implemented PHASTEST and NR as our primary databases for protein detection, integrating them into our pipeline for sequence alignment and functional annotation using BLASTn and DIAMOND BLAST. PHASTEST was chosen for its specificity, while NR provided extensive coverage of non-redundant proteins.

Test: During the initial tests with satellites EcCIEDL933 and EcCICFT073, we evaluated PHASTEST and NR separately. PHASTEST demonstrated faster alignment but failed to capture key satellite proteins such as AlpA, leading to detection gaps. In contrast, NR provided better protein coverage but proved computationally expensive, significantly slowing down the pipeline due to its large size.

Refinement: To address these issues, we developed a custom database incorporating the non-redundant bacterial, viral, and archaeal protein sequences from RefSeq, as well as a curated set of satellite-specific proteins from PLEs. Additionally, we refined our analysis by shifting the focus to benchmarking with DIAMOND BLAST, which provides significantly faster alignment speeds while maintaining high accuracy. BLASTn is still available as an option in SaPhARI, but we prioritize DIAMOND for its efficiency. This shift, combined with our custom database, resulted in reduced processing times and improved detection accuracy for key satellite proteins, ultimately enhancing both performance and coverage.


Design: DEPhT, a prophage detection tool, was integrated into the pipeline to screen genomes for potential prophage elements as a pre-screening step before satellite detection to prevent false positive detection. The objective was to assess DEPhT's effectiveness in detecting wide host-range prophages.

Build: A Nextflow process was developed to input genome sequences into DEPhT, enabling the collection of predictions regarding prophage regions. These predictions were then integrated into the broader SaPhARI workflow for subsequent satellite analysis.

Test: Although DEPhT yielded reliable results for certain prophages, challenges arose in managing and training the tool on diverse bacterial strains, leading to inconsistent performance.

Refinement: Given its performance limitations, DEPhT was ultimately removed from the pipeline in favor of direct satellite protein clustering utilizing BLAST and Prodigal outputs. This decision was made to allow researchers greater flexibility in identifying satellite prophages without the constraints imposed by a pre-screening process in the pipeline, which primarily focused on excluding prophage sequences rather than aiding in their identification.


Design: Non-coding RNA elements, including tRNAs, rRNAs, and tmRNAs, play crucial roles in prophage biology and serve as important indicators of satellite elements. The objective was to integrate specialized tools—Aragorn for tmRNA detection, Barrnap for rRNA, and tRNAscan-SE for highly accurate tRNA prediction—to effectively annotate these non-coding elements.

Build: Separate Nextflow processes were developed for Aragorn, Barrnap, and tRNAscan-SE, allowing adjustable parameters tailored to specific detection needs. Score thresholds were established for tRNAscan-SE, while e-value, rejection, and length cutoffs were defined for Barrnap. 

Test: Initial tests demonstrated successful detection of non-coding RNA elements, with parameters adjustable by users to optimize detection based on specific research needs. This flexibility allowed for enhanced accuracy in identifying non-coding elements within the genomic context.


Design: To classify satellite prophages based on genomic protein clusters, we developed a Python class titled Satellite. This class allows for flexible grouping of proteins, configurable aliases (e.g., treating 'major capsid' and 'major head' as synonyms), and customizable thresholds for the number of proteins required to elicit a match. Protein searching is performed through string-based matching of the protein titles from the extract.py outfile, which we opted for as it is easier for users to work with. By treating the protein titles as strings, the system can easily identify clusters based on similar or synonymous protein names. This approach also helps account for convergent evolution in proteins, where similar functions may arise independently, allowing for more flexible classification. The goal was to create a user-friendly framework for classifying satellites into distinct families (Cock et al., 2009).
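As a rough illustration of string-based matching with aliases, the sketch below maps a free-text protein title to a canonical core-protein name. The function name `canonical_name` and the layout of the alias table are assumptions, not the Satellite class's real interface; only the 'major capsid' / 'major head' pairing comes from the text above.

```python
def canonical_name(title, aliases):
    """Return the canonical core-protein name matched by a title, or None.

    aliases maps each canonical name to a tuple of synonyms; matching is a
    case-insensitive substring test (an illustrative assumption).
    """
    lowered = title.lower()
    for canonical, synonyms in aliases.items():
        if any(term in lowered for term in (canonical, *synonyms)):
            return canonical
    return None


# Hypothetical alias table for demonstration.
ALIASES = {
    "major capsid": ("major head",),
    "integrase": (),
}

print(canonical_name("Major head protein [Escherichia coli]", ALIASES))
print(canonical_name("tail fiber protein", ALIASES))
```

Substring matching keeps the configuration readable for users, at the cost of occasionally matching titles that merely contain a core-protein name.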

Build: The class was built to manage unordered protein sets, with customization through aliases. Key parameters, which include minimum protein thresholds for family assignment, exclusion of forbidden proteins, inter-protein distance limits, and maximum satellite length, were incorporated to boost flexibility.

Test: Initial implementations encountered issues where multiple sets containing the same proteins were detected, leading to inconsistent outputs based solely on variations in size.

Refinement: To resolve this, the class was restructured to enhance alias matching and ensure the identification of the largest unique protein cluster. Extensive unit tests were implemented to validate the logic for grouping proteins and ensure accurate family classification across various input conditions. To further verify our approach, we used nucleotide FASTA genomes from individual satellite families in our novel database to test whether SaPhARI could successfully identify satellites. Additionally, we conducted negative control tests to ensure SaPhARI did not falsely identify satellites in non-satellite containing genomes, further confirming its accuracy.

Performance Evaluation: The find_it() method in the Satellite Python class is designed to identify and extract specific protein regions from an annotated SaPhARI file based on parameters defined for a satellite family. The method processes the file by reading all lines and extracting protein names along with their respective positions. A nested function checks for forbidden proteins while iterating through the lines to collect potential protein regions within the specified maximum length. The algorithm keeps track of distinct proteins found and ensures uniqueness of regions by avoiding subsets of previously identified regions. Additionally, it incorporates functionality to include flanking genes around each identified region, thus providing a comprehensive output. The overall time complexity of this approach is O(n²) in the worst case, attributed to the nested loops iterating through the lines; however, practical performance may vary depending on the content of the input genome.
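The nested-loop search described above can be sketched as follows. This is a simplified stand-in rather than the actual find_it() method: the function name `find_regions`, the tuple layout, and the example gene names and coordinates are assumptions, and the inter-protein distance limit is omitted for brevity.

```python
def find_regions(genes, core, forbidden, min_proteins, max_length):
    """Return (first_index, last_index) pairs of candidate satellite regions.

    genes: list of (name, start, end) tuples sorted by start coordinate.
    """
    regions = []
    for i, (name_i, start_i, _end_i) in enumerate(genes):
        if name_i not in core:
            continue  # a region must begin at a core protein
        found, last = set(), None
        for j in range(i, len(genes)):
            name_j, _start_j, end_j = genes[j]
            # Stop extending at a forbidden protein or past the length limit.
            if name_j in forbidden or end_j - start_i > max_length:
                break
            if name_j in core:
                found.add(name_j)
                if len(found) >= min_proteins:
                    last = j  # region ends at the latest qualifying core protein
        if last is None:
            continue
        # Skip regions fully contained in one that was already reported.
        if any(a <= i and last <= b for a, b in regions):
            continue
        regions.append((i, last))
    return regions


# Hypothetical annotated genes for demonstration.
GENES = [
    ("hypothetical protein", 100, 500),
    ("integrase", 600, 1500),
    ("hypothetical protein", 1600, 2000),
    ("AlpA", 2100, 2400),
    ("major capsid", 2500, 3500),
    ("terminase", 9000, 10500),
    ("integrase", 11000, 12000),
]
CORE = {"integrase", "AlpA", "major capsid"}

print(find_regions(GENES, CORE, {"terminase"}, min_proteins=3, max_length=5000))
```

The outer loop over start positions and the inner extension loop give the quadratic worst case noted above; in practice the length limit ends most inner loops after a handful of genes.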


Design: Simplicity in deployment was a top priority, aiming to allow users to run the entire SaPhARI pipeline with minimal configuration, while still providing options for customization.

Build: A Python script was developed as the primary interface for SaPhARI. This script captures user input for the many valuable parameters, such as database selection, e-value thresholds, and output location, and passes them to shell scripts for execution either locally or on an HPC system. The Python script automates the setup and execution of the pipeline, enabling users to customize parameters without needing to modify the underlying shell scripts directly.

Test: Early feedback from users indicated that while the pipeline functioned as intended, new users often found the setup process overwhelming, particularly those unfamiliar with Python environments or manual configuration.

Refinement: To enhance usability, we introduced a configuration template that walks users through setting key parameters and choosing workflows. The Python script was also updated with pre-configured defaults for common use cases, allowing novice users to run the pipeline with minimal setup. Advanced users still have the flexibility to fine-tune parameters directly within the template as needed.


Design: Initially, the pipeline was designed to output only the identified regions containing core proteins, with the goal of providing users with a concise summary of detected satellite families.

Build: The output structure focused on providing clear results for each detected satellite region. Users could define core proteins and set thresholds for family classification. The output was generated in .txt format, summarizing the identified regions containing the specified core proteins.

Test: During testing, it became evident that focusing solely on the identified regions provided insufficient genomic context. Users found it challenging to interpret how the detected satellite regions fit into the broader genomic landscape, especially when analyzing large genomes with complex prophage structures.

Refinement: To address this, we enhanced the Satellite class output by including the five flanking genes both upstream and downstream of each identified region. This provided valuable genomic context around the core protein regions, making it easier to understand the surrounding genomic structure. The refined output remains in .txt format, but now includes detailed context to offer a more complete view of each detected satellite region.
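Adding flanking genes amounts to widening an index range while staying inside the contig. A minimal sketch, assuming genes are held in an ordered list and a region is given by its first and last gene index (the name `with_flanks` is hypothetical):

```python
def with_flanks(genes, first, last, n_flank=5):
    """Return the region genes[first..last] plus up to n_flank genes per side."""
    lo = max(0, first - n_flank)                 # clamp at the contig start
    hi = min(len(genes), last + 1 + n_flank)     # clamp at the contig end
    return genes[lo:hi]


genes = list(range(20))  # stand-in for 20 ordered gene records
print(with_flanks(genes, 7, 9))
```

Clamping at both ends means regions near a contig edge simply receive fewer flanking genes rather than raising an error.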


Analysis: 


The iterative Design-Build-Test cycle led to the evolution of SaPhARI into a powerful and flexible tool for the discovery and classification of satellite prophages. By integrating tools such as Prodigal and BLAST, and refining the pipeline through continuous feedback and testing, SaPhARI has become a robust system that balances precision and performance. Each step, from open reading frame prediction to functional annotation and satellite classification, was meticulously designed to address the unique challenges of large-scale genomic analysis.

The incorporation of a custom database, dynamic filtering, and enhanced context through flanking gene inclusion ensures that researchers are provided with biologically relevant insights in a manageable format. With its streamlined deployment and high degree of customization, SaPhARI is now well-equipped to handle complex datasets, offering valuable genomic context and improving the accuracy of satellite prophage detection.

The final version of the SaPhARI pipeline is a versatile, scalable, and user-friendly platform that empowers researchers to uncover novel or known satellites and analyze clusters of proteins in diverse biological contexts, deepening our understanding of bacteriophage satellite biology across diverse bacterial hosts and metagenomic datasets.

Real World Application: Metagenomics


Metagenomics involves the collection, DNA extraction, and sequencing of diverse environmental samples, providing a broad snapshot of microbial communities within a given environment rather than focusing on a single species. One key approach, shotgun metagenomics, randomly fragments DNA and sequences it in 150 bp increments, offering the potential to assemble complete genomes or large DNA fragments (Liu et al., 2022). This method enables researchers to explore entire microbial ecosystems and discover novel genetic elements without being limited to specific regions of the genome.

Metagenomic samples come from a variety of environments, including wastewater, soil, and animal guts. Many studies use metagenomic sequencing to understand microbial dynamics in natural settings, providing an immense amount of genetic information for researchers to analyze. Mining these datasets is crucial to our project, as our goal is to categorize and locate the diverse and ubiquitous families of satellite phages. Leveraging metagenomic data serves as the capstone for the bioinformatics portion of our project, bridging the gap between software tool development and its application in real-world environments.


SaPhARI Metagenomic Integration


As we developed SaPhARI, we simultaneously created a pipeline designed for assembling and preparing metagenomic data for analysis and annotation using open source software. This allowed us to create a uniform pipeline beginning with raw reads and ending with the identification of putative satellites using the software. The metagenomic SaPhARI pipeline includes a series of shell scripts optimized for SLURM-based HPC systems, leveraging open source software evaluated for optimal computational efficiency. Described below are the steps taken within this pipeline to prepare metagenomic data for analysis with SaPhARI. 


Assembly Pipeline:


  1. Input: Create a .txt file with accession numbers to download samples from the NCBI Sequence Read Archive (SRA) (U.S. Department of Health and Human Services, 2023). The file should have one accession number per line. 

  2. Sequence Download: Use fasterq-dump to download raw reads from SRA and separate them into R1 and R2 raw FASTQ files, which prevents the creation of interleaved FASTQ files that are not optimal for subsequent processing (U.S. Department of Health and Human Services, 2023).

  3. Quality Control: Perform an initial evaluation of the raw FASTQ files using FastQC to assess metrics such as base quality scores, adapter content, and duplication levels. Then trim the reads with trim_galore to remove low-quality bases and adapter sequences, ensuring high-quality input for downstream assembly, and re-evaluate the trimmed reads to confirm the effectiveness of the trimming (Babraham Institute, 2023).

  4. Assembly: Assemble the trimmed reads using MEGAHIT with --min-contig-len 5000, ensuring the assembly of contigs with sufficient length for satellite discovery (Li et al., 2015).

  5. Sorting: Use a custom Python script built on the Biopython package to separate each multi-FASTA assembly output into individual contigs and sort them into folders based on size (Cock et al., 2009). This allows optimization of the SaPhARI algorithm, which prefers contigs over 20,000 bp.

The assembly pipeline outputs a parent directory containing subdirectories named by SRA accession number, with each subdirectory holding contigs organized by size. The folder named large_contigs is optimized for SaPhARI’s traditional nucleotide annotation workflow, while the folder named small_contigs is created for SaPhARI’s metagenomic nucleotide annotation workflow, which allows for the annotation of contigs under 20,000bp. 
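The sorting step can be sketched in a few lines. The pipeline's own script uses Biopython; to stay self-contained, this illustration substitutes a small standard-library FASTA parser, and the function names and in-memory return format are assumptions. Only the 20,000 bp cutoff and the folder names large_contigs and small_contigs come from the description above.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_contigs(text, threshold=20_000):
    """Bin contigs by length: at or above threshold is large, below is small."""
    bins = {"large_contigs": [], "small_contigs": []}
    for header, seq in parse_fasta(text):
        key = "large_contigs" if len(seq) >= threshold else "small_contigs"
        bins[key].append((header, seq))
    return bins


# Tiny demonstration with an artificially low threshold.
demo = ">contig_a\nACGT\nAC\n>contig_b\nAAAAAAAAAA\n"
result = sort_contigs(demo, threshold=8)
print({name: [h for h, _ in items] for name, items in result.items()})
```

In a real run each bin would be written out as one FASTA file per contig inside the matching folder for that SRA accession.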


Performance:


We have processed and annotated 7,659 contigs to date, with more being analyzed daily, yielding highly promising results for identifying satellite phages and achieving our goal of conducting a broad genomic search. So far, 37 out of 150 assembled metagenomes have been processed, with 87 putative satellites identified—a number that continues to grow each day. These results demonstrate the software's exceptional potential in successfully detecting satellite protein clusters.

For further validation, we isolated the contigs containing suspected satellites from each run and cross-referenced them using both PHASTEST and BLAST+ to check for similarities to known phage, bacterial, or satellite sequences (Camacho et al., 2009; Wishart et al., 2023). This analysis revealed no similarity to any known phage, suggesting that we may be characterizing entirely novel satellite phages.

Metagenomic assembly generates an immense amount of genetic data, often resulting in thousands of contigs per sample and requiring extensive computation time to process. Depending on the sample size and reference database, it can take anywhere from a few hours to several days for SaPhARI to analyze all contigs from a single metagenome. To reduce the computational burden, we minimized our reference database by including only the RefSeq non-redundant bacterial, viral, archaeal proteins, and PLE proteins. Although this change significantly lowered computation time compared to the full NCBI Non-Redundant database, the load remains substantial and continues to present a major challenge. Despite this, we are committed to our ongoing in-depth analysis of metagenomic data, leveraging parallel software instances and the robust computational resources available through the High Performance Computing (HPC) cluster at William & Mary to further advance satellite phage discovery.


Evaluation of Current Metagenomic Software



Prior to the development of SaPhARI, we explored various metagenomic classification tools to identify phages in environmental samples. Our aim was to adapt these methods for satellite phage detection after constructing our novel database of characterized satellites. Early on, we encountered limitations in utilizing metagenomic data due to the dependence of modern classification software on reference databases. Nevertheless, we sought to assess the sensitivity of these tools in detecting viral sequences, in the hopes of eventually leveraging them for satellite identification.

To benchmark different taxonomic classification and profiling software, we used metagenomic data from the soil microcosm experiments conducted by William & Mary iGEM 2023. This dataset was selected due to its well-documented experimental design and expected outcomes, such as increased populations of M. smegmatis and Mycobacterium phage Kampy. By using these tools to detect Mycobacterium phage Kampy, we were able to evaluate their potential utility for satellite detection once a robust satellite database had been established.

Before selecting specific tools, we conducted a review of the latest metagenomic taxonomic identification software. Two papers in particular, both from the Critical Assessment of Metagenome Interpretation (CAMI) initiative, report comprehensive benchmark studies from 2017 and 2022 that assessed the precision and recall of various metagenomic classification tools (Meyer et al., 2022; Sczyrba et al., 2017). In addition, we consulted with faculty experts at William & Mary, who recommended the use of k-mer-based alignment software due to its efficiency in reducing computational overhead compared to traditional nucleotide alignment strategies, such as BLAST. K-mer-based methods break each sequence into all of its overlapping subsequences of a fixed length k (k-mers) and match these against a reference, offering a computationally efficient solution for large-scale metagenomic analyses.
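To make the k-mer idea concrete, the toy sketch below indexes the k-mers of a few reference sequences and labels a read by counting exact k-mer matches. It is a deliberately simplified illustration, not the algorithm of any specific classifier; the function names and the sequences are invented for the example.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(label)
    return index


def classify(read, index, k):
    """Label a read by the reference with the most k-mer matches, or None."""
    votes = Counter()
    for km in kmers(read, k):
        for label in index.get(km, ()):
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


refs = {"phage": "ACGTACGGA", "host": "TTTTGGGCC"}  # made-up sequences
index = build_index(refs, k=4)
print(classify("GTACGG", index, k=4))
print(classify("AAAAAA", index, k=4))
```

Each lookup is a constant-time dictionary access, which is why k-mer matching scales better than pairwise alignment, and also why a read with no indexed k-mers stays unclassified.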


Tools Evaluated:


Kraken2/Bracken:


Kraken2 assigns taxonomic labels by comparing k-mers from sequences to a reference database, while Bracken refines these classifications by estimating the relative abundance of each taxon. The key difference lies in their focus: Kraken2 is primarily concerned with identifying taxa, while Bracken quantifies their abundance within the dataset. Notably, Kraken2’s confidence thresholds significantly influence its output. Higher precision settings tended to overestimate the percentage of Mycobacterium smegmatis within our iGEM 2023 soil metagenomes, while lower or default confidence levels improved recall but resulted in lower abundance estimates. Despite these adjustments, Kraken2 still struggled to classify the majority of reads, with 70% remaining unclassified at default confidence levels, and up to 99% unclassified at higher confidence settings. Moreover, Kraken2 typically classified sequences only to the lowest taxonomic level it could confidently identify, which was rarely species-specific (Lu et al., 2017; Wood et al., 2019).

Additionally, Kraken2’s memory usage increases substantially with larger databases. Although it is marketed as a compact tool, using the full NCBI Nucleotide database required 1TB of RAM to fit the hash table structure required for classification, which we fortunately had access to via our high-performance computing (HPC) system. We also explored memory-mapping the database to avoid fully loading it onto the computing node, but this dramatically slowed down classification, extending processing times to several days. Ultimately, we had to reduce our database to the RefSeq Archaea, Bacterial, and Viral dataset, which considerably lowered computational demands.

Kraken2 also had difficulty distinguishing phages due to their genetic similarities and the limited representation of phage sequences within our reads. For example, the Kampy phage appeared in very low numbers, while the genus Backyardiganvirus had a 100-fold higher representation. Interestingly, when we created a custom database containing only the Kampy phage, the classification numbers for both Kampy and Backyardiganvirus were almost identical. This suggests Kraken2 struggles to differentiate between species within the same genus, particularly with closely related phages. These limitations indicate that Kraken2 may face significant challenges in distinguishing small, highly similar satellite phage sequences.


Metalign:


Metalign utilizes a two-step algorithm, beginning with a pre-filtering step based on the Jaccard index, which measures the similarity between the k-mers in the query sequences and those in the reference database. This pre-filtering process selects a subset of the database that shares significant k-mer overlap with the sample, thereby reducing the search space and computational load. However, one drawback is that Metalign discarded 80% of the available k-mers in our sequencing data, limiting the amount of genomic information processed. Additionally, Metalign requires that read FASTQs be merged with external software such as FLASH or BBMerge before running, which can exacerbate the loss of reads during processing (Bushell et al., 2017; LaPierre et al., 2020; Magoč et al., 2011).
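The Jaccard index itself is simply the size of the intersection of two sets divided by the size of their union. The sketch below computes it exactly on small k-mer sets; it illustrates the measure only and is not Metalign's implementation, and the function names and sequences are assumptions.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|, defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


sample = kmer_set("ACGTT", 3)     # {ACG, CGT, GTT}
reference = kmer_set("CGTTA", 3)  # {CGT, GTT, TTA}
print(jaccard(sample, reference))
```

A pre-filter of this kind would keep only the reference genomes whose similarity to the sample exceeds some cutoff before the more expensive alignment step.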

After establishing the pre-filtered database, Metalign aligns the reads to the filtered reference genomes and estimates the relative abundance of each detected taxon. As a taxonomic profiler rather than a classifier, Metalign focuses on providing an overview of microbial community composition and relative abundances, rather than precisely identifying the organism associated with each read within a dataset. While this broad profiling offered insights into community-level dynamics, Metalign struggled with organisms not represented in its reference database, which could not be updated by the user (Liu et al., 2022).

Phage detection also proved challenging for Metalign, as phages tend to exhibit high genetic variability, making it difficult to accurately profile them using short reads. In our samples, Metalign identified only a single Mycobacterium phage, which was not the expected phage Kampy from our experimental data. Moreover, since Metalign's database cannot be updated, it lacks the flexibility to identify newly discovered satellites, making it less suitable for satellite identification.


Centrifuge:


Centrifuge, like Kraken2, rapidly assigns taxa to metagenomic reads by matching them against an index built from reference sequences. Kraken2 looks up exact k-mers, whereas Centrifuge searches an FM-index built on the Burrows-Wheeler transform; in both cases the pre-built index enables quick matching of read subsequences to the taxa that contain them. These algorithms are designed for efficiency and speed, outperforming traditional nucleotide alignment tools like BLAST by leveraging pre-built databases to reduce the computational load during classification (Kim et al., 2016).

While Centrifuge provided results comparable to Kraken2, it lacked the ability to adjust confidence levels, limiting user flexibility. Additionally, Centrifuge required significantly more computational resources and time to index the database compared to Kraken2. Installation through a conda environment took several hours, and indexing of the RefSeq Archaea, Bacterial, and Viral database took several days. Although both tools offered similar outcomes, Kraken2's adjustable confidence thresholds and more efficient database construction process give it an advantage as a taxonomic classifier.


mOTUs:

mOTUs is a marker gene analysis profiler that relies on an internal database of species-specific indicative genes. We tested its ability to detect Mycobacterium smegmatis, but it was ineffective, classifying less than 1% of the sequencing reads. Overall, this tool proved inadequate for detecting satellite phages in metagenomic data and struggled even to classify bacteria (Ruscheweyh et al., 2021).


Lessons Learned:


The performance of taxonomic classifiers is highly dependent on the confidence thresholds set during analysis. While these tools can help detect specific constructs in metagenomic reads, achieving precise results is challenging—especially when a large portion of reads remain unclassified. Additionally, relying on short genomic reads can be limiting, as they often lack distinct, informative regions. Another key limitation is that these classifiers can only identify satellite phages present in the reference database, restricting their ability to detect novel sequences. Ideally, these tools would allow us to identify satellite constructs within metagenomic data, as we have a clear target in mind for our analysis.

Though taxonomic classifiers like Kraken2 and Centrifuge are useful for estimating the microbial species present in a sample, they often fall short in reliably quantifying species-level data, especially for small or uncharacterized constructs like satellite phages. These tools could be more effective in identifying engineered constructs if we focused on assembling contigs rather than relying on sequencing reads, a more reliable approach for satellite phage discovery through metagenomics. 

These challenges highlight the need for specialized tools like SaPhARI. Unlike traditional phage identification tools that are confined to known reference databases, SaPhARI allows for the discovery of novel satellite phages by going beyond the constraints of existing knowledge. After beginning development of SaPhARI, we shifted our metagenomics strategy to complement this novel tool, which is better suited for identifying satellites.



Assembly in metagenomics involves using short DNA reads obtained from sequencing to reconstruct longer, contiguous sequences known as contigs. These contigs are formed by aligning overlapping sequences, providing a clearer and more comprehensive representation of the metagenome. A contig is a continuous stretch of DNA pieced together from shorter fragments, which alone are often too brief to represent entire genes or genomic regions. Assembling these reads into contigs enables researchers to analyze the genetic material of microbial communities more effectively (Wang et al., 2019).


One common approach to assembly is through the use of de Bruijn graphs, which are mathematical structures that map shared k-mers (short subsequences of nucleotides of a specified length) across different reads. In a de Bruijn graph, each distinct k-mer is represented as a node, and edges connect nodes whose k-mers overlap. This method allows researchers to trace paths that represent potential contiguous sequences, facilitating genome reconstruction from fragmented reads (Compeau et al., 2011). However, building and analyzing de Bruijn graphs is computationally intensive, as it requires processing vast amounts of sequencing data and managing complex overlaps between k-mers in large datasets.
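A toy version of this idea is sketched below, following the description above in which each node is a k-mer and an edge joins two k-mers when the last k-1 bases of one equal the first k-1 bases of the other. Real assemblers use far more compact structures plus error correction; the function names and reads here are invented for the example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a k-mer overlap graph: node -> sorted list of successor nodes."""
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    # A successor is any node whose (k-1)-prefix equals this node's (k-1)-suffix.
    return {node: sorted(by_prefix.get(node[1:], [])) for node in nodes}


def walk(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        nxt = graph[node][0]
        if nxt in seen:
            break  # stop on cycles
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ACGTG", "GTGCA", "GCATT"]  # made-up overlapping reads
graph = de_bruijn(reads, k=3)
print(walk(graph, "ACG"))
```

The walk stops at branches (more than one successor) and dead ends, which is where repeats and low coverage break real assemblies into separate contigs.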

As mentioned earlier, existing metagenomic classification tools had significant limitations when it came to satellite phage discovery, particularly their reliance on reference databases and their limited taxonomic resolution on short-read sequencing data, which restricted the identification of novel satellite sequences. These constraints underscored the need for a de novo assembly approach, which focuses on reconstructing longer sequences from raw metagenomic data so that novel satellite phages and other related entities can be identified using our custom software, SaPhARI.

However, while assembly holds promise, it is a complex and error-prone process. A key challenge in metagenomic assembly is the risk of misassemblies, which can lead to uncertainty about the accuracy of the resulting contigs. Metagenomic samples often contain highly conserved sequences shared across species, making it possible to misclassify sequences from one species as belonging to another. Additionally, repetitive or overlapping sequences may result in the assembly of artificially longer constructs than what is actually present in the sample, creating a misleading picture of the genomic content. The complexity of microbial communities further compounds this issue, as low-abundance species may not be fully represented, leading to incomplete assemblies. Moreover, metagenomic assemblies rarely result in full genomes; instead, researchers often obtain fragmented sequences that do not overlap well. As such, the reliability of assembled contigs must be carefully evaluated, as the potential for misassemblies introduces a degree of uncertainty into the analysis (Wang et al., 2019). Advances in bioinformatic software seek to minimize these potential errors, and the field of metagenomics continues to improve the quality of assemblies with the development of novel algorithms and error-correcting capabilities. While these tools have their limitations, they can still provide crucial insights into microbial life in diverse environments and allow for the discovery and documentation of novel species (Meyer et al., 2022; Sczyrba et al., 2017).


Literature Review:


We evaluated several short-read metagenomic assemblers, applying the same benchmarking studies and algorithms previously used for taxonomic classification software. Our goal was to determine the most effective tool for assembling metagenomes to be used with SaPhARI. Based on our analysis, we arrived at the following conclusions:


MEGAHIT:


MEGAHIT is a highly efficient assembler specifically designed for metagenomic data. It utilizes succinct de Bruijn graphs, a memory-optimized variant of traditional de Bruijn graphs that represent overlapping k-mers between reads, enabling the assembly of short reads into longer contiguous sequences. By using this compressed structure, MEGAHIT efficiently handles large datasets with lower memory requirements, making it well-suited for high-throughput metagenomic sequencing. Its key advantages are speed and scalability, making it ideal for large-scale metagenomic projects, although the quality of the resulting assemblies can vary compared to other assemblers (Li et al., 2015).


metaSPAdes:


metaSPAdes is a slower and more resource-intensive assembler than MEGAHIT. It is specifically tailored for metagenomic data and built on the widely used SPAdes algorithm for single-genome assembly. Like MEGAHIT, metaSPAdes employs de Bruijn graphs, but it enhances the assembly process by incorporating advanced strategies such as read error correction and paired-end read utilization to improve both accuracy and assembly quality. While it requires more computational resources and time, metaSPAdes often delivers comparable or superior assembly quality, particularly for complex or low-abundance microbial communities. Its ability to handle intricate metagenomic datasets with greater precision makes it a strong choice for high-quality assemblies (Bankevich et al., 2012; Meyer et al., 2022; Nurk et al., 2012; Sczyrba et al., 2017).


Velvet:


Velvet, one of the earlier assemblers utilizing de Bruijn graphs, has largely fallen out of favor for metagenomic assembly due to its limitations. While effective for assembling single-genome data, it struggles with the complexity and variability inherent in metagenomic datasets, often resulting in less effective assemblies. Velvet also demands more computational resources and produces lower-quality results compared to modern assemblers like MEGAHIT and metaSPAdes, making it a less viable option for contemporary metagenomic projects (Terrón-Camero et al., 2022).

After reviewing the literature, MEGAHIT and metaSPAdes emerged as the most reliable assemblers for metagenomic data. MEGAHIT consistently outperformed in terms of speed and memory efficiency, while metaSPAdes demonstrated better assembly accuracy and genome fraction coverage, although MEGAHIT occasionally showed a slight advantage in genome fraction during benchmark studies. Ultimately, we chose to use MEGAHIT due to its scalability for large metagenomic projects and comparable quality, aligning with our goal of analyzing as many metagenomes as possible for satellite phage identification (Kumar et al., 2023; Wang et al., 2020).


Evaluating Assemblies:


metaQUAST is a specialized quality assessment tool designed for evaluating the accuracy and completeness of metagenomic assemblies. Building on the functionality of QUAST (Quality Assessment Tool for Genome Assemblies), metaQUAST includes features tailored to address the complexities of metagenomic data. It operates by mapping the original reads back to the assembled sequences, enabling the assessment of key metrics such as genome coverage, the number of misassemblies, contig lengths, and the presence of fragmented or missing regions (Gurevich, 2013; Mikheenko, 2016).

By using metaQUAST, we were able to map our reads back to the assembled contigs to assess both the coverage and quality of our assemblies. A high proportion of reads mapping back to the contigs indicates a well-supported assembly with minimal gaps or errors. After running metaQUAST on the assemblies produced by MEGAHIT, we observed excellent read mapping coverage, with the vast majority of reads aligning back to the contigs. These results validated the high quality of the MEGAHIT assembly and confirmed that it was an optimal choice for our project.

References