Metagenomic Assembler Selection
We sought advice on whether MEGAHIT, the metagenomic assembler we were using, was the best choice for our project, given the range of metagenomic assembly software available. Dr. Negrón emphasized that there is no universally "best" assembler and that a tool's suitability depends on its ability to generate reliable results aligned with the specific requirements of the project. After we explained that MEGAHIT met our needs for both reduced computational time and low memory usage, he advised that we continue to use it.
Database Optimization
Our second topic of discussion concerned how to filter the NCBI Non-Redundant (NR) database, as using the whole database was too computationally expensive for us to annotate large genomic samples efficiently while running SaPhARI. We searched for an efficient way to filter the NCBI NR database to include protein sequences from only bacterial, archaeal, and viral species. While we could find pre-filtered databases on the NCBI FTP server for the NCBI Nucleotide database, such options did not exist for the NCBI NR database. The only form available for download was the entire NR database.
Dr. Negrón offered two suggestions for optimizing database efficiency. The first was to construct a custom database from a list of only our desired taxids, which would involve downloading the matching sequences and building a multi-FASTA file. The second was to use the -taxids option of the BLAST+ command, which restricts the search to sequences from organisms with the given taxids. Dr. Negrón cautioned against the former option because the database would need to be updated frequently in case NCBI changes the taxid of any species. Outdated taxids could compromise the results of future BLAST+ searches, so we would need software able to reassign old taxid numbers to the new ones.
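The second suggestion can be sketched as a small Python helper that assembles a blastp call restricted by taxid. This is a minimal sketch under assumptions, not our pipeline code: the file names are hypothetical, and the taxids shown are the NCBI Taxonomy identifiers for Bacteria (2), Archaea (2157), and Viruses (10239). The -taxids option requires a version 5 BLAST database with taxonomy information, and older BLAST+ releases may expect species-level taxids rather than high-level ones, so the documentation for the installed version should be checked.

```python
import shlex

# NCBI Taxonomy identifiers for the three groups of interest.
TAXIDS = {"Bacteria": 2, "Archaea": 2157, "Viruses": 10239}


def build_blastp_command(query, db, out, taxids):
    """Return a blastp argument list limited to the given taxids."""
    taxid_arg = ",".join(str(t) for t in taxids)
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-taxids", taxid_arg,  # restrict hits to these taxids
        "-outfmt", "6",        # tabular output
        "-out", out,
    ]


# Hypothetical file names, for illustration only.
cmd = build_blastp_command("proteins.faa", "nr", "hits.tsv", TAXIDS.values())
print(shlex.join(cmd))
```

On a machine where BLAST+ and the database are installed, the resulting list could be passed to subprocess.run(cmd, check=True).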
Acting on Dr. Negrón’s broader warnings about filtering the NCBI NR database ourselves, we looked for pre-filtered databases that were publicly available online. We ultimately decided to use the RefSeq bacterial, viral, and archaeal non-redundant files from the NCBI FTP server, which we then compiled into a DIAMOND protein database. His advice helped us evaluate the computational trade-offs and influenced our final decisions on database management.
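The database build described above can be sketched in Python as follows. The file and database names are hypothetical placeholders, and the sketch assumes the downloaded RefSeq protein files have already been combined into one FASTA file. It only assembles the commands; diamond makedb and diamond blastp are the DIAMOND subcommands for building and searching a protein database.

```python
def diamond_makedb_command(fasta_path, db_name):
    """Argument list that builds a DIAMOND protein database from a FASTA file."""
    return ["diamond", "makedb", "--in", fasta_path, "--db", db_name]


def diamond_blastp_command(db_name, query_path, out_path):
    """Argument list that searches protein queries against the database."""
    return [
        "diamond", "blastp",
        "--db", db_name,
        "--query", query_path,
        "--out", out_path,
    ]


# Hypothetical names, for illustration only.
makedb_cmd = diamond_makedb_command("refseq_bac_arc_vir.faa", "refseq_bav")
search_cmd = diamond_blastp_command("refseq_bav", "sample_proteins.faa", "hits.tsv")
print(" ".join(makedb_cmd))
print(" ".join(search_cmd))
```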
Pipeline Efficiency
Dr. Negrón recommended exploring Snakemake as a way to streamline our workflow. Snakemake is a workflow management system comparable to Nextflow, which SaPhARI uses. As with metagenomic assemblers, his guidance reinforced the importance of selecting a tool that best supports the project’s needs and matches our team’s familiarity with the software. While Snakemake seemed very promising from his description, we felt our needs were being suitably met by Nextflow for this project. However, we are inclined to look into Snakemake for future projects because of its relative simplicity. For example, Snakemake is built on Python, a more widely used language than Groovy, the Java-based language used by Nextflow.
Furthermore, Dr. Negrón introduced us to Nextstrain, a tool for tracking viral variants, which holds potential for future phases of our research. Nextstrain and Nextclade are software tools that predict the clade and similarity of viruses given a query sequence and a reference database of similar viral sequences. These tools could be extremely useful for identifying satellite families and visualizing their differences through Nextstrain’s graphical user interface, especially once we have a comprehensive database of each phage satellite family.
Overall, Dr. Negrón reviewed our bioinformatics pipeline and found it clear and well-structured. His professional guidance played a significant role in refining our overall approach, enabling us to make informed decisions that balance computational efficiency with the project’s objectives.
Further guidance on Nextstrain
Following our meeting with Dr. Daniel Negrón in August, a follow-up meeting was arranged with both Dr. Abramson and Dr. Negrón of Noblis to discuss new bioinformatics-related questions. We discussed our initial reservations about using Nextstrain, which stemmed mostly from concerns that our satellite phage database would not contain enough reference material to accurately model each phage satellite family. However, Dr. Negrón highlighted that not many reference genomes are required to use the software and encouraged us to experiment with it locally on our computers.
Hypothetical Proteins
Following the construction of SaPhARI, we have sought additional metrics for distinguishing satellite phage families from each other. One such metric could lie in the vast number of hypothetical proteins, which are proteins of uncharacterized function, found in phage satellites. If we could assign each hypothetical protein a unique identifier code based on its sequence, then perhaps we could compare these codes to assess genetic or proteomic similarities within satellite families. While assigning function to hypothetical proteins is a difficult task in practice, we could compare their nucleotide or amino acid sequences to add another layer of similarity.
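One way to make the identifier idea concrete is to hash each protein's amino acid sequence. The following is a minimal Python sketch, not an implemented part of SaPhARI: the function names, the HYP_ prefix, and the 12-character length are hypothetical choices. A hash only matches identical sequences, so proteins that are similar but not identical would receive unrelated identifiers and would still need alignment or clustering to be compared.

```python
import hashlib


def hypothetical_protein_id(sequence, length=12):
    """Derive a stable identifier from an amino acid sequence."""
    # Normalize: remove whitespace, upper-case, drop a trailing stop symbol.
    normalized = "".join(sequence.split()).upper().rstrip("*")
    digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
    return "HYP_" + digest[:length]


def shared_ids(proteins_a, proteins_b):
    """Identifiers present in both collections of sequences."""
    ids_a = {hypothetical_protein_id(s) for s in proteins_a}
    ids_b = {hypothetical_protein_id(s) for s in proteins_b}
    return ids_a & ids_b


# Toy sequences, for illustration only.
satellite_a = ["MKTAYI", "MSSLL"]
satellite_b = ["mktayi*", "MQQRV"]
print(shared_ids(satellite_a, satellite_b))
```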
A tool potentially useful for this in the future is NCBI’s Conserved Domain Search (CD-Search), a BLAST-based tool that looks for functional or structural units within amino acid sequences that can provide insights into a protein’s structure or function. Hypothetical proteins could be mapped into broad protein families, which could then be compared among satellites to provide another parameter when classifying a suspected satellite phage into a specific family. We look forward to exploring and potentially implementing this software tool in the future.
Satellite Family Clustering
After this excellent advice on leveraging hypothetical proteins to strengthen our satellite identification software, we continued the conversation on how to improve our current method of differentiating satellite families. Both professionals suggested using the Jaccard index, a measure of similarity between two sets defined as the size of their intersection divided by the size of their union. Our current code takes protein sets from multiple files and finds combinations that maximize the number of files sharing a minimum threshold of proteins, akin to how the Jaccard index measures similarity between sets. By focusing on the intersection of protein sets, our code identifies the optimal set that maximizes coverage across files based on the ratio of shared proteins to total unique proteins, similar to calculating the Jaccard index.
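The Jaccard index itself, |A ∩ B| / |A ∪ B| for two sets A and B, can be computed directly from protein sets. The following minimal Python sketch scores every pair of sets; the satellite labels and protein names are made-up examples, and this is not the code used in SaPhARI.

```python
from itertools import combinations


def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def pairwise_jaccard(protein_sets):
    """Jaccard index for every pair of labeled protein sets."""
    return {
        (x, y): jaccard_index(protein_sets[x], protein_sets[y])
        for x, y in combinations(sorted(protein_sets), 2)
    }


# Made-up protein sets, for illustration only.
example = {
    "satellite_1": {"integrase", "capsid", "terminase small subunit"},
    "satellite_2": {"integrase", "capsid", "primase"},
    "satellite_3": {"primase"},
}
scores = pairwise_jaccard(example)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

In this example, satellite_1 and satellite_2 share two of four unique proteins, giving an index of 0.5.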
In addition to the Jaccard index, Dr. Negrón suggested using the machine learning algorithms in the Python library scikit-learn to build models of distinct satellite families. He recommended looking into the bag-of-words model, which would capture which protein names are present within a given satellite family. The CountVectorizer class in scikit-learn does exactly this: we could gather the annotated protein names from each file into a single string per file. Then, by applying CountVectorizer, we could create a count matrix representing the frequency of these protein names across files, enabling further analysis of similarities between protein families.
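The count matrix described above can be illustrated without any third-party dependency. The following Python sketch uses only the standard library to build the same kind of bag-of-words matrix, so it is a stand-in for CountVectorizer rather than a call to it; the file names and protein names are invented for illustration. One design difference is worth noting: this sketch treats each full protein name as a single token, whereas CountVectorizer by default splits text into individual words.

```python
from collections import Counter


def count_matrix(annotations):
    """Build a bag-of-words count matrix.

    annotations maps a file name to its list of annotated protein names.
    Returns (files, vocabulary, rows), where rows[i][j] is how often
    vocabulary[j] occurs in files[i].
    """
    files = sorted(annotations)
    counts = {
        f: Counter(name.strip().lower() for name in annotations[f])
        for f in files
    }
    vocabulary = sorted(set().union(*counts.values()))
    rows = [[counts[f][term] for term in vocabulary] for f in files]
    return files, vocabulary, rows


# Invented annotations, for illustration only.
annotations = {
    "sat_a.tsv": ["Integrase", "capsid", "hypothetical protein", "hypothetical protein"],
    "sat_b.tsv": ["integrase", "primase"],
}
files, vocabulary, rows = count_matrix(annotations)
print(vocabulary)
for name, row in zip(files, rows):
    print(name, row)
```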
We were extremely grateful for another opportunity to discuss with Dr. Negrón and Dr. Abramson how to improve our software so that it better classifies satellite families. Both scientists gave great suggestions on novel software that we could use to improve SaPhARI in the future.