A pipeline for identifying chromoproteins & pigment synthesis enzymes in organisms from genomic sequences
There are several methods for identifying a potential chromoprotein, each with its own advantages and disadvantages. One such method is protein sequencing, which often involves purification steps as well as sensitive mass-spectrometric methods. While these are effective methods for protein identification, they can be cumbersome, and the specific methods depend on the properties of the protein of interest. Pigments may also be difficult to identify, because the properties of small molecules differ and their separation methods must vary accordingly.
An alternative approach that can be equally effective is analyzing genomic data to predict potential protein sequences. This method involves sequencing target organisms known to possess chromoproteins. Gene-prediction software is then used to predict candidate genes in the sequenced genome. Thereafter, the candidate genes are translated into protein candidates. These are then compared to large databases of known chromoproteins or known pigment-producing enzymes [1]. This, in turn, may yield high-scoring hits that can be used to identify related homologs. With the significant drop in genomic sequencing costs, our method of protein sequence prediction offers a more affordable and faster approach for the identification and characterization of various proteins.
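As a rough illustration of the translation step described above, the sketch below converts a coding sequence into a protein sequence using the standard genetic code. In practice the gene-prediction software performs this step itself; the function name `translate` and the choice to mark unknown codons as `X` are assumptions made for this example only.

```python
# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon.

    Codons containing ambiguous bases are written as 'X' (an assumption of this sketch).
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Bacterial gene finders typically use translation table 11, which assigns the same amino acids as the standard table and differs mainly in which codons may act as start codons, so the table above is adequate for illustration.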
However, converting the NGS data into protein candidates requires many different programs to be run sequentially, in what is known as a pipeline.
Sequenced genomic data (saved as .fasta) acts as the input for the pipeline. Using Prodigal, the pipeline identifies potential protein-coding sequences and the putative proteins they encode. Then, depending on the database of choice, these putative proteins are compared with known chromoproteins, pigment-producing enzymes or other proteins using a BLASTP search. Putative protein sequences with sufficiently low E-values, that is, the most significant hits, are then re-evaluated against the database using the Smith-Waterman local alignment tool. To further assess the plausibility of each local alignment, a normalized score is computed that accounts for the size (mass and length) of the putative protein: the alignment score divided by the number of amino acids in the protein. These scores can then be compared to those of the positive and negative controls to assess the plausibility of the identified putative proteins. Furthermore, a Gumbel distribution is fitted to the normalized scores, since it effectively models the distribution of extreme values and captures the tail behavior of the data. This is often critical in assessing the likelihood of rare or outlier events, such as a positive hit for a candidate protein.
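The scoring steps can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: it assumes the normalized score is the local alignment score divided by the protein length, uses a toy match/mismatch scheme with a linear gap penalty instead of a substitution matrix such as BLOSUM62, and fits the Gumbel distribution by the method of moments, since the fitting procedure is not specified here. All function names and parameter values are hypothetical.

```python
import math
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def normalized_score(raw_score, protein_length):
    """Alignment score per amino acid (assumed definition)."""
    return raw_score / protein_length


def fit_gumbel(scores):
    """Method-of-moments estimate of the Gumbel location (mu) and scale (beta)."""
    beta = stdev(scores) * math.sqrt(6) / math.pi
    mu = mean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_pvalue(score, mu, beta):
    """Upper-tail probability P(X >= score) under the fitted Gumbel distribution."""
    z = (score - mu) / beta
    # 1 - exp(-exp(-z)), written with expm1 for accuracy in the far tail.
    return -math.expm1(-math.exp(-z))
```

A typical use would be to fit `fit_gumbel` to the normalized scores of all hits from one run and then call `gumbel_pvalue` on each candidate, so that a small value flags a score far out in the upper tail.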
To make sense of the results, positive and negative controls were implemented for reference, along with the default settings for each step. The positive control was established by removing a protein sequence from the existing database; searching this sequence against the database should ideally [2] yield a high score. The negative control was instead implemented using proteins known to be both sequentially and functionally different. In this case, hemoglobin was used. A FASTA file consisting of the positive control along with many negative controls was thereafter run against the database to put the scores obtained from our actual samples into context. As may be discerned from the graph, the normalized scores of the positive controls ranged from 1.1 to 5. Any protein should therefore be expected to score somewhere between the negative and the positive controls if it is to be deemed a positive protein candidate. It should be noted, however, that this score alone is insufficient to determine whether a specific protein is responsible for the color of a particular bacterium. It can, however, help rule out many sequences and guide bioinformaticians in the right direction. It is equally important to note that the identification of a candidate protein as a chromoprotein in the analyzed genome does not necessarily imply that it is transcribed and/or translated in the organism of interest.
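The construction of the positive control can be illustrated as follows: one record is held out of the database and becomes the query, while the remaining records form the search database. This is a sketch under assumptions; the helper names (`parse_fasta`, `hold_out`, `to_fasta`) are hypothetical and the actual scripts may organize this step differently.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def hold_out(records, index):
    """Remove one record to serve as positive control; return it and the rest."""
    control = records[index]
    remaining = records[:index] + records[index + 1:]
    return control, remaining


def to_fasta(records, width=60):
    """Serialize (header, sequence) tuples back to FASTA text."""
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

The held-out record is only a useful positive control if close homologs remain in the reduced database; a sequence with no relatives left would score poorly even though it is a genuine chromoprotein.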
After the positive and negative controls had been passed through the pipeline, the assembled genomic data was processed. Each genomic sample passed through the pipeline in two iterations, one for the chromoprotein database and another for the pigment-producing enzyme database. We utilized the default settings as specified in the code available at our GitLab repository.
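The two iterations per sample might be orchestrated roughly as below. This is a hedged sketch, not the repository's code: the database paths, output file names and the E-value cutoff are hypothetical placeholders, and the Smith-Waterman re-scoring step is left out because the tool used for it is not named here. Only widely documented Prodigal options (`-i`, `-a`, `-o`, `-q`) and BLAST+ `blastp` options (`-query`, `-db`, `-evalue`, `-outfmt`, `-out`) are used.

```python
from pathlib import Path

# Hypothetical BLAST database prefixes, one per pipeline iteration.
DATABASES = {
    "chromoproteins": "db/chromoproteins",
    "pigment_enzymes": "db/pigment_enzymes",
}


def build_commands(genome_fasta, evalue=1e-5):
    """Return the command lines for one sample: gene prediction, then one search per database.

    The E-value cutoff is an assumed placeholder, not the pipeline's default.
    """
    genome = Path(genome_fasta)
    proteins = genome.with_suffix(".faa")
    commands = [[
        "prodigal",
        "-i", str(genome),                       # input genome
        "-a", str(proteins),                     # translated putative proteins
        "-o", str(genome.with_suffix(".gbk")),   # gene coordinates
        "-q",                                    # quiet mode
    ]]
    for label, database in DATABASES.items():
        hits = genome.with_name(f"{genome.stem}.{label}.tsv")
        commands.append([
            "blastp",
            "-query", str(proteins),
            "-db", database,
            "-evalue", str(evalue),
            "-outfmt", "6",                      # tabular output
            "-out", str(hits),
        ])
    return commands
```

Each command list can be passed to `subprocess.run(command, check=True)` in order; building the commands separately from running them makes it easy to inspect what a run would do before any tool is executed.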
Two candidate chromoproteins identified through ChromoSearch
As outlined above, the ChromoSearch pipeline enables the rapid identification of potential chromoproteins in bacterial genomes. This functionality was employed to validate the lab findings related to environmental strains. For P91, S. faecium, our pipeline identified two proteins with homology to deoxyribodipyrimidine photo-lyase from A. nidulans (UniProt ID: PHR_SYNP6) and cryptochrome DASH from S. lycopersicum (UniProt ID: CRYD_SOLLC). These matches had normalized scores of 1.52 (p = 0.062) and 1.44 (p = 0.062), respectively. Both proteins bind the cofactor flavin adenine dinucleotide (FAD), known for its yellow color. Notably, both candidates have a mass of approximately 50 kDa, consistent with the results from semi-native SDS-gel experiments. Together, these observations support the hypothesis that these proteins may be responsible for the observed color in the strains.
A similar result was observed for strain P350, M. glutamicibacter. ChromoSearch identified a single hit with a normalized score of 1.96 (p = 0.001), again linked to Deoxyribodipyrimidine photo-lyase from A. tumefaciens. This protein also binds the FAD cofactor, making it a strong candidate for the vibrant yellow color observed in the strain.
One of the most interesting pigment-producing enzyme candidates identified by our pipeline is a putative protein with a normalized score close to 1.84 and a corrected p-value of roughly 0.002. This putative protein, identified in the genome of P350, scored well against 4,4'-diaponeurosporen-aldehyde dehydrogenase, a pigment-producing enzyme known to participate in the biosynthesis of the yellow-orange carotenoid staphyloxanthin. Given the yellow-orange color of the bacteria, this candidate was at first considered likely to be responsible for the color. Results from experimental work in our lab, however, indicate that the color most likely originates from a protein and not a pigment. Even so, this remains one of the strongest contenders for the color of this particular strain.
Another interesting putative protein candidate, identified in the genome of P298, also scored well against 4,4'-diaponeurosporen-aldehyde dehydrogenase. This is notable because the color of the P298 strain is similar to that of the previously mentioned P350 strain, and both strains score highly against the same pigment-producing enzyme. This might mean that the two share the same coloring mechanism, owing their color to a staphyloxanthin-like pigment.
1. For chromoproteins, sequences were retrieved using (keyword:KW-0157) AND (reviewed:true) AND (taxonomy_id:2). For pigment-producing enzymes, the search used (go:0046148) AND (taxonomy_id:2). Data accessed from UniProt, July 2024.
2. However, the pipeline is far from perfect. BLASTP compares proteins by sequence alone. Consequently, proteins that are biologically similar yet have low sequence identity may go unnoticed by our pipeline and be ruled out early. This is true for the sequences used as positive controls as well as for the putative proteins characterized through our pipeline.