A pipeline for identifying chromoproteins & pigment synthesis enzymes in organisms from genomic sequences
There are several methods for identifying a potential chromoprotein, each with its own advantages and disadvantages. One such method is protein sequencing, which typically involves purification steps followed by sensitive mass-spectrometric analysis. While effective for protein identification, these methods can be cumbersome, and the specific protocol depends on the properties of the protein of interest. Pigments can be just as difficult to identify: the properties of small molecules differ widely, so their separation methods must vary accordingly.
An alternative approach that can be equally effective is to analyze genomic data to predict potential protein sequences. This method involves sequencing target organisms known to possess chromoproteins. Gene prediction software is then used to identify candidate genes in the sequenced genome, and the candidate genes are translated into protein candidates. These are compared against large databases of known chromoproteins or known pigment-producing enzymes [1]. High-scoring hits can then be used to identify related homologs. With the significant drop in genome sequencing costs, our method of protein sequence prediction offers a more affordable and faster approach to identifying and characterizing various proteins.
Our ChromoSearch pipeline is open-source and freely available for anyone to use at:
Sequenced genomic data (saved as .fasta) acts as the input for the pipeline. Using Prodigal, the pipeline identifies potential protein-coding sequences along with the putative proteins they encode. Depending on the database of choice, these putative proteins are then compared against known chromoproteins, pigment-producing enzymes, or other proteins using a BLASTP search. Putative protein sequences with sufficiently low E-values are re-evaluated against the database using Smith-Waterman local alignment. To further assess the plausibility of each local alignment, a normalized score is provided that accounts for the mass and length of the putative protein: the alignment score divided by the number of amino acids in the protein. These scores can then be compared to those of the positive and negative controls to assess the plausibility of the identified putative proteins. Furthermore, a Gumbel distribution is fitted to the normalized scores, since it models the distribution of extreme values and captures the tail behavior of the data. This is critical when assessing the likelihood of rare or outlier events, such as a positive hit for a candidate protein.
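The scoring steps described above can be illustrated with a minimal sketch. It is not the ChromoSearch implementation: the match/mismatch/gap values are toy assumptions (a real run would use a substitution matrix such as BLOSUM62), the function names are hypothetical, and the Gumbel parameters are estimated by the method of moments, which may differ from the fitting procedure the pipeline actually uses.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (toy scoring scheme)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def normalized_score(query, subject):
    """Alignment score divided by the number of residues in the putative protein."""
    return smith_waterman_score(query, subject) / len(query)


def fit_gumbel_moments(scores):
    """Method-of-moments estimate of the Gumbel location (mu) and scale (beta)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_tail_probability(x, mu, beta):
    """P(X >= x) under Gumbel(mu, beta); small values flag unusually high scores."""
    z = (x - mu) / beta
    return -math.expm1(-math.exp(-z))


# Hypothetical background scores (e.g. from negative controls) and one candidate.
background = [0.10, 0.20, 0.30, 0.25, 0.15]
mu, beta = fit_gumbel_moments(background)
candidate = normalized_score("MKTAYIAK", "KTAY")
print(candidate, gumbel_tail_probability(candidate, mu, beta))
```

Dividing by the query length keeps long proteins from scoring highly simply because they offer more residues to align; the Gumbel tail probability then expresses how unusual a given normalized score is relative to the fitted background.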
To make sense of the results, positive and negative controls were implemented for reference, along with default settings for each step. To statistically validate the capability of our pipeline to separate possible chromoproteins from the rest of the bacterial genome, we first needed representative positive controls. However, the proteins in our database already represent an exhaustive set of all quality annotated chromoproteins and pigment biosynthesis enzymes, so finding independent positive controls would prove difficult. For this reason, we used a variation of leave-one-out cross-validation, with each entry in our database being searched against the database itself. For negative controls, we used a balanced random selection from the Swiss-Prot database, limited to Eubacteria (taxonomy 2) and excluding chromophore-binding proteins.
To be deemed a positive candidate, a protein should therefore be expected to score somewhere between the negative and the positive controls. It should be noted, however, that this score alone is insufficient to determine whether a specific protein is responsible for the color of a particular bacterium. It can, however, help rule out many sequences and guide bioinformaticians in the right direction. It is equally important to note that the identification of a candidate protein as a chromoprotein in the analyzed genome does not necessarily imply that it is transcribed and/or translated in the organism of interest.
To assess our pipeline, two different approaches were taken. In the first, bootstrap 95% confidence intervals were constructed for the median of both the negative and positive controls, using the bias-corrected and accelerated (BCa) method. The median was selected as an estimator due to the assumed extreme skew of our underlying sample distributions. With 1e6 resamples, the interval for the median normalized score was [5.1230, 5.1972] for the positive controls and [0.1888, 0.3784] for the negative controls. Since these intervals are far apart, we can assume that our pipeline is in general capable of discerning bacterial chromoproteins. The second approach was a bootstrapped hypothesis test, also based on the median, which resulted in a p-value < 1e-6.
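Both validation steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the analysis code behind the numbers above: it uses the simpler percentile interval instead of BCa, imposes the null hypothesis by shifting both samples to a common median (one of several possible bootstrap tests; the text does not say which variant was used), and the function names and example scores are hypothetical.

```python
import random
import statistics


def bootstrap_median_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median (not BCa)."""
    rng = random.Random(seed)
    n = len(sample)
    medians = sorted(
        statistics.median(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def bootstrap_median_test(pos, neg, n_resamples=2000, seed=0):
    """One-sided bootstrap test of H0: median(pos) == median(neg).

    Both samples are shifted to share the pooled median, so resampling
    happens in a world where the null hypothesis is true.
    """
    rng = random.Random(seed)
    observed = statistics.median(pos) - statistics.median(neg)
    common = statistics.median(list(pos) + list(neg))
    pos0 = [x - statistics.median(pos) + common for x in pos]
    neg0 = [x - statistics.median(neg) + common for x in neg]
    hits = 0
    for _ in range(n_resamples):
        diff = statistics.median(rng.choices(pos0, k=len(pos0))) - statistics.median(
            rng.choices(neg0, k=len(neg0))
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical normalized scores for positive and negative controls.
positives = [5.00, 5.10, 5.20, 5.15, 5.05, 5.12, 5.18, 5.08]
negatives = [0.20, 0.30, 0.25, 0.35, 0.19, 0.28, 0.31, 0.22]
print(bootstrap_median_ci(positives), bootstrap_median_ci(negatives))
print(bootstrap_median_test(positives, negatives))
```

With the add-one correction, the smallest p-value such a test can report is 1 / (n_resamples + 1), so the resolution of the test is bounded by the number of resamples.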
As outlined above, the ChromoSearch pipeline enables the rapid identification of potential chromoproteins in bacterial genomes. This functionality was employed to validate the lab findings related to environmental strains. For P91, S. faecium, our pipeline identified two proteins with homology to deoxyribodipyrimidine photolyase from A. nidulans (UniProt ID: PHR_SYNP6) and cryptochrome DASH from S. lycopersicum (UniProt ID: CRYD_SOLLC). These matches had normalized scores of 1.52 (p = 0.062) and 1.44 (p = 0.062), respectively. Both proteins bind the cofactor flavin adenine dinucleotide (FAD), known for its yellow color. Notably, both candidates have a mass of approximately 50 kDa, consistent with the results from semi-native SDS-gel experiments. This provides substantial evidence that these proteins may be responsible for the observed color in the strains.
A similar result was observed for strain P350, M. glutamicibacter. ChromoSearch identified a single hit with a normalized score of 1.96 (p = 0.001), again linked to Deoxyribodipyrimidine photolyase from A. tumefaciens. This protein also binds the FAD cofactor, making it a strong candidate for the vibrant yellow color observed in the strain.
One of the most interesting pigment-producing enzyme candidates identified by our pipeline is a putative protein with a normalized score close to 1.84 and a corrected p-value of roughly 0.002. This putative protein, identified in the genome of P350, scored well against 4,4'-diaponeurosporen-aldehyde dehydrogenase, an enzyme known to participate in the biosynthesis of the yellow-orange carotenoid staphyloxanthin. Given the yellow-orange color of the bacteria, this candidate was at first considered likely to be responsible for the color. However, results from experimental work in our lab show that the color most likely originates from a protein and not a pigment. Even so, this remains one of the strongest contenders for the colors identified in this particular strain.
Another interesting putative protein candidate, identified in the genome of P298, also scored well against 4,4'-diaponeurosporen-aldehyde dehydrogenase. This is notable because the color of the P298 strain is similar to that of the previously mentioned P350 strain, and both score highly against the same pigment-producing enzyme. This might mean that the two share the same coloring mechanism, owing their color to a staphyloxanthin-like pigment.
1. For chromoproteins, sequences were retrieved from UniProt using (keyword:KW-0157) AND (reviewed:true) AND (taxonomy_id:2); data accessed July 2024. For the pigment-producing enzymes, bacterial proteins annotated with the term GO:0046148 were selected from the Gene Ontology database in July 2024. Available protein sequences were downloaded by ID from Swiss-Prot.
2. However, the pipeline is far from perfect. BLASTP only compares sequence identity. Consequently, proteins that are biologically similar yet have low sequence identity may go unnoticed by our pipeline and be ruled out early. This applies to the positive control sequences as well as to the putative proteins characterized through our pipeline.
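For readers who want to rebuild the chromoprotein reference set, the UniProt query given in note 1 can be turned into a download URL. The sketch below only constructs the URL and makes no network request; it assumes the public UniProt REST `stream` endpoint with `query` and `format` parameters, and the helper name is hypothetical. It is not necessarily how the original database was assembled, and results retrieved today may differ from the July 2024 snapshot.

```python
from urllib.parse import urlencode

# Assumed endpoint: UniProt REST API streaming download for UniProtKB entries.
UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"

# Query string exactly as given in note 1.
CHROMOPROTEIN_QUERY = "(keyword:KW-0157) AND (reviewed:true) AND (taxonomy_id:2)"


def uniprot_fasta_url(query):
    """Build a URL that would return all entries matching `query` as FASTA."""
    return f"{UNIPROT_STREAM}?{urlencode({'query': query, 'format': 'fasta'})}"


url = uniprot_fasta_url(CHROMOPROTEIN_QUERY)
print(url)
```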