Identification of Chromoproteins & Pigments: A pipeline

There are several methods for identifying a potential chromoprotein, each with its own advantages and disadvantages. One such method is protein sequencing, which often involves purification steps as well as sensitive mass-spectrometric methods. While these methods are effective for protein identification, they can be cumbersome, and the specific protocol depends on the properties of the protein of interest. Pigments can be difficult to identify as well: the properties of small molecules differ, so their separation methods must vary accordingly.

An alternative approach that can be equally effective is analyzing genomic data to predict potential protein sequences. This method involves sequencing target organisms known to possess chromoproteins. Gene-prediction software is then used to predict candidate genes in the sequenced genome, and the candidate genes are translated into protein candidates. These are compared to large databases of known chromoproteins or known pigment-producing enzymes¹. This, in turn, may yield high-scoring hits that can be used to identify related homologs. With the significant drop in genomic sequencing costs, our method of protein sequence prediction offers a more affordable and faster approach for the identification and characterization of various proteins.

Check it out!

Our ChromoSearch pipeline is open-source and freely available for anyone to use at:

How does it work?

Sequenced genomic data (saved as .fasta) acts as the input for the pipeline. Using Prodigal, the pipeline identifies potential protein-coding sequences and translates them into putative proteins. Depending on the database of choice, these putative proteins are then compared with known chromoproteins, pigment-producing enzymes or other proteins using a pBLAST search. Putative protein sequences with sufficiently low E-values are re-evaluated against the database using Smith-Waterman local alignment. To further assess the plausibility of each local alignment, a normalized score is reported, equal to the alignment score divided by the number of amino acids of the protein; the mass and length of the putative protein are reported alongside it. These normalized scores can then be compared to those of the positive and negative controls to assess the plausibility of the identified putative proteins. Furthermore, a Gumbel distribution is fitted to the normalized scores, since it models the distribution of extreme values and captures the tail behavior of the data. This is critical when assessing the likelihood of rare or outlier events, such as a positive hit of a candidate protein.

**Figure 1:** Flow chart of the pipeline, illustrating its sequential steps. The pipeline is divided into six distinct parts, each corresponding to a script called by the main script (chromosearch.py).

Step by step explanation

Identification of putative genes and conversion to putative proteins (.fasta file):

Purpose: The pipeline begins by taking a .fasta file as input, which contains the complete genome sequence of the environmental strain. A software tool called Prodigal, which is based on a dynamic-programming gene-finding algorithm, is then used to identify possible open reading frames and coding sequences in the genome. Once the coding sequences are identified, they are translated from their DNA sequence into protein sequences. This is done for two main reasons. First, because the genetic code is redundant (several codons encode the same amino acid), working with protein sequences simplifies the process by reducing the number of comparisons required. Second, proteins are the ultimate products of gene expression, making them the central focus of our investigation into their role in coloration.
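The gene-prediction step can be sketched as a thin wrapper around the Prodigal command line. This is a minimal illustration and not the pipeline's own script: the function names and output file names are hypothetical, and it assumes the `prodigal` executable is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_prodigal_command(genome_fasta, protein_fasta, gene_coords):
    """Return the argument list for a single-genome Prodigal run."""
    return [
        "prodigal",
        "-i", str(genome_fasta),   # input genome (nucleotide FASTA)
        "-a", str(protein_fasta),  # output: translated putative proteins
        "-o", str(gene_coords),    # output: predicted gene coordinates
        "-q",                      # run quietly
    ]


def run_prodigal(genome_fasta, out_dir):
    """Run Prodigal and return the path of the translated protein file.

    The output file names below are illustrative assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    proteins = out_dir / "putative_proteins.faa"
    coords = out_dir / "genes.gbk"
    command = build_prodigal_command(genome_fasta, proteins, coords)
    subprocess.run(command, check=True)  # raises if Prodigal exits with an error
    return proteins
```

Keeping the command construction separate from the call to `subprocess.run` makes it possible to inspect or log the exact command before anything is executed.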

pBLAST Search:

Purpose: After translating the DNA into protein sequences, the putative protein sequences are compared against a local database of known chromoproteins or pigment-producing enzymes. This is done using a tool called pBLAST (Protein Basic Local Alignment Search Tool). The pBLAST search identifies, by sequence similarity, which proteins in our environmental strain resemble known proteins involved in coloration. pBLAST is used in conjunction with the Smith-Waterman algorithm to reduce the running time: because pBLAST starts by looking for matches of short words, it is less computationally demanding and takes considerably less time. It is expected to filter out most of the poorly scoring hits, allowing for a more focused evaluation of higher-scoring proteins using the Smith-Waterman method later on.
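A local protein search of this kind is usually run in two stages with the BLAST+ command-line tools: building a protein database from the reference sequences, then querying it. The sketch below only constructs the two commands. The function names and file names are hypothetical, tabular output (`-outfmt 6`) is an assumption made for easy downstream parsing, and the tools `makeblastdb` and `blastp` are assumed to be installed.

```python
def build_makeblastdb_command(reference_fasta, db_name):
    """Command that builds a local protein database from known sequences."""
    return [
        "makeblastdb",
        "-in", reference_fasta,  # FASTA file of known chromoproteins or enzymes
        "-dbtype", "prot",       # protein database
        "-out", db_name,
    ]


def build_blastp_command(query_fasta, db_name, out_tsv, evalue=0.05):
    """Command that searches putative proteins against the local database."""
    return [
        "blastp",
        "-query", query_fasta,    # putative proteins from gene prediction
        "-db", db_name,
        "-evalue", str(evalue),   # only report hits at or below this E-value
        "-outfmt", "6",           # tab-separated output, one hit per line
        "-out", out_tsv,
    ]
```

Each list can be passed to `subprocess.run(command, check=True)` to execute the corresponding tool.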

Result Filtering and Sorting:

Purpose: The results from the pBLAST search are then filtered and sorted by significance. This is done using the E-value, which is the number of matches with an equal or better score that would be expected to occur by chance in a database of the same size. By setting an E-value threshold of 0.05, we retain only matches that are unlikely to be due to chance, which helps us focus on the most promising putative proteins.
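The filtering and sorting step can be illustrated on tab-separated BLAST output. The sketch assumes the standard 12-column tabular layout (`-outfmt 6`), treats the threshold as inclusive, and sorts by ascending E-value; all of these are assumptions rather than documented pipeline behavior, and the example hits are invented.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_and_sort_hits(tabular_text, max_evalue=0.05):
    """Keep hits with E-value <= max_evalue, most significant first."""
    reader = csv.DictReader(
        io.StringIO(tabular_text), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    hits = []
    for row in reader:
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        if row["evalue"] <= max_evalue:
            hits.append(row)
    # Lowest E-value first; ties broken by the higher bit score.
    hits.sort(key=lambda hit: (hit["evalue"], -hit["bitscore"]))
    return hits


# Invented example hits, for illustration only.
example = (
    "prot_1\tREF_A\t45.2\t300\t150\t5\t1\t300\t10\t310\t1e-50\t180\n"
    "prot_2\tREF_A\t25.0\t60\t40\t2\t5\t64\t100\t160\t2.5\t28.1\n"
    "prot_3\tREF_B\t38.0\t250\t140\t6\t3\t252\t1\t250\t3e-20\t95.0\n"
)
for hit in filter_and_sort_hits(example):
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

In this example the second hit is discarded because its E-value of 2.5 exceeds the threshold.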

Smith-Waterman Alignment:

Purpose: For the protein sequences that showed significant similarity in the pBLAST search, a more detailed comparison is performed using the Smith-Waterman algorithm. This method provides a local alignment between sequences in the database and the putative proteins, offering deeper insights into how similar the sequences really are. This step is particularly useful for validating the potential of the identified proteins as chromoprotein candidates.
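To make the local alignment step concrete, here is a compact Smith-Waterman scoring routine. It is a simplified sketch: it uses a fixed match/mismatch score and a linear gap penalty, whereas protein alignments normally use a substitution matrix and often affine gap penalties. The scoring parameters are illustrative assumptions, not the pipeline's settings.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Only two rows of the dynamic-programming matrix are kept in memory,
    which is enough when the score, not the alignment itself, is needed.
    """
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for a in seq_a:
        current = [0]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            up = previous[j] + gap        # gap in seq_b
            left = current[j - 1] + gap   # gap in seq_a
            cell = max(0, diagonal, up, left)  # never drop below zero
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(smith_waterman_score("WWACDEFYY", "KKACDEFKK"))  # shared core "ACDEF"
```

The `max(0, ...)` term is what makes the alignment local: a poorly matching region resets the score instead of penalizing the well-matching region next to it.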

Calculation of molecular weight, length and normalized score:

Purpose: Once the scores have been calculated using the Smith-Waterman algorithm, the theoretical molecular weights and lengths of our candidate proteins are calculated from their predicted primary structure. These biochemical properties can be used alongside experimental data obtained in the lab to further assess and validate the potential of the chromoproteins. In our lab, a native gel electrophoresis was run in conjunction with this pipeline, and the information gathered from it could be used to rule out, or strengthen our belief in, certain putative proteins.
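These quantities are simple to compute from the sequence alone. The sketch below assumes that the theoretical mass is the sum of average amino acid residue masses plus one water molecule, and that the normalized score is the alignment score divided by the number of amino acids. Both are assumptions about how the pipeline computes them, and the function names are hypothetical.

```python
# Average residue masses in daltons (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def clean_sequence(protein):
    """Upper-case the sequence and drop a trailing stop symbol ('*')."""
    return protein.upper().rstrip("*")


def molecular_weight(protein):
    """Theoretical average mass in daltons; raises KeyError on unknown residues."""
    residues = clean_sequence(protein)
    return sum(AVERAGE_RESIDUE_MASS[r] for r in residues) + WATER_MASS


def normalized_score(alignment_score, protein):
    """Alignment score per amino acid (assumed definition)."""
    return alignment_score / len(clean_sequence(protein))


print(round(molecular_weight("GG"), 2))  # glycylglycine, about 132.12 Da
```

Dividing by the length makes scores comparable across proteins of different sizes, since a long protein can otherwise accumulate a high raw score from a modest level of similarity.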

Calculating statistical outliers using a Gumbel distribution:

Purpose: To gain a deeper statistical perspective on our results, a Gumbel distribution is fitted to the normalized scores. This extreme-value distribution is commonly used to find statistical outliers and to quantify how surprising an individual sample is. Here, an E-value is assigned to each candidate.
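The idea can be sketched with a method-of-moments fit, which estimates the Gumbel location and scale from the sample mean and standard deviation. This is an assumption made for brevity; the pipeline may fit the distribution differently (for example by maximum likelihood), and the conversion from a tail probability to the reported E-value is not shown because it is not described here. The function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Estimate (location, scale) of a Gumbel distribution from a sample."""
    scale = statistics.stdev(scores) * math.sqrt(6) / math.pi
    location = statistics.fmean(scores) - EULER_GAMMA * scale
    return location, scale


def gumbel_tail_probability(score, location, scale):
    """P(X >= score) under the fitted Gumbel distribution.

    Uses expm1 so that very small tail probabilities keep their precision.
    """
    z = (score - location) / scale
    return -math.expm1(-math.exp(-z))


# Invented background scores, for illustration only.
background = [0.12, 0.19, 0.22, 0.25, 0.27, 0.31, 0.33, 0.38, 0.41, 0.55]
location, scale = fit_gumbel_moments(background)
print(gumbel_tail_probability(1.5, location, scale))
```

A small tail probability means the score lies far out in the right tail of the fitted background distribution, which is the sense in which a candidate counts as an outlier.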

Final Comparison and Analysis:

Purpose: The final step involves comparing the results from both the pBLAST and Smith-Waterman alignments, along with the normalized score and mass. The calculated E-value may also be assessed. By doing so, we can identify the most likely candidates for the color-giving proteins in the environmental strain. These results can then be further validated or used as a basis for experimental follow-up, as was done in our group.

Results: Making sense of our data

To make sense of the results, positive and negative controls were implemented for reference, using the default settings for each step. To statistically validate the capability of our pipeline to separate possible chromoproteins from the rest of the bacterial genome, we first needed representative positive controls. However, the proteins in our database already represent an exhaustive set of all quality-annotated chromoproteins and pigment biosynthesis enzymes, so finding additional, independent positive controls would prove difficult. For this reason, we decided to do a variation of leave-one-out cross-validation, with each entry in our database being searched against itself. For negative controls we used a balanced random selection from the Swiss-Prot database, limited to Eubacteria (taxonomy 2) and excluding any chromophore-binding proteins.

Any protein should therefore be expected to have a score somewhere in between the score of the positive and the negative control, if they are to be deemed positive protein candidates. It should be noted, however, that this score alone is insufficient to determine whether a specific protein is responsible for the color of a particular bacterium or not. It can, however, help rule out many sequences and guide bioinformaticians in the right direction. It's equally important to note that the identification of a candidate protein as a chromoprotein in the analyzed genome does not necessarily imply that it can be transcribed and/or translated in the organism of interest.

To assess our pipeline, two different approaches were taken. In the first, bootstrap 95% confidence intervals were computed for the median normalized score of both the negative and the positive controls, using the bias-corrected and accelerated (BCa) method. The median was selected as the estimator because the underlying sample distributions were assumed to be extremely skewed. With 1e6 resamples, the interval for the positive controls was [5.1230, 5.1972], while that for the negative controls was [0.1888, 0.3784]. Since the two intervals are far apart, we can assume that our pipeline is in general capable of discerning bacterial chromoproteins. The second approach was a bootstrapped hypothesis test, also based on the median, which resulted in a p-value < 1e-6.
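A bootstrap confidence interval for the median can be sketched as follows. Note that this uses the simpler percentile method rather than the BCa method named above, and far fewer resamples, purely to keep the example short. The function name and the sample values are invented for illustration.

```python
import random
import statistics


def bootstrap_median_ci(sample, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    medians = sorted(
        statistics.median(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = medians[int(alpha * n_resamples)]
    upper = medians[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lower, upper


# Invented normalized scores with one extreme value.
scores = [0.12, 0.19, 0.22, 0.25, 0.27, 0.31, 0.33, 0.38, 0.41, 0.55, 1.9]
print(bootstrap_median_ci(scores))
```

The single extreme value in the example shifts the mean considerably but barely affects the median, which illustrates why the median is a reasonable estimator for skewed score distributions.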

Two candidate chromoproteins identified through ChromoSearch

As outlined above, the ChromoSearch pipeline enables the rapid identification of potential chromoproteins in bacterial genomes. This functionality was employed to validate the lab findings related to environmental strains. For P91, S. faecium, our pipeline identified two proteins with homology to Deoxyribodipyrimidine photolyase from A. nidulans (UniProt ID: PHR_SYNP6) and Cryptochrome DASH from S. lycopersicum (UniProt ID: CRYD_SOLLC). These matches had normalized scores of 1.52 (p = 0.062) and 1.44 (p = 0.062), respectively. Both proteins bind the cofactor Flavin Adenine Dinucleotide (FAD), known for its yellow color. Notably, both candidates have a mass of approximately 50 kDa, consistent with the results from semi-native SDS-gel experiments. Together, these observations support the hypothesis that these proteins are responsible for the observed color in the strain.

A similar result was observed for strain P350, M. glutamicibacter. ChromoSearch identified a single hit with a normalized score of 1.96 (p = 0.001), again linked to Deoxyribodipyrimidine photolyase from A. tumefaciens. This protein also binds the FAD cofactor, making it a strong candidate for the vibrant yellow color observed in the strain.

Two candidate pigment-producing enzymes identified through ChromoSearch

One of the most interesting pigment-producing enzymes identified through our pipeline is a putative protein with a normalized score close to 1.84 and a corrected p-value of roughly 0.002. This putative protein, identified from the genome of P350, scored well against 4,4'-diaponeurosporen-aldehyde dehydrogenase, an enzyme known to participate in the biosynthesis of the yellow-orange carotenoid staphyloxanthin. Given the yellow-orange color of the bacteria, this candidate was at first considered likely to be responsible for the color. Results from experimental work in our lab, however, show that the color most likely originates from a protein and not a pigment. Even so, this remains one of the strongest contenders for the color observed in this particular strain.

Another interesting putative protein, identified in the genome of P298, also scored well against 4,4'-diaponeurosporen-aldehyde dehydrogenase. This is notable because the P298 strain is similar in color to the previously mentioned P350 strain and also scores highly against the same pigment-producing enzyme. The two strains might therefore share the same coloring mechanism, owing their color to a staphyloxanthin-like pigment.

Footnotes

1. For chromoproteins, sequences were retrieved from UniProt using the query (keyword:KW-0157) AND (reviewed:true) AND (taxonomy_id:2), accessed in July 2024. For the pigment-producing enzymes, bacterial proteins annotated with the term GO:0046148 were selected from the Gene Ontology database in July 2024. Available protein sequences were then downloaded from Swiss-Prot by ID.

2. However, the pipeline is far from perfect. BlastP only compares sequence identity. Consequently, proteins that are biologically similar yet have low sequence identity may go unnoticed by our pipeline and be ruled out early. This is true for the positive control sequences as well as for the putative proteins characterized through our pipeline.