ChromoSearch Pipeline

A pipeline for identifying chromoproteins & pigment synthesis enzymes in organisms from genomic sequences

Identification of Chromoproteins & Pigments: A pipeline


Several methods exist for identifying a potential chromoprotein, each with its own advantages and disadvantages. One such method is protein sequencing, which often involves purification steps as well as sensitive mass-spectrometric methods. While these methods are effective for protein identification, they can be cumbersome, and the specific method depends on the properties of the protein of interest. Pigments may also be difficult to identify because the properties of small molecules differ, so their separation methods must vary accordingly.

An alternative approach that can be equally effective is to analyze genomic data to predict potential protein sequences. This method involves sequencing target organisms known to possess chromoproteins. Gene prediction software is then used to predict candidate genes in the sequenced genome, and the candidate genes are translated into protein candidates. These are then compared to large databases of known chromoproteins or known pigment-producing enzymes (see footnote 1). This, in turn, may yield high-scoring hits that can be used to identify related homologs. With the significant drop in genomic sequencing costs, this method of protein sequence prediction offers a more affordable and faster approach to identifying and characterizing various proteins.

However, to convert the next-generation sequencing (NGS) data into protein candidates, many different programs need to be run sequentially, in what is known as a pipeline.
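As a rough illustration of such a chain, the sketch below builds the command lines for a gene-prediction step and a protein search, then filters the tabular search output by E-value. It assumes the Prodigal and NCBI BLAST+ `blastp` command-line tools; the file names, the function names `build_commands` and `filter_hits`, and the choice of tabular output (`-outfmt 6`) are illustrative assumptions, not the pipeline's actual code.

```python
from typing import List, Tuple


def build_commands(genome_fasta: str, db_name: str,
                   evalue: float = 0.05) -> List[List[str]]:
    """Return command lines for gene prediction and protein search.

    The commands are only constructed here, not executed.
    """
    proteins = "putative_proteins.faa"  # hypothetical intermediate file
    hits = "blast_hits.tsv"             # hypothetical output file
    prodigal = ["prodigal", "-i", genome_fasta, "-a", proteins,
                "-o", "genes.gbk"]
    blastp = ["blastp", "-query", proteins, "-db", db_name,
              "-evalue", str(evalue), "-outfmt", "6", "-out", hits]
    return [prodigal, blastp]


def filter_hits(tabular_text: str,
                max_evalue: float = 0.05) -> List[Tuple[str, str, float, float]]:
    """Keep (query, subject, evalue, bitscore) rows at or below the threshold.

    Expects the default 12-column BLAST tabular layout, where the E-value is
    column 11 and the bit score is column 12. Rows are returned best first.
    """
    kept = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue <= max_evalue:
            kept.append((cols[0], cols[1], evalue, bitscore))
    return sorted(kept, key=lambda hit: hit[2])
```

Each command list could be passed to `subprocess.run(cmd, check=True)` in order, so that a failure in one program stops the chain before the next one starts.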

How does it work?


Sequenced genomic data (saved as .fasta) acts as the input for the pipeline. Using Prodigal, the pipeline identifies potential protein-coding sequences and translates them into putative proteins. Depending on the database of choice, these putative proteins are then compared with known chromoproteins, pigment-producing enzymes or other proteins using a pBLAST search. Putative protein sequences with sufficiently low E-values are then re-evaluated against the database using the Smith-Waterman local alignment algorithm. To further assess the plausibility of each local alignment, a normalized score is computed, equal to the alignment score divided by the number of amino acids in the putative protein, and reported alongside the protein's theoretical mass and length. These scores can then be compared to those of the positive and negative controls to assess the plausibility of the identified putative proteins. Furthermore, a Gumbel distribution is fitted to the normalized scores, since it models the distribution of extreme values and captures the tail behavior of the data. This is critical when assessing the likelihood of rare or outlier events, such as a positive hit for a candidate protein.
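The local alignment and normalization described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: it assumes a simple match/mismatch scoring scheme with a linear gap penalty (a real protein search would typically use a substitution matrix such as BLOSUM62 with affine gaps), and it assumes the normalized score is the raw alignment score divided by the length of the putative protein. The function names and default parameters are hypothetical.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Uses the Smith-Waterman recurrence with a linear gap penalty and keeps
    only two rows of the scoring matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def normalized_score(score: float, protein_length: int) -> float:
    """Divide an alignment score by the number of amino acids (assumed definition)."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return score / protein_length
```

Dividing by length keeps long proteins from outranking short ones simply because they have more residues available to align.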

Graph of the pipeline
The figure shows the steps of the pipeline used.

Step by step explanation

  1. Identification of putative genes and conversion to putative proteins (.fasta file):
    • Purpose: The pipeline begins by taking a .fasta file as input, which contains the complete genome sequence of the environmental strain. The gene-prediction tool Prodigal, which scores candidate genes with a dynamic programming algorithm, is then used to identify possible open reading frames and coding sequences in the genome. Once the coding sequences are identified, they are translated from DNA into protein sequences. This is done for two main reasons. First, because the genetic code is degenerate (several codons encode the same amino acid), working with protein sequences simplifies the process by reducing the number of comparisons required. Second, proteins are the ultimate products of gene expression, making them the central focus of our investigation into their role in coloration.

  2. pBLAST Search:
    • Purpose: After translating the DNA into protein sequences, the putative protein sequences are compared against a local database of known chromoproteins or pigment-producing enzymes. This is done using a tool called pBLAST (Protein Basic Local Alignment Search Tool). The pBLAST search identifies, by sequence similarity, which proteins in our environmental strain resemble known proteins involved in coloration. pBLAST is used in conjunction with the Smith-Waterman algorithm to reduce the running time. Because pBLAST seeds its alignments from short word matches, it is less computationally demanding and takes considerably less time. It is intended to filter out most of the poorly scoring hits, allowing for a more focused evaluation of higher-scoring proteins using the Smith-Waterman method later on.

  3. Result Filtering and Sorting:
    • Purpose: The results from the pBLAST search are then filtered and sorted based on their significance. This is done using the E-value, which is the number of matches with an equal or better score that would be expected to occur by chance in a database of the same size. By setting an E-value threshold of 0.05, we keep only matches that are unlikely to have arisen by chance, helping us focus on the most promising putative proteins.

  4. Smith-Waterman Alignment:
    • Purpose: For the protein sequences that showed significant similarity in the pBLAST search, a more detailed comparison is performed using the Smith-Waterman algorithm. This method computes a local alignment between sequences in the database and the putative proteins. Unlike the heuristic pBLAST search, it is guaranteed to find the optimal local alignment under the chosen scoring scheme, giving a more reliable measure of how similar the sequences really are. This step is particularly useful for validating the potential of the identified proteins as chromoprotein candidates.

  5. Calculation of molecular weight, length and normalized score:
    • Purpose: Once the scores have been calculated using the Smith-Waterman algorithm, each score is divided by the number of amino acids in the putative protein to give the normalized score. The theoretical molecular weights and lengths of our candidate proteins are also calculated from the predicted primary structure. These biochemical properties can be used alongside experimental data obtained in the lab to further assess and validate the potential of the chromoproteins. In our lab, a native gel electrophoresis was run in conjunction with this pipeline, and the information gathered from it could be used to rule out or strengthen our belief in certain putative proteins.

  6. Calculating statistical outliers using a Gumbel Distribution:
    • Purpose: To gain a deeper statistical perspective on our results, a Gumbel distribution is fitted to the normalized scores. This extreme-value distribution is commonly used to find statistical outliers and to quantify how surprising a given sample is. Here, an E-value is assigned to each hit.

  7. Final Comparison and Analysis:
    • Purpose: The final step involves comparing the results from both the pBLAST and Smith-Waterman alignments, along with the normalized score and mass. The calculated E-value may also be assessed. By doing so, we can identify the most likely candidates for the color-giving proteins in the environmental strain. These results can then be further validated or used as a basis for experimental follow-up, as was done in our group.
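Step 6 can be illustrated with a short sketch. The fitting method used in the pipeline is not stated here, so this example assumes a method-of-moments fit of a Gumbel (maximum) distribution and reports the upper-tail probability of a score. The function names `fit_gumbel` and `gumbel_p_value` are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores: Sequence[float]) -> Tuple[float, float]:
    """Estimate Gumbel location (mu) and scale (beta) by the method of moments.

    For a Gumbel distribution, mean = mu + gamma * beta and
    variance = (pi * beta) ** 2 / 6, which is inverted here.
    """
    beta = stdev(scores) * math.sqrt(6) / math.pi
    mu = mean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(score: float, mu: float, beta: float) -> float:
    """Return P(X >= score) under the fitted Gumbel distribution."""
    z = (score - mu) / beta
    # 1 - exp(-exp(-z)), written with expm1 for accuracy in the far tail
    return -math.expm1(-math.exp(-z))
```

A small upper-tail probability means the normalized score lies far out in the right tail of the fitted distribution, which is the sense in which a hit counts as an outlier.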

Results: Making sense of our data


To make sense of the results, positive and negative controls were run for reference, using the default settings for each step. The positive control was established by removing a protein sequence from the existing database; searched against the remaining database, it should ideally (see footnote 2) yield a high score. The negative control was instead implemented using proteins known to be different in both sequence and function. In this case, hemoglobin was used. A FASTA file consisting of the positive control along with many negative controls was then run against the database to put the scores obtained from our actual samples into context. As may be discerned from the graph, the normalized scores of the positive controls ranged from 1.1 to 5. A protein should therefore be expected to score somewhere between the negative and the positive controls if it is to be deemed a positive protein candidate. It should be noted, however, that this score alone is insufficient to determine whether a specific protein is responsible for the color of a particular bacterium. It can, however, help rule out many sequences and guide bioinformaticians in the right direction. It is equally important to note that identifying a candidate protein as a chromoprotein in the analyzed genome does not necessarily imply that it is transcribed and/or translated in the organism of interest.
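The construction of the positive control can be sketched as a simple hold-out on the database FASTA file. This is a minimal illustration under assumptions: the helper names `parse_fasta` and `hold_out` are hypothetical, and how the pipeline actually selects and removes the control sequence is not specified here.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) records."""
    records: List[Record] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def hold_out(records: List[Record], index: int) -> Tuple[Record, List[Record]]:
    """Remove one record to use as the positive-control query.

    Returns the held-out record and the database without it, so the control
    cannot trivially match itself.
    """
    if not 0 <= index < len(records):
        raise IndexError("index out of range")
    query = records[index]
    remaining = records[:index] + records[index + 1:]
    return query, remaining
```

The held-out record is then searched against the reduced database like any other putative protein, so its score reflects similarity to the remaining known proteins rather than an identical self-match.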

After the positive and negative controls had been passed through the pipeline, the assembled genomic data was processed. Each genomic sample passed through the pipeline in two iterations, one for the chromoprotein database and another for the pigment-producing enzyme database. We used the default settings as specified in the code available at our GitLab repository.

Two candidate chromoproteins identified through ChromoSearch


As outlined above, the ChromoSearch pipeline enables the rapid identification of potential chromoproteins in bacterial genomes. This functionality was used to validate the lab findings related to environmental strains. For P91, S. faecium, our pipeline identified two proteins with homology to Deoxyribodipyrimidine photo-lyase from A. nidulans (UniProt ID: PHR_SYNP6) and Cryptochrome DASH from S. lycopersicum (UniProt ID: CRYD_SOLLC). These matches had normalized scores of 1.52 (p = 0.062) and 1.44 (p = 0.062), respectively. Both proteins bind the cofactor Flavin Adenine Dinucleotide (FAD), known for its yellow color. Notably, both candidates have a mass of approximately 50 kDa, consistent with the results from semi-native SDS-gel experiments. Taken together, this supports the possibility that these proteins are responsible for the observed color in the strain.

A similar result was observed for strain P350, M. glutamicibacter. ChromoSearch identified a single hit with a normalized score of 1.96 (p = 0.001), again linked to Deoxyribodipyrimidine photo-lyase from A. tumefaciens. This protein also binds the FAD cofactor, making it a strong candidate for the vibrant yellow color observed in the strain.

Two candidate pigment-producing enzymes identified through ChromoSearch


One of the most interesting pigment-producing enzymes identified by our pipeline is a putative protein with a normalized score close to 1.84 and a corrected p-value of roughly 0.002. This putative protein, identified in the genome of P350, scored well against 4,4'-diaponeurosporen-aldehyde dehydrogenase, a pigment-producing enzyme known to participate in the biosynthesis of the yellow-orange carotenoid staphyloxanthin. Because the bacteria are yellow-orange, this candidate was at first considered likely to be responsible for the color. Results from experimental work in our lab, however, show that the color most likely originates from a protein and not a pigment. Even so, this remains one of the strongest contenders for the colors identified in this particular strain.

Another interesting putative protein candidate, identified in the genome of P298, also scored well against 4,4'-diaponeurosporen-aldehyde dehydrogenase. This is interesting because the color of the P298 strain is similar to that of the previously mentioned P350 strain, and both score highly against the same pigment-producing enzyme. This might mean that the two share the same coloring mechanism, owing their color to a staphyloxanthin-like pigment.

Footnotes


1. For chromoproteins, sequences were retrieved using (keyword:KW-0157) AND (reviewed:true) AND (taxonomy_id:2). For pigment-producing enzymes, the search used (go:0046148) AND (taxonomy_id:2). Data accessed from UniProt, July 2024.

2. However, the pipeline is far from perfect. BLASTP compares proteins by sequence similarity alone. Consequently, proteins that are biologically similar yet have low sequence identity may go unnoticed by our pipeline and get ruled out early. This applies to the positive control sequences as well as to the putative proteins characterized through our pipeline.