Model

The heavy-chain fibroin protein from caddisfly silk is long and highly repetitive, which makes it difficult to model with current protein structure prediction algorithms such as AlphaFold. Faced with this challenge, our team adopted a novel strategy of partial modeling, using AlphaFold2, AlphaFold3, and RFdiffusion. These neural network tools enabled us to model specific segments of the protein. We then compared the resulting models in PyMOL to visualize the regions where the predicted structure varied from model to model.
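A visual comparison like this can be backed with a simple number per residue. The sketch below is not the procedure we ran in PyMOL; it is a minimal illustration that assumes two models of the same segment are available as PDB-format text with matching residue numbering. It compares internal C-alpha distances, so no superposition step is needed. The function names `ca_coords` and `per_residue_difference` are hypothetical.

```python
import math

def ca_coords(pdb_text):
    """Return {residue_number: (x, y, z)} for C-alpha atoms in PDB-format text."""
    coords = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB ATOM record: atom name in columns 13-16,
        # residue number in columns 23-26, x/y/z in columns 31-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resnum = int(line[22:26])
            coords[resnum] = (
                float(line[30:38]),
                float(line[38:46]),
                float(line[46:54]),
            )
    return coords

def per_residue_difference(model_a, model_b):
    """For each residue present in both models, average the absolute change
    in its C-alpha distance to every other shared residue."""
    shared = sorted(set(model_a) & set(model_b))
    result = {}
    for i in shared:
        diffs = [
            abs(math.dist(model_a[i], model_a[j]) - math.dist(model_b[i], model_b[j]))
            for j in shared
            if j != i
        ]
        result[i] = sum(diffs) / len(diffs) if diffs else 0.0
    return result
```

Residues with a high value sit in regions where the two models disagree, which are the regions worth inspecting in PyMOL first.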

One of the primary challenges in our work was combining these partial models into a cohesive representation of the entire protein. We considered addressing this by modeling combinations of the protein’s repeated sections and comparing them to similar proteins from other caddisfly species. We also considered using homology modeling to improve lower-quality segments with models from closely related species. If a partial model had promising secondary structures, we could load it into molecular simulation software such as GROMACS to adjust the parts relative to one another. Our partial modeling approach allowed us to identify beta sheets in the repeated sections and uncover a double alpha helix at the C-terminus, structures crucial for connecting with the light chain. By going beyond the crystallography-solved structures typically used in the field, our modeling efforts provided new insights and set a precedent for studying large, complex proteins.

To develop the model, we used protein sequences from various caddisfly species and relied on the confidence measurements reported by AlphaFold and RFdiffusion to assess the accuracy of different sections of the protein. These measurements were essential for evaluating the model’s reliability, though their quality varied between tools. AlphaFold2, for instance, often yielded less reliable measurements than AlphaFold3, and RFdiffusion was even more inconsistent in some cases. This disparity likely arose because these tools were trained to predict complete protein structures in their natural biological contexts, whereas our approach modeled isolated parts. We segmented the sequences into functional domains based on primary structure research, but full validation of the model remains a challenge because the protein is difficult to express.
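To make the per-section assessment concrete, here is a minimal sketch of how a confidence score could be averaged over each segment. It assumes a PDB-format model in which the per-residue confidence (pLDDT) is stored in the B-factor column, as AlphaFold does in its PDB output. Other tools and file formats may store confidence differently. The function names and the segment names in the usage note are hypothetical, not our actual domain labels.

```python
def residue_confidence(pdb_text):
    """Map residue number -> value of the B-factor column on its C-alpha atom.

    Assumption: the B-factor column (PDB columns 61-66) holds the per-residue
    confidence score, as in AlphaFold PDB output.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores

def segment_means(scores, segments):
    """Average the confidence over each named (start, end) residue range.

    Ranges are inclusive. A segment with no scored residues maps to None.
    """
    means = {}
    for name, (start, end) in segments.items():
        values = [scores[i] for i in range(start, end + 1) if i in scores]
        means[name] = sum(values) / len(values) if values else None
    return means
```

For example, `segment_means(residue_confidence(text), {"repeat_1": (1, 40), "c_terminus": (41, 80)})` would return one average per segment, which makes it easier to compare the same segment across tools or species.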

Beyond structural modeling, the project also considered several other promising directions. One idea was to implement an annotator designed to predict repeats within biological sequences, even when those repeats contain a high degree of variation and fragmentation. This tool would help identify and characterize complex repetitive regions more efficiently. Another key objective was to find the kinase gene responsible for phosphorylating the heavy-chain fibroin protein, a modification critical for its functional properties. We used AUGUSTUS, a gene predictor, to identify gene regions in the species Atopsyche davidsoni. We then identified which of those gene regions contained arthropod kinase regions.
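The repeat annotator remained an idea, so the following is only a sketch of one simple starting point, not a tool we built. It scans a sequence for windows that match a known motif within a given number of substitutions (Hamming distance). It tolerates variation but not insertions or deletions, so fragmented repeats would need an alignment-based method instead. The function names, the motif, and the toy sequence are all hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_approximate_repeats(sequence, motif, max_mismatches=1):
    """Return (start, matched_text, mismatches) for each non-overlapping
    window that matches the motif within max_mismatches substitutions."""
    hits = []
    k = len(motif)
    i = 0
    while i + k <= len(sequence):
        window = sequence[i:i + k]
        d = hamming(window, motif)
        if d <= max_mismatches:
            hits.append((i, window, d))
            i += k  # greedy: skip past the accepted repeat
        else:
            i += 1
    return hits

# Toy example: one exact copy and one copy with a single substitution.
hits = find_approximate_repeats("AAGSGSGAAGSGTGAA", "GSGSG", max_mismatches=1)
print(hits)  # [(2, 'GSGSG', 0), (9, 'GSGTG', 1)]
```

The greedy skip keeps reported repeats from overlapping, which suits tandem repeats but can miss a better match that starts inside an accepted window.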

Our approach to modeling serves as a valuable example for others seeking to study large, complex proteins. We documented our process thoroughly in a Notion database, which included detailed information about each sequence, the corresponding species, domain names, visual outputs, and the certainty metrics for each result. This careful documentation allowed for comprehensive analysis and can serve as a framework that other teams can replicate for similar modeling efforts. We found Notion particularly useful for its intuitive design, collaborative features, and ability to manage various data types in an organized format. Additionally, we used PyMOL to track visualizations systematically, ensuring that our modeling process was consistently recorded. A key strength of our approach is its accessibility: by using open-source software, we created a zero-cost, easily replicable solution for other teams. Although computational modeling cannot match the reliability of experimental 3D structure identification, it offers a viable and efficient method for studying the 3D structures of large proteins that are otherwise difficult to analyze.

How to Interpret the Models

The species name is listed at the top.

The specific motif or region of the sequence is labeled, allowing different models to be compared across species.

The first set is based on AlphaFold2 and the second on AlphaFold3, which came out halfway through our modeling work.

The different tags correspond to confidence levels: how confident the model is that the protein folds into this specific configuration.

Generally, we are looking for secondary structures such as alpha helices and beta sheets. Some predicted folds lack these, but this does not mean the real protein has no structure, only that current modeling tools cannot yet predict it.

Note that if you model the same repeats yourself, you should expect slight variation in your results.
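As a reading aid for the confidence tags, the sketch below maps a per-residue pLDDT score (0 to 100) to a label. It assumes the tags follow the bands used by the AlphaFold Protein Structure Database (above 90, 70 to 90, 50 to 70, below 50). The exact labels on our models may differ, and the handling of scores that fall exactly on a boundary is a choice made here for illustration. The function name `confidence_band` is hypothetical.

```python
def confidence_band(plddt):
    """Map a pLDDT score (0-100) to a confidence label.

    Thresholds follow the AlphaFold Protein Structure Database bands;
    scores exactly on a boundary fall into the lower band here.
    """
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"

for score in (95.2, 81.0, 63.5, 34.0):
    print(score, confidence_band(score))
```

Regions labeled low or very low are the ones where a missing helix or sheet says more about the limits of the prediction than about the protein itself.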
