AlphaFold Decoded: iGEM Education Initiative

48 minutes of content · 3 chapters · 1,934 views on YouTube

Contents

Introduction

Why AlphaFold Decoded Matters

Challenges to Overcome

Implementation Code

Guided Learning with Jupyter Notebooks

Rolling Out AlphaFold Decoded

Course Content Overview

Reception and Impact

Outlook

References


Introduction

In recent years, AI tools have reshaped the landscape of biology, offering insights once thought beyond our grasp. Among these tools, AlphaFold has been a game-changer in protein structure prediction. However, while biologists are quickly adapting to use such tools, there's a gap in learning how to build them. This is where our education project, AlphaFold Decoded, steps in. We aim to make this AI marvel more accessible by guiding learners through the process of reconstructing AlphaFold from scratch.

Why AlphaFold Decoded Matters

Transforming Biology

AlphaFold changed the game, but we’re not done yet. Imagine pushing molecular prediction further – to a point where designing enzymes or predicting the effects of mutations becomes nearly automatic. By mastering AlphaFold, you're stepping into a future where computational power drives breakthroughs in biology.

Building on AlphaFold

AlphaFold is massive, complex, and expensive, making it challenging to replicate in typical research settings. However, mastering its principles opens up exciting possibilities. For example, you could connect AlphaFold's output to language models to generate detailed descriptions of protein structures or predict their functions.

Machine Learning in Action

Even if you're not looking to advance AlphaFold itself, the skills you gain are invaluable. You’ll learn how to tackle complex, real-world problems using machine learning, understand and manipulate 3D structures, and gain practical coding experience that translates theoretical concepts into actionable solutions.

Challenges to Overcome

Mastering AlphaFold comes with genuine, often overlooked challenges – hurdles that likely explain why projects of this scale haven’t been attempted more frequently. Let’s explore the two most significant difficulties that stand in the way.

Bridging Missing Prerequisites

AlphaFold isn't just another machine learning model – it demands knowledge beyond the usual toolkit. A solid grasp of machine learning fundamentals like attention mechanisms is required, but so is understanding more specialized topics like quaternions and homogeneous coordinates in structural biology. These gaps can be daunting, especially for those without a structured learning path.
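To make one of these prerequisite gaps concrete: converting a unit quaternion into a rotation matrix is exactly the kind of geometry the course assumes you can pick up along the way. The snippet below is an illustrative sketch of this standard textbook formula, not code taken from the course material:

```python
import math
import torch

def quaternion_to_matrix(q):
    # Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix.
    w, x, y, z = q
    return torch.tensor([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# A 90-degree rotation about the z-axis: q = (cos 45°, 0, 0, sin 45°).
q = torch.tensor([math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)])
R = quaternion_to_matrix(q)
rotated = R @ torch.tensor([1.0, 0.0, 0.0])  # maps the x-axis onto the y-axis
```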

The Realization Challenge

AlphaFold’s extensive codebase can make it seem impossible to replicate as an individual, even more so when one realizes that a comprehensive understanding alone doesn’t suffice. The sheer volume of interconnected components can feel overwhelming, making it difficult to know where to start or what’s essential. This raises the question of whether all the details matter if you can’t fully commit to reproducing or applying them afterward.

The first challenge led us to carefully craft a structured content outline, which we’ll discuss in later sections as our chosen roadmap. We assume only basic Python knowledge and aim to guide you through the rest. The second challenge raises a deeper question: How complex does an AlphaFold implementation really need to be? This brings us to our next section.

Implementation Code

Quality education starts with quality code. Our goal was to provide an implementation of AlphaFold that prioritizes readability and clarity, allowing learners to focus on understanding the underlying concepts. Our version condenses the core functionality into 1,411 logical lines of code, closely following the original paper’s pseudocode, compared to 5,287 lines in AlphaFold’s key files. This direct approach from theory to practice is a hallmark of our implementation.

Following is a code comparison of selected components of AlphaFold, comparing the pseudocode from the paper against our implementation and the original AlphaFold repository. It’s important to note that the code taken from the AlphaFold repository has been cropped as best as possible to match the parts covered in the pseudocode and our implementation. However, AlphaFold’s code isn’t just bloat – it contains more efficient implementations and features like ensembling, template stacks, and detailed loss calculations. These additional features, while valuable for production, contribute to the complexity that can make learning from the code more challenging than necessary.

The AlphaFold code snippets displayed on this page are from DeepMind's AlphaFold repository. We have modified the displayed content by cropping it from the surrounding code to highlight relevant sections. The AlphaFold repository is licensed under the Apache 2.0 License, and we adhere to its terms. The code snippets labeled as AlphaFold Decoded were created by us. Additionally, the pseudocode images shown are reproductions of algorithms presented in the original AlphaFold paper by DeepMind.

The interactive comparison covers four components – Model, Evoformer, Geometry, and Input Embedder – with tabs for the paper’s pseudocode, an AlphaFold visualization, our AlphaFold Decoded implementation, and the original AlphaFold code. Shown below is the AlphaFold Decoded version of the top-level model’s forward pass:

```python
def forward(self, batch):
    """
    Forward pass for the Alphafold model.
    """
    N_cycle = batch['msa_feat'].shape[-1]
    N_seq, N_res = batch['msa_feat'].shape[-4:-2]
    batch_shape = batch['msa_feat'].shape[:-4]
    device = batch['msa_feat'].device
    dtype = batch['msa_feat'].dtype
    c_m = self.c_m
    c_z = self.c_z

    outputs = {}
    prev_m = torch.zeros(batch_shape + (N_seq, N_res, c_m), device=device, dtype=dtype)
    prev_z = torch.zeros(batch_shape + (N_res, N_res, c_z), device=device, dtype=dtype)
    prev_pseudo_beta_x = torch.zeros((N_res, 3), device=device, dtype=dtype)

    for i in range(N_cycle):
        print(f'Starting iteration {i}...')
        current_batch = {
            key: value[..., i] for key, value in batch.items()
        }

        m, z = self.input_embedder(current_batch)
        m_rec, z_rec = self.recycling_embedder(prev_m, prev_z, prev_pseudo_beta_x)
        m[..., 0, :, :] += m_rec
        z += z_rec

        e = self.extra_msa_embedder(current_batch)
        z = self.extra_msa_stack(e, z)
        del e

        m, z, s = self.evoformer(m, z)
        F = current_batch['target_feat'].argmax(dim=-1)
        structure_output = self.structure_module(s, z, F)

        prev_m = m
        prev_z = z
        prev_pseudo_beta_x = structure_output['pseudo_beta_positions']

        for key, value in structure_output.items():
            if key in outputs:
                outputs[key].append(value)
            else:
                outputs[key] = [value]

    outputs = {
        key: torch.stack(value, dim=-1)
        for key, value in outputs.items()
    }
    return outputs
```

Comparing Features and Predictions

While our implementation doesn’t include some of AlphaFold’s advanced features – such as ensembling, the use of protein structures as templates, or training capabilities (our version focuses solely on inference and does not implement loss calculation or backpropagation) – it still accomplishes its primary goal of predicting protein structures. This comparison shows the prediction of the hemoglobin beta subunit by AlphaFold (in blue, generated with ColabFold) and our implementation (in red). Despite the differences in features, our model achieves a result that captures the essential structure.

Comparison of Hemoglobin Beta Subunit Predictions by AlphaFold and Our Implementation


Guided Learning with Jupyter Notebooks

Our Jupyter Notebooks are at the heart of AlphaFold Decoded’s educational experience, offering a structured, hands-on approach to learning. We’ve divided our implementation into nine separate folders, each containing a Python file with skeleton code for the core methods and a corresponding Jupyter Notebook that guides students through the implementation process. This setup ensures that learners can follow along step-by-step, building their understanding in a structured way.

One of the biggest challenges in understanding or building complex projects is their non-linear structure. In the original AlphaFold repository, you often find yourself jumping from one file to another, tracking down various methods, which can be overwhelming. Our material addresses this issue by providing a clear, linear pathway, allowing students to focus on building their understanding without getting lost in the complexity.

Skeleton Implementations & Concrete Guidance

We provide detailed skeleton implementations of key methods, complete with concrete tips and explanations to guide students. From these foundations, they are tasked with implementing all the logic themselves, ensuring they engage deeply with the material and develop a strong grasp of how AlphaFold operates, rather than simply following pre-written code.
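As a hypothetical illustration of this style – the actual module names, signatures, and TODO text in the repository differ – a skeleton plus its filled-in solution might look like this:

```python
import torch

class LinearProjection(torch.nn.Module):
    """Hypothetical skeleton module in the style of the course material."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.linear = torch.nn.Linear(c_in, c_out)

    def forward(self, x):
        ######################################################################
        # TODO: Project x from c_in to c_out channels using self.linear.     #
        #   The last dimension of x has size c_in. Return the projection.    #
        ######################################################################
        out = self.linear(x)   # <- the line a learner would fill in
        return out

# Quick shape check, as a learner might do while working through a notebook.
proj = LinearProjection(4, 8)
out = proj(torch.zeros(2, 4))  # shape (2, 8)
```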

Automatic Error Checking

Our automatic error-checking system verifies each step of the user’s implementation, providing immediate feedback. This feature is crucial because constructing a project as complex as AlphaFold from scratch is incredibly challenging without continuous validation. By pinpointing issues early, students can progress confidently, learning from their mistakes and refining their code incrementally.
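The idea behind such a checker can be sketched as follows. This is a simplified illustration of the concept – comparing a student’s output against a stored reference tensor – and not the actual checking code from our repository:

```python
import torch

def check_implementation(fn, test_input, expected_output, atol=1e-4):
    # Run the student's function and compare shape and values
    # against a precomputed reference output.
    out = fn(test_input)
    if out.shape != expected_output.shape:
        return f'Shape mismatch: got {tuple(out.shape)}, expected {tuple(expected_output.shape)}'
    if not torch.allclose(out, expected_output, atol=atol):
        return 'Values differ from the reference output.'
    return 'Test passed!'

# Example: verify a softmax implementation against PyTorch's reference.
x = torch.tensor([[1.0, 2.0, 3.0]])
result = check_implementation(lambda t: torch.softmax(t, dim=-1),
                              x, torch.softmax(x, dim=-1))
```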

Google Colab Integration

Integration with Google Colab was proposed by one of our professors, who highlighted the difficulties students often face in setting up their development environments. Using Colab, students can bypass these challenges and work on AlphaFold Decoded without any installation hurdles. Furthermore, Colab’s cloud-based infrastructure allows students without access to high-performance hardware to still participate in running computationally demanding tasks, like protein structure prediction.

Try It Out Yourself

If you'd like to explore AlphaFold Decoded, you can try it out yourself! Simply download the tutorials folder from our GitLab repository and upload it to your Google Drive. Make sure to add the Colab extension from the Google Workspace Marketplace. When you open a notebook, give Google Colab all necessary permissions when prompted. Once everything is set up, you can double-click the first notebook, tensor_introduction.ipynb, in Google Drive to open it in Colab and work through the guided exercises.

Alternatively, if you prefer to see the full AlphaFold model in action, you can download the solutions folder from our GitLab repository, upload it to Google Drive, and open model/model.ipynb in Colab. Don’t forget to select a GPU under Runtime > Change runtime type to experience running AlphaFold on a precomputed alignment. This provides a great way to engage with AlphaFold's protein structure prediction process firsthand!

Rolling Out AlphaFold Decoded

Our journey to share AlphaFold Decoded started with offline workshops at our university, where we engaged students from biology and computer science. These sessions included lectures and interactive coding workshops, where participants completed individual lessons from our AlphaFold Decoded curriculum. The workshops provided a hands-on learning experience, combining theoretical and practical elements.

Offline Workshop for AlphaFold Decoded Lesson 1 on May 14th, 2024


As we received enthusiastic responses at international iGEM meetups, we recognized the need for a more flexible learning option, which led us to adopt hybrid workshops. However, even in these workshops, many participants expressed a desire for an asynchronous format, as fixed timelines often clashed with their academic schedules or individual pacing. This feedback pushed us to develop a more comprehensive solution, resulting in a nine-part video series on YouTube. This format allows learners worldwide to engage with the material at their own pace, ensuring that no one is left behind regardless of their other commitments.

Our YouTube Series

The transition to a video format was a significant undertaking, requiring us to produce nine detailed videos packed with rich visualizations and animations. This effort ensured that complex concepts were presented in an engaging and accessible manner, making it easier for viewers to grasp everything from the foundational elements to the more advanced aspects of AlphaFold. By leveraging this medium, we’ve been able to reach a much wider audience and offer a more immersive learning experience than traditional workshops alone could provide.

Bridging Fields: The Reach of AlphaFold Decoded

Biology students will find AlphaFold Decoded valuable as it deepens their understanding of AI-driven protein structure prediction, directly enhancing their knowledge in molecular biology. Computer science students can benefit from AlphaFold Decoded by applying machine learning concepts to real-world biological challenges, expanding their expertise in AI applications. For the iGEM community, AlphaFold Decoded provides a powerful tool for integrating protein modeling into synthetic biology projects, potentially boosting their research and competition outcomes. Young professionals can leverage AlphaFold Decoded to gain cutting-edge skills in AI and bioinformatics, positioning themselves for success in rapidly evolving interdisciplinary fields.

Course Content Overview

Our educational series follows a structured pathway, ensuring learners progress from foundational knowledge to complex AlphaFold-specific content. Here’s a visual representation of our nine-part roadmap:

1. Introduction to Tensors

In this introductory lesson on tensors, we learned that tensors are n-dimensional arrays fundamental to machine learning and deep learning models like AlphaFold. PyTorch provides efficient operations on tensors and automatic differentiation for gradient-based learning. We covered how to create and manipulate tensors in PyTorch, perform axis-wise operations, and use advanced methods like broadcasting and torch.einsum for efficient tensor operations. These concepts are essential as we move forward in implementing AlphaFold and understanding protein structure prediction.
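A few of the operations from this lesson can be sketched in a handful of lines:

```python
import torch

# Broadcasting: a (3, 1) column and a (4,) row combine into a (3, 4) grid.
col = torch.arange(3).reshape(3, 1)
row = torch.arange(4)
grid = col * row                        # shape (3, 4)

# torch.einsum: batched matrix multiplication written as index labels.
a = torch.randn(2, 3, 4)
b = torch.randn(2, 4, 5)
c = torch.einsum('bij,bjk->bik', a, b)  # shape (2, 3, 5)

# For this index pattern, einsum agrees with the @ (matmul) operator.
same = torch.allclose(c, a @ b, atol=1e-6)
```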


2. Introduction to Machine Learning

In this lesson, we introduced the fundamentals of machine learning, focusing on building a two-layer feed-forward neural network for handwritten digit classification using basic tensor operations in PyTorch. We discussed hierarchical feature extraction, non-linearities like ReLU and sigmoid, and how to optimize models using gradient descent. Additionally, we explored how the loss function helps evaluate model performance and how gradients are computed to adjust parameters during training.
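A minimal sketch of such a two-layer network, built from raw tensor operations with hypothetical sizes (784 input pixels, 128 hidden units, 10 digit classes, in the spirit of MNIST-style classification):

```python
import torch

torch.manual_seed(0)

# Parameters as plain tensors with gradients enabled.
W1 = (torch.randn(784, 128) * 0.01).requires_grad_()
b1 = torch.zeros(128, requires_grad=True)
W2 = (torch.randn(128, 10) * 0.01).requires_grad_()
b2 = torch.zeros(10, requires_grad=True)

def forward(x):
    h = torch.relu(x @ W1 + b1)   # hidden layer with ReLU non-linearity
    return h @ W2 + b2            # raw class scores (logits)

# One gradient-descent step on a random stand-in batch.
x = torch.randn(32, 784)
targets = torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(forward(x), targets)
loss.backward()
with torch.no_grad():
    for p in (W1, b1, W2, b2):
        p -= 0.1 * p.grad          # gradient-descent parameter update
        p.grad = None
```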


3. Introduction to Attention

In this lesson, we explored the concept of attention in machine learning, a crucial mechanism for working with sequential data like text or proteins. We discussed how attention determines which parts of an input sequence are most relevant, using key, query, and value vectors to compute attention scores. These scores help focus on important input elements, making it easier to generate accurate outputs. Additionally, we covered advanced topics such as self-attention, multi-head attention, and the practical applications of attention in models like AlphaFold.
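The core computation can be sketched in a few lines. This is a simplified single-head version for illustration, not the course’s exact implementation:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # Scores: similarity of each query to every key, scaled by sqrt(d)
    # and normalized with softmax so each row sums to one.
    d = q.shape[-1]
    scores = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    return scores @ v  # weighted sum of the value vectors

# Self-attention: queries, keys, and values all come from the same sequence.
torch.manual_seed(0)
seq = torch.randn(6, 16)  # 6 tokens, 16 channels
out = scaled_dot_product_attention(seq, seq, seq)  # shape (6, 16)
```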


4. Feature Extraction

This lesson introduces feature extraction in AlphaFold, where biological data is converted into machine learning-ready tensors. We focus on three key inputs: the amino acid sequence, multiple sequence alignment (MSA) data, and template structures. We explain how evolutionary information from MSA is used to predict protein structures and walk through the steps to process this data, including deletion handling, clustering, and sequence masking. The extracted features are critical inputs to AlphaFold’s Evoformer module, which will be covered in upcoming lessons.
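As a simplified illustration of the first step – turning a sequence into a tensor – here is one-hot encoding over the 20 standard amino acids. AlphaFold’s actual sequence feature uses additional classes (for example, for unknown residues), so treat this as a sketch:

```python
import torch

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'  # the 20 standard amino acids
aa_index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_sequence(seq):
    # Map each residue letter to its index, then expand to one-hot vectors.
    indices = torch.tensor([aa_index[aa] for aa in seq])
    return torch.nn.functional.one_hot(
        indices, num_classes=len(AMINO_ACIDS)).float()

features = one_hot_sequence('MVHLT')  # shape (5, 20), one row per residue
```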


5. The Evoformer

In this lesson, we explore AlphaFold's Evoformer, which is responsible for most of the model's parameters and processes evolutionary and structural information about proteins. The Evoformer consists of multiple identical blocks, with both an MSA (multiple sequence alignment) stack and a pair stack that exchange information via row- and column-wise attention mechanisms. This lesson breaks down the Evoformer's components, including its specialized triangle attention mechanism that helps encode relationships between residues, and discusses how it integrates these features to help predict protein structure.
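The row- versus column-wise pattern can be sketched with plain self-attention over an MSA-shaped tensor. This is a simplified stand-in for the gated, biased attention the Evoformer actually uses:

```python
import torch

def simple_attention(x):
    # Plain scaled dot-product self-attention over the second-to-last axis.
    d = x.shape[-1]
    scores = torch.softmax(x @ x.transpose(-2, -1) / d**0.5, dim=-1)
    return scores @ x

torch.manual_seed(0)
msa = torch.randn(4, 10, 8)   # 4 sequences, 10 residues, 8 channels

# Row-wise: each sequence attends across its own residues.
row_wise = simple_attention(msa)

# Column-wise: each residue position attends across sequences,
# implemented here by swapping the sequence and residue axes.
col_wise = simple_attention(msa.transpose(0, 1)).transpose(0, 1)
```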


6. Feature Embedding

In this lesson, we focus on feature embedding in AlphaFold, which bridges the gap between feature extraction and the Evoformer. Feature embedding involves transforming extracted features, such as MSA and target features, into learned embeddings and adding positional encodings. Additionally, we explore the Recycling Embedder, which updates the pair and MSA representations using predicted outputs from previous iterations, and the Extra MSA Stack, which refines these features similarly to the Evoformer but with optimizations for large datasets.
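One piece of the positional encoding can be made concrete: relative positional encodings built from clipped residue-index differences, shown here with 32 bins in the style of the paper’s relpos algorithm. This is a sketch of the idea, not the course’s implementation:

```python
import torch

def relative_positions(N_res, vbins=32):
    # Pairwise residue-index differences, clipped to [-vbins, vbins],
    # shifted to [0, 2*vbins], and one-hot encoded.
    idx = torch.arange(N_res)
    d = idx[None, :] - idx[:, None]
    d = d.clamp(-vbins, vbins) + vbins
    return torch.nn.functional.one_hot(d, num_classes=2 * vbins + 1).float()

relpos = relative_positions(5)  # shape (5, 5, 65)
```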


7. Geometry

This video delves into 3D geometry concepts essential for AlphaFold’s Structure Module. It covers how to convert the model's tensor outputs into atomic positions using quaternions for rotations and torsion angles for side chains. We also explore how transformations (rotations and translations) are applied to construct and globalize the atomic coordinates for amino acids. These concepts are crucial for building the protein’s 3D structure in AlphaFold.
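The basic operation – applying a rotation and translation to local coordinates to place them in the global frame – can be sketched as follows (an illustrative example, not code from the course):

```python
import torch

def apply_transform(R, t, x):
    # x: (..., 3) local positions; returns global positions R @ x + t.
    return torch.einsum('ij,...j->...i', R, x) + t

# Hypothetical example: a 180-degree rotation about the z-axis plus a shift.
R = torch.tensor([[-1.0,  0.0, 0.0],
                  [ 0.0, -1.0, 0.0],
                  [ 0.0,  0.0, 1.0]])
t = torch.tensor([10.0, 0.0, 0.0])
local = torch.tensor([[1.0, 2.0, 3.0]])
global_pos = apply_transform(R, t, local)  # [[9.0, -2.0, 3.0]]
```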


8. Structure Module

This video covers AlphaFold's Structure Module, which translates Evoformer outputs into 3D atom positions. It introduces Invariant Point Attention (IPA), which augments standard attention with terms based on the 3D distances between predicted points, so spatial proximity shapes the attention weights. The module iteratively refines the backbone's transforms and torsion angles over several rounds, with each iteration adjusting the backbone's orientation and position, ultimately predicting atom coordinates for the final protein structure.
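The distance-based part of the idea can be sketched in isolation. This is a heavily simplified illustration – the real IPA combines distance terms with standard query–key logits and learned weights:

```python
import torch

torch.manual_seed(0)

points = torch.randn(8, 3)               # one predicted 3D point per residue
dists = torch.cdist(points, points)      # pairwise distances, shape (8, 8)
logits = -dists**2                       # closer points -> larger logits
weights = torch.softmax(logits, dim=-1)  # attention weights, rows sum to 1

values = torch.randn(8, 16)
out = weights @ values                   # distance-weighted mixing of values
```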


9. Full Model

This final video of the AlphaFold implementation series brings together all the modules previously built into a complete model. The modules include the Input Embedder, Extra MSA Embedder, Recycling Embedder, Evoformer Stack, Extra MSA Stack, and Structure Module, all working in tandem to predict protein structures. The AlphaFold inference pipeline processes inputs iteratively, using previous outputs to refine predictions. The video concludes with a note on potential troubleshooting and thanks to the viewers for completing the series.


Reception and Impact

So, this is the content – it’s certainly comprehensive and covers a lot of ground. Despite the complexity and depth of the topic, we were genuinely excited to see that our videos reached a much wider audience than expected, gathering an impressive 2,920 views. This level of engagement shows there’s a real interest in diving deep into such challenging material. Even more rewarding were the comments we received from viewers, sharing their thoughts, questions, and appreciation. Here’s a look at some of the feedback that’s helped shape and motivate our journey.

"Wonderful project. Wondering if you're going to Cover AF3?"

@lialexwei509

"Nice, dude! And I don't care what they say about you recording this while being held hostage at gunpoint by DeepMind, with blood saturated with Adderall, I believe you're just really passionate about this stuff, and I'm gonna binge watch all your videos!"

@matveyshishov

"What a time to be alive!"

@ufuoma833

"Thank you so much for this amazing series! I’ve found the in-depth exploration of AlphaFold’s theory and implementation incredibly helpful. I’m curious to know if you have any plans to create a series on AlphaFold 3 in the future? I’m really interested in this field and would love to see more content on the latest developments. Thanks for all your hard work!"

@xiangtianlin8006

"That's a very comprehensive summary for the input feature of AlphaFold2, which helps clarify things a lot. Thanks very much for your work!"

@于睿-r6e

"This is wonderful. Thank you brother."

@rizbaruah752

"Wow this video made my day! Thank you!"

@fjgkjkojk

"Please finish this series - looking forward to the whole thing. Breaking down AlphaFold is going to have huge value. I have one suggestion. While you can, start off the series differently with the high-level on AlphaFold first/videos on that and a more first principled breakdown of what we're building/high-level understanding before going into the code. Right now, the video about tensors without context on much else is less interesting. Especially because things that lack context don't tap into the right motivation circuits to make people interested, whereas if you explain first, then talk about the topics, it will be far more engaging. Good luck with this channel!"

@adammajmudar889

"Can't wait for more lessons."

@cariyaputta

"Wow! I saw this video and instantly subscribed! That is great value content. Thanks for sharing. Greetings from Brazil!"

@douglasespindola5185

"Nice and simple explanation."

@zandacr0ss86

Prof. Edmund Beck, an accomplished scientist with extensive experience in both academia and industry, shared his thoughts on our videos. As a Distinguished Science Fellow at Bayer AG and an honorary professor at Technical University Dortmund, he is actively involved in the open-source initiative OpenFold. With expertise ranging from quantum theory to the life sciences, Prof. Beck offers a unique perspective on modern science communication. Here’s what he had to say about our work:

Statement from Prof. Dr. Edmund Beck

You can see the new generation of scientists at work here. This is how communication should be today - accessible and engaging, yet scientifically rigorous.

AlphaFold Decoded is more than just a project; it’s a community-driven effort to empower the next generation of scientists with the tools and knowledge to actively contribute to AI-driven breakthroughs in biology.

Outlook

As we wrap up this project, it’s clear that many participants are eager to keep going, with numerous questions already raised about the future and potential developments like AlphaFold 3. This enthusiasm reinforces our belief that there’s a real need for accessible, in-depth education in computational biology. Our goal isn’t just to share our work but to help establish a format for learning that engages and inspires others in the community. Toward this goal, we made our entire project, including the build logic of the tutorial materials and the scripts for the chapters, available on the iGEM Aachen GitLab. This kind of comprehensive education is still rare in our field, and we hope that by pioneering this approach, we can encourage others to take the lead, build on it, and create their own initiatives. Together, we can shape the future of computational biology education.

References
