. (). (view online)

Modeling | TU-Eindhoven - iGEM 2024

Modeling

Every person is unique, but more importantly,
every cancer is unique

~Lucas

On this page we describe how we integrated state-of-the-art algorithms to create a complete pipeline for predicting neoantigens from DNA and RNA data. This input data is inexpensive and often already available in a clinical setting. It works directly on real data without any manual preprocessing. Our pipeline is user-friendly and can be used by people with little bioinformatics knowledge. It was developed in several cycles as described on the engineering page. Our software is public and can be adapted to new purposes. We will show how we used this pipeline to predict neoantigens for a lung-cancer patient. Finally, we show how to integrate these predictions in our vaccine platform using various modeling tools, informing future lab work.


× 1. Patient Journey Through the Pipeline 2. Molecular Design Tools 3. A Note on Sequencing from Healthy Tissue 4. An Interview with an Oncologist 5. Frequently Asked (Research) Questions A1. Software Availability A2. Data Availability A3. Results A4. References
Hover over key terms

Patient Journey Through the Pipeline

Neoantigens are immunogenic proteins that arise due to mutations in a tumor. Since they are not present in healthy cells, they are a good target for precise immunotherapy . Not every mutated peptide is a neoantigen, only those that bind to a HLA protein, are presented on the cell surface and are recognized by CD8+ T-cells may be immunogenic . Due to the tumor microenvironment, neoantigens are often hidden from the immune system and exposure via a vaccine can help trigger an immune response . In the context of a vaccine, the neonantigen epitopes are discovered and presented on a dendritic cell's surface instead of the tumor cell's surface. This avoids the microenvironment effects and triggers an immune response . Neoantigens that are common to many tumors in the population may be used for off-the-shelf vaccines that are effective for many patients . The vision of PROMISE, however, is to also create a personalized vaccine that targets the unique neoantigens of the patient and thus is maximally effective with minimal side effects.

The classic approach to identify neoantigens is by a cDNA library screening, in which the RNA of a set of known neoantigens is selectively amplified and detected. However, this approach is time-consuming, complicated, expensive, and it cannot find novel neoantigens . More recently, machine learning has been used to predict neoantigens from whole genome sequencing (WGS) data . This has the advantage of being cheap, fast and can detect new neoantigens. Currently, the required tools are difficult to use: they do not support Windows, they require a lot of bioinformatics and domain knowledge, they are difficult to install and often only work on the command line. We aim to change this.

Patient X is a 76-year-old woman of European ancestry. In 2021 she was diagnosed with grade 3 lung adenocarcinoma. A resection procedure was performed and the resulting cells were cultured to learn about their growth behaviour. Whole genome sequencing was performed on the tumor cells as well as RNA sequencing to learn about the mutations that occurred and their expression. Let us take you through her hypothetical patient journey with the PROMISE pipeline.

Figure 1: Patient X.
My real identity is hidden
Figure 2: DNA and RNA reads from tumor cells.

Sequencing

Click the pipe for more

DNA and RNA is extracted from cultured cells. The molecules are broken down into small fragments by restriction enzymes. These fragments are then sequenced by one of several sequencing technologies, resulting in a large amount of reads. In the next step, we use computational tools to put these reads back together into a full genome and transcriptome.

The source of the data we used is described below.

The samples from our patient were sequenced with Illumina technology . The sequencing was limited to only the exome sequences (WES). The output of this process is two FASTQ files, because the sequencing is performed in two directions (paired-end) . Even though the exome represents a small part (around 1%) of the genome, these files are around 70GB in size because many overlapping reads are present.

RNA sequencing is very similar, but involves an additional step where the RNA is reverse transcribed into cDNA using an RNA-dependent DNA polymerase taken from viruses. After this, the cDNA is sequenced in the same way as the DNA. This results in transcriptome data. This is useful because it shows which genes are expressed, what the peptide sequences are since no introns are present, and the level of expression .

Figure 3: The process of alignment.
Figure 3: Variants are identified.

Preprocessing

We align the reads to a reference genome. This allows us to find the genetic differences between the patient and the reference genome. These differences are called genetic variants and are often documented in the literature. We use these literature annotations to understand the effect of the variants on the proteins. The mutated proteins will be used to generate our candidate neoantigens in a later step of the pipeline.

The DNA reads are aligned using the BWA aligner to the Ensembl GRCh38 reference genome . This tool makes use of an index to quickly find specific regions in the genome. Alignment results in a BAM file. The next step is to determine the variants using Mutect2 . This algorithm handles errors from sequencing by taking a consensus value of reads that overlap at the same position. The output of Mutect2 is a VCF file. This file also includes the DNA coverage, which is a measure of how well a region is sequenced. We then use VEP to add literature annotations of mutation effects on proteins to the variants . We limit our analysis to variants on chromosomes 6 and 17 in order to speed up our analysis and because these contain enough neoantigens. Because of the random distribution of neoantigen-inducing mutations the selection of chromosomes is arbitrary. The analysis can be trivially extended to more chromosomes.

Ideally the variants are determined in both the healthy and tumor DNA to identify tumor-specific mutations, but for patient X only a tumor sample was sequenced. We go into the impact of this in the section A Note on Sequencing from Healthy Tissue.

The RNA reads were aligned to the Ensembl build 112 transcriptome reference using HISAT2 . A lot of RNA reads overlap and the number of overlapping reads is called the depth. Based on this depth, stringtie estimates the expression level of the transcripts and genes .

We reduced 70 GB of raw data to less than 1 GB of processed data by making use of the genetic similarity between humans. Because humans are 99.9% genetically similar, we only need to store the differences. We have also reduced the overlapping reads to a consensus value, which is a more accurate representation of the actual sequences.

Figure 4: Each person has two alleles for each HLA gene.

HLA calling

The HLA type of the patient needs to be determined. HLA genes are highly variable and different HLA proteins bind differently to peptides, so the HLA type is important for predicting HLA-peptide affinity . The HLA type can be determined with standard histocompatibility tests. We determine the alleles computationally using the already available genetic data beacause this is faster, cheaper, and has near perfect accuracy . We will later use the HLA type we identified in the epitope prediction step.

The HLA class I proteins' function is to present small antigens (8-11 aa) originating inside the cell to the immune system. The most important genes are HLA-A, HLA-B and HLA-C . Over forty thousand alleles of these genes have been described . We use xHLA to determine the HLA type based on the genetic data .

The HLA class II proteins' function is to present slightly larger antigens (12-20 aa) originating from outside the cell to the immune system . HLA-II binding is more complex and more difficult to predict.

The identified alleles are shown in .

  • HLA-A∗2.1/∗23.1
  • HLA-B∗27.5/∗42.1
  • HLA-C∗1.2/∗17.1
  • DPB1∗1.1/∗2.1
  • DQB1∗3.1/∗3.1
  • DRB1∗1.3/∗11.2
The HLA alleles of patient X.

For the class I proteins this agreed with the histocompatibility test results which were available at the data source. The class II proteins were not validated independently.

Note that we limit our analysis to epitopes of an average length, 10 and 15 for HLA class I and II respectively. This limitation was because of time constraints. We still identified a sufficiently large number of epitopes. The analysis can be easily extended to more lengths.

Figure 5: Immunogenicity prediction involves predicting interactions in different steps of the immune system activation. The numbers in the images above correspond to the steps on the right.

Epitope predictions

We use different machine learning algorithms to predict the epitopes binding with HLA, presentation on the cell surface, binding with the TCR and immunogenicity. These algorithms were trained on data from sources like IEDB. We make use of pVACtools, which utilizes these algorithms to predict metrics and offers visualizations . This results in a filtered and ranked list of neoantigen epitopes from which we select our targets in the next step.

From tumor RNA data we generate a list of mutated peptides. By taking a subsequence of a peptide a candidate epitope is generated. Different algorithms are used to predict the different immune system interactions: the peptide needs to bind to the HLA class I or HLA class II protein, the peptide-HLA complex (pHLA) needs to be presented on the cell surface, the TCR protein needs to bind to the pHLA complex, and the T-cell needs to be activated. These steps are visualized on the left in .

HLA-I
  1. Peptide-HLA binding prediction: MHCflurry is used to predict the binding affinity of the epitopes to various HLA proteins. This is the most important and well understood immune system interaction. MHCflurry works with two deep neural networks that each predict different features of the interaction. It also makes use of convolutional layers to better represent the spatial structure. The algorithm was trained on known peptide-HLA pairs in the IEDB database with associated binding affinitity as measured with an assay . Some limitations of this data are described in the FAQ section.
  2. Presentation prediction: BigMHC EL, is used to predict the presentation of the pHLA complex on the cell surface. The model consists of an ensemble of seven deep learning models with 612 million parameters in total . It utilizes long-short term memory and attention blocks, in order to represent dependencies between residues across the sequence. Additionally, an anchor block is present, which puts extra emphasis on the four first and last residues of the peptide as these are of particular importance to binding. The model was trained on presented pHLA pairs from the IEDB database that were validated using mass-spectrometry.
  3. TCR and pHLA binding: There is very little data available for this step. Some algorithms for this purpose exist, such as NetTCR . However, this algorithm is not yet integrated in pVACtools. Some aspects of this step are also covered by the next step, but more research should be done to explicitly model this step.
  4. Immunogenicity prediction: BigMHC IM, is a model based on BigMHC EL, but finetuned towards immunogenicity using transfer learning. Using transfer learning addresses the problem of limited data. Data is limited since most peptides on IEDB are not immunogentic .
HLA-II
  1. Peptide-HLA binding prediction: NetMHCIIpan predicts the binding affinity of the epitopes to various HLA class II proteins. It uses an ensemble of 150 different single-layer neural networks to increase the accuracy. In general, there is less data available for HLA class II binding and it is more complex, resulting in lower accuracy. An additional advantage of this model is that a physical explanation for the results is given .
  2. Presentation prediction: NetMHCIIpan EL is the same algorithm as discussed above, since the output layer consists of both binding affinity and presentation scores .

  3. Note that because of the significant difference between the HLA class I and II proteins, different algorithms are used for binding and presentation prediction. In constrast, the same tools are used for presented pHLA complexes because this is mostly dependent on epitope sequence.

Figure 6: Epitopes are parts of neoantigens that are recognized by the immunesystem.

Resulting epitopes

We identified many neoantigen epitopes for patient X. These epitopes are filtered and ranked based on expression, coverage, binding affinity, and other metrics . The top three epitopes are shown in below. We describe how we use these neoantigens to design a vaccine in the section Molecular Design Tools.

These predictions were made entirely in silico, saving a lot of time and money compared to a cDNA screening. No manual preprocessing was needed to run this pipeline and it can easily be reused for other patients by non-experts. In the Appendix we show the full results, data sources and how to adapt the pipeline for other purposes.

In total, 1889 candidates were found of which 190 passed all filters. The top three epitopes are shown in . Besides the epitope peptide sequence in FASTA format, the gene in which the change occurred, and predicted values for several metrics are shown. Epitopes for which metrics fall outside the expected range are filtered out. A ranking is also given, the epitopes with the best combined value for the metrics are ranked the highest .

Metrics
  • IC5O (nM): A measure of the binding affinity between an epitope and the HLA protein. It represents a concentration of a competing substance required to inhibit epitope-HLA binding by 50%. A lower IC50 value indicates a stronger binding affinity. For cancer immunotherapy, an IC50 value below 500 nM is often considered a strong binder. Typically, different thresholds are specified for different HLA alleles to correct for the rarity of some alleles Additionally, you want the IC50 of the mutated epitope to be significantly lower than the IC50 of the wildtype epitope. .
  • Allele expression: RNA expression is expressed as FPKM, which is how frequently reads of a specific transcript are measured. Furthermore, VAF is the fraction of reads that align to the tumor versus the healthy DNA. Taken together, this gives us the expression of the mutated allele. A higher value means that the gene is expressed more and thus makes the epitope more likely to be presented to the immune system. A value of 2.5 is used as a filtering threshold.
  • RNA depth: Depth is simply the number of reads that align to this position. A higher value means that the gene is sequenced better and thus the expression level is more reliable.
  • Elution (%): A measure of how likely it is that the epitope is presented on the cell surface. This is a prediction based on the machine learning model. It is expressed as a percentile score where a lower score is better. For instance, a score of 1% means that the epitope is better presented than 99% of other epitopes.
  • Immunogenicity: A measure of how likely the T-cell is to be activated. It is expressed as a percentile score.

The computational approach enables us to identify many novel neoantigens unique to the patient. This means we cannot meaningfully compare the results to literature. In the FAQ section we go into more detail on how we could validate our results.

The top three epitopes for patient X.
Epitope Gene IC5O mutant (nM) Allele expression RNA depth Elution Immunogenicity
AALPEPNIFL
MRC2 45.194 41.248 26 0.72 0.55
ASIPDIMQQL
MOXD1 43.906 31.186 499 0.03 0.35
HRIQNQLQLF
ROS1 56.637 154.332 278 0.96 0.71

Molecular Design Tools

Our pipeline returns predictions for neoantigen epitopes to use in the PROMISE vaccine. Using pVACvector, we can design a DNA part that encodes the selected epitopes. pVACvector performs a couple of steps: it filters epitopes based on stability, it combines the epitopes using spacers with restriction sites, and it adds auxilary signal sequences . This results in a DNA part that can be used in the vaccine.

Method

To include this part in the delivery system, we need to couple it to our BMVs. This involves fusing the epitope DNA part to a SpyTag as shown in . In parallel, we fuse a membrane protein of the BMV to a SpyCatcher as shown in . After expression of the spytag vector we mix the product with the modified BMVs, which ensures that the fusion protein is displayed on the outside of the BMV as shown in . More information on the BMVs can be found on the project description page.

For the design of the membrane protein coupled to a SpyCatcher, it is important that the membrane proteins retain their functional domains, to enable effective display of the SpyCatcher on the outside of the BMV. Therefore, the membrane proteins that are fused to the antigen are either

  • full membrane proteins
  • truncated membrane proteins: consisting of the signal peptide and one or more transmembrane domains
  • truncated membrane proteins: consisting of the signal peptide only

In each case, it is necessary that the antigen is fused to the periplasmic side of the BMV to be presented on the outside of the membrane and lead to relevant immune cell activation caused by the vaccine.

Tools

We used several modeling tools to design relevant fusion protein constructs. We used this to develop a library of fusion proteins that is described on the parts page.

The following steps are conducted:

  1. Manual inspection: We use literature to characterize membrane proteins of E. coli and M. smegmatis that end up in BMVs and can be used to create fusion protein constructs. After this, we use PDB files that we load into PyMOL to manually inspect the proteins on their 3D structure and their C- and N-terminal ends that can be used for protein fusion.
  2. Transmembrane domain prediction: Since we want our antigen to be displayed on the outer membrane, we select proteins with transmembrane domains that span the membrane and have an access point to the periplasm. We used DeepTMHMM to predict the transmembrane domains of the protein . To check the predictions, we compared them to literature.
  3. Signal peptide prediction: Using SignalP 6.0 we predicted if the membrane protein contains a signal peptide sequence . This is needed for the protein to be transported to the membrane, and can be used to develop truncated proteins that consist of at least a signaling sequence, a transmembrane domain and a periplasmic access site.
  4. Predict fusion protein structure: Using the information above, we designed our fusion protein constructs. These constructs consists of the (truncated) membrane protein, a linker and the SpyCatcher. We used AlphaFold to predict if the fusion protein we designed folded correctly .
  5. Experimental validation: After designing the fusion proteins, experimental validation is essential to confirm their location and functionality. You can see the results of our experiments on the results page.

Finally, the protein sequence can be converted into a DNA sequence using simple conversion tools. This sequence can then be added in a plasmid for recombinant expression using Snapgene. In designing a plasmid, many considerations are taken into account, such as the promotor, antibiotic resistance, immunogenicity, etc. The plasmid can then be ordered from a company that synthesizes DNA, such as our sponsor IDT.

Conclusion

In conclusion, using modeling tools we gained a greater understanding of our parts and system as a whole. This allowed us to quickly design a vaccine in silico. In the lab protocols we use GFP to verify the loading of the BMVs. You can view the parts we designed on the parts page. Our modeling pipeline informs what the payload should be for a given patient. This informs the future steps of our projects.

Figure 7: A plasmid that fuses the epitopes part to spytag.
Figure 8: Another plasmid combines spycatcher with a membrane protein.
The resulting expressed proteins bind strongly when combined. The colors of the proteins correspond with the plasmids above.

A Note on Sequencing from Healthy Tissue

A limitation of the open source data of patient X is that only the DNA from the tumor cells was sequenced. This makes sense since the sequencing was performed for clinical purposes for which the healthy tissue was not of interest. This does mean, however, that we cannot determine with certainty which variants are tumor-specific. pVACtools accepts this data, but it may find false positives. In this section we investigate the importance of having sequencing data from healthy tissue.

Method

We entered into a data access agreement for data of patient Y, a Chinese man diagnosed with lung adenocarcinoma. The advantage is that this research dataset contains DNA and RNA data from tumor and healthy cells of the same patient, i.e., a matched sample. Because of time constraints we only perform a limited analysis on this data.

We run our pipeline on the data of patient Y with and without the healthy DNA data. In order to incorporate the matched data we use the same pipeline as described above with the following changes:

  • The DNA reads for the healthy data are aligned, as well as the tumor DNA reads.
  • We then use Mutect2 to do somatic mutation calling based on both alignments.
  • Note that pVACtools does not make use of RNA data from healthy cells since RNA is used to generate the list of mutated peptides .
  • The HLA type is determined using the healthy DNA data because we are interested in the HLA proteins on the dendritic cell.
  • Finally, pVACtools is called with the mutations and the HLA type.
Figure 9: Patient Y
Results

There are no differences in HLA type between the healthy and tumor DNA. In the FAQ section we go into more detail about what would happen if this were the case. The HLA alleles are shown in .

The top epitopes identified are the same for the matched and tumor-only samples and are shown in .

In total 1899 candidates are found in the matched sample, while only 1706 are discovered in the tumor-only sample. 1636 (86%) of the candidates overlap. Of the candidates, 234 are identified as neoantigens in the matched sample by the filtering criteria. 233 of these are found in the tumor-only sample as well. This is summarized in . This means that one neoantigen is missing in the tumor-only sample, a false negative. Additionally, a neoantigen is found in the tumor-only sample but not in the matched sample, a false positive. This epitope is in the gene UNC93A, has a IC50 of 338 nM, is ranked 231, and has unknown expression and RNA coverage.

We conclude that, for patient Y, incorporating the healthy sample affects the results. In this case it does not affect our vaccine because the false positive candidate was ranked low and is not likely to be selected. It is also important to note that false negatives are not really a problem. The tumor has many neoantigens, and we only need to target a few of them to have strong effect.


  • HLA-A∗2.1/∗11.1
  • HLA-B∗40.2/∗40.1
  • HLA-C∗3.4/∗15.2
  • DPB1∗5.1/∗14.1
  • DQB1∗3.3/∗5.2
  • DRB1∗9.1/∗16.2
HLA types for Patient Y

Diagram showing the overlap between the results of the matched and tumor-only sample.
The top three epitopes for patient Y.
Epitope Gene IC5O mutant (nM) Allele expression RNA depth Elution mutant Immunogenicity mutant
STAMYSAAGR
PSMB1 48.154 272.807 2194 0.02 0.05
FGFSSGPVHL
LAMA4 42.575 69.006 163 0.21 0.25
FLPCYGLVLL
MPDU1 36.057 39.327 193 0.01 0.04
Discussion

More research is needed to determine if tumor-only data is sufficient. When the healthy data is not available, the reference genome is used as a proxy for the healthy DNA in pVACtools. A given person has many differences from the reference genome, this is natural genetic variation and is present in healthy cells. Without a matched sample these variants are considered to be tumor mutations and this can lead to false positives. A false positive neoantigen is an issue because it can lead to autoimmune responses for the vaccine.

In general, this is not a large problem because tumor cells have many more mutations than naturally occuring variants. Also, these variants are not expected to have a high binding affinity prediction by the algorithms, but we saw from the results that we cannot rely on this. An additional safeguard would be to perform proteome similarity search, in which results are compared to the overall population of proteins to filter out candidates that are too similar to normal proteins.

The final consideration is of sample purity. When the tumor sample is contaminated by healthy cells, the DNA and RNA may not give a clear picture. Using a matched sample can help overcome this issue by identifying true mutations . However, in practice, the tumor sample is often sequenced at multiple sites to ensure that the results are reliable.

An Interview with an Oncological Surgeon

A pitfall for researchers and programmers is to get stuck in the technical possibilities and challenges. However, to really help patients with our personalized pipeline, it is important to understand the clinical context. We interviewed Dr. R.G. Orsini, an oncological surgeon at Maastricht University Medical Center, to get a better understanding of the practical challenges in the field. We paraphrased the interview below.

What are the main challenges of a personalised treatment like ours?
When a patient is diagnosed, clinical guidelines suggest that we start treatment within five weeks. We spend the first weeks running tests and scans. Any personalization steps have to be done very fast and your model has to run quickly. Additionally, this process cannot be too expensive. We use different indicators and tests to determine the likelihood that conventional treatments will work. Only for the other cases we consider a personalized treatment. Otherwise, cancer treatments would become too expensive for the healthcare system.

What do you look for in software that supports personalized treatments?
The main output should be the likelihood that a certain treatment will work. This is most important for the patient and thus their doctor as well. Your software looks promising but is more aimed at researchers. We are not as interested in how the software works internally, as long as we can check quality metrics and see what the predictions are based on.

How do you see the future of cancer treatments?
The problem with lung cancer is that we have not found a very effective treatment yet. Immunotherapy has changed the field a lot and often has an advantage over conventional treatments, but they do not work for everyone. While personalized treatments are generally more effective, I do not believe it will be feasible to do for all patients unless it becomes cheaper and faster. I think that a new treatment will come along that will be more effective for lung cancer. Perhaps it will be what you are researching.


From this interview we found a few important future directions for our model. First, we should also do a population-wide analysis to identified common neoantigens. Along with a test for compatibility, this allows us the quickly use the model for many patients. For patients that have rarer neoantigens, a personalized treatment would still be more effective. Therefore, more research is needed to make the model run faster and on a cheaper hardware. Finally, if this model is employed in practice we should create a new user interface for the doctors with different metrics.






Figure 10: Dr. R.G. Orsini, oncological surgeon at Maastricht UMC.

Frequently Asked (Research) Questions

In this FAQ section we will go into short research questions that you may have after reading the above approach.

In general a tumor does not need to have any neoantigens because tumor-driving mutations are not usually surface-expressed. However, in the case of lung cancer, often caused by environmental factors like smoking smoking, the tumor usually has a lot of mutations and thus neoantigens. This makes it a good candidate for a personalized vaccine. The same is true for melanoma, which is often caused by UV radiation. Our treatment focuses on cancer types that have neoantigens .

When healthy DNA/RNA data is available, it is possible to filter out the antigens that are present on healthy cells. This is done by pVACtools .

However, in most clinical settings only the tumor DNA/RNA is sequenced to check for the presence of certain markers. We explore whether tumor data alone is sufficient in the section A Note on Sequencing from Healthy Tissue.

pVACtools is the most widely used tool for neoantigen prediction and has been used for designing vaccines currently in clinical trials . The algorithms used by pVACtools were validated indvidually and had a high accuracy. The complete pVACtools suite was validated using sequencing data from TCGA. The tumor DNA data was from melanoma patients, which is similar to lung cancer in the number of mutations and neoantigens.

In this validation, they were able to predict 80% of the validated neoantigens. Additionally, they were able to elimate 68% of the false positives . While it is not important to predict all neoantigens, it is important to eliminate false positives to avoid auto-immune responses.

In the previous question we discussed one method of eliminating false positives when healthy DNA data is available. In the next question we discuss a method for validating the predictions experimentally.

Our pipeline returns high-confidence predictions for neoantigens. However, in developing a vaccine we may want to validate some of these predictions. Here we outline a protocol to validate if the vaccine specifically targets the epitopes on the tumor cells. This protocol could form the basis of a preclinical study but would not be used in clinical practice because it is too time-consuming and expensive.

  1. Cell cultures: Cultures are grown of both tumor and healthy cells of the patient, as well as a cell-free control.
  2. DNA/RNA sequencing: Illumina WES and RNA sequencing data are performed on a sample of the cell cultures.
  3. Pipeline: The pipeline, as outlined in patient journey through the pipeline, is run which outputs neoantigen epitope sequences.
  4. Insert design: We design a vector with the epitopes and a His-tag.
  5. Expression: We express the vector in E. Coli. We extract the expressed protein using cell lysis, do clarification and purify it using His-tag affinity chromatography.
  6. Antibody production: We immunize a mouse with our antigen and extract the polyclonal antibody from the serum.
  7. Secondary antibody production: We order anti-mouse antibodies with a conjugated dye.
  8. Cell-based ELISA: A blocking buffer ensures only specific binding, we incubate with the primary and secondary antibodies and washing. Absorbance of the dye can be measured to estimate the level of the secondary antibody.

If the neoantigen predictions are correct we expect a significant color change only in the tumor cell culture and not in the healthy cell culture or cell-free control. This indicates that the neoantigens are specifically present on the tumor cells.

The various algorithms in pVACtools were trained on data from sources like IEDB . HLA peptides are usually encoded as a pseudosequence. Instead of using the full HLA protein sequence, only certain key positions with the most information content are used. This is based on a multiple sequence align of multiple HLA alleles and focuses on the most conserved positions.

Peptides, on the other hand, can be encoded in a number of different ways. In the case of BigMHC, a one-hot encoding is used, where each amino acid is represented by a vector of length 20, with a 1 at the position of the amino acid. For MHCflurry, the amino acids sequences is padded to a fixed length and then copied to the beginning, middle and end of the sequence. This allows the model to easily attend to the beginning, middle and end of the sequence no matter the length. Additionally, the sequence is encoded numericallly using a BLOSUM62 substitution matrix, which represents the similarity between amino acids and is derived from the relative frequencies of amino acids and their substitution probabilities.

One issue with the available data in IEDB is the limited number of available data points. Additionally, there are many more examples of strongly binding peptides, than of weakly or non-binding peptides. This may bias the models towards predicting high binding affinity . Furthermore, less data on HLA-II binding affinity is available. Different algorithms used methods to counteract these biases, such as SMOTE and oversampling. The structural solution, however, is for the research community to increase the number of examples, and also the number of negative examples. This would provide a larger and more balanced dataset.

Finally, data is mostly taken from individuals of European descent. This may cause the models to work less well for people from other populations. Again, more data should be collected from more diverse populations to improve the models. Some have also suggested that the models should be trained on data from animal MHC proteins .

You can find the data we used below in the data availability section.

Our pipeline software is publicly available as discussed in the software availability section. Here you can find the instructions on how to install the software and run the pipeline with this data.

We use different algorithms in pVACtools to predict the individual steps of immune system activation. An alternative approach would be to train a single machine learning model that directly predicts the immunogenicity of a neoantigen. However, this type of data is limited. By modeling the immune system steps seperately, it is possible to make use of more data sources and thus incorporate more knowledge into the predictions .

A single aminoacid change can already make a protein a neoantigen although many such changes are too similar to normal proteins . Larger structural variants are more likely to be immunogentic, but have been studied less than single nucleotide variants (SNVs) . Currently, pVACtools does not support structural variations or copy number variations. More research is needed to include these in the pipeline.

Because of alternative splicing, we cannot always reliably predict the transcript sequence from the DNA sequence. For many genes the transcripts are known, but this is not always the case. We can predict the peptide sequence from the transcript and do not have to consider folding, because the epitopes are small. Additionally, the expression level of a neoantigen is important for the immune response.

A related note is that we use WES data instead of sequencing targeted regions because a mutation resulting in a neoantigen can be in any protein coding region. Using WGS is also possible, since the mutations in non-coding regions will be filtered out by the pipeline.

In the pipeline we use healthy DNA when available, and tumor DNA when not, to determine the HLA type of the patient. The HLA type is important because it affects the binding affinity prediction made by pVACtools. Mutations in the HLA gene are quite rare, but if this occurs this may present a problem to working of the vaccine. A change in HLA proteins may change the binding affinity to the neoantigen, and non-functional HLA proteins do not present it to the immune system at all .

This problem is decreased by that fact that we select multiple neoantigens that bind to different HLA alleles. This way, if one alleles changes, the vaccine still works. However, total loss of function for the HLA protein would prevent the vaccine from working.

In the case of patient Y, we also determined the HLA type of the tumor DNA and verified that there was no change. An analysis of the tumor HLA type compared to healthy DNA may be useful for determining if a vaccine can work. More work is needed to meaningfully incorporate this in the pipeline.

Often there are multiple tumors in a lung cancer patient, not all originating from the same parent cell . Because of this, there may not be a single shared neoantigen between them. This issue can be partially mitigated by taking multiple samples from the patient at different tumor sites to get a larger number of neoantigens. Multiple neoantigens from the output should be selected, that are present on as many tumors as possible.

Appendix

A problem with many bioinformatics tools is that they are not user-friendly. They often only offer a command line interface, require a lot of manual steps, are difficult to install, do not run on Windows, and not well documented. Additionally, each tool is aimed at a specific task and combining them is difficult. We wanted to make our pipeline more user-friendly and easier to install on platforms like Windows. Additionally, we find it important that the pipeline is reusable and extendible for other purposes by future iGEM teams. The resulting interface is simple and can be inspected by experts and non-experts alike.

Architecture

Our software architecture is visualized in . A controller executes commands in Docker containers that have dependencies from different bioinformatics tools installed. This ensures that installing is easy and that the pipeline can be run on any platform. The different containers can also run on other machines, allowing for distributed processing. This architecture makes it easy to add new tools to the pipeline, since these are often already available as Docker containers. We were inspired by our sponsor XOLLX Lab for our software architecture. Their XYNA.BIO software is also built on the idea of combining Docker containers.

You can find how we iteratively developed this pipeline on the engineering page.

Reusing and extending

The pipeline software described above is available at the iGEM Gitlab. You can also find the installation instructions, hardware requirements and documentation there. We validated the ease of installation by having a non-bioinformatician run the pipeline.

We define a simple syntax for defining pipelines that allows you to combine commands in a clear way. The controller then executes these commands in the containers and manages files.
Using our syntax, the pipeline can be visualised allowing for inspection by experts and non-experts alike. A screenshot of the webinterface is shown in as well a video showing the layout in more detail ().
Our pipeline uses standard bioinformatics tools and file formats to be maximally compatible with other projects. Additionally, we tried to conform to the GATK best practices as much as possible, since this is a widely used standard in the bioinformatics community.

Contribution

Our pipeline is available to the open source community. We hope that it will be useful for other iGEM teams and researchers. We encourage you to use it and to contribute to it. We are open to suggestions for improvements and new features.
In setting up this pipeline we encountered several bugs, things that were unclear or data format problems in the tools we used. We reported these to the developers of the tools (such as VATools and pVACtools) and they were fixed. In this way we also contributed to the open source community.

Software architecture for the PROMISE pipeline.
A screenshot of one pipeline step in the software.
An overview video without narration showing the layout of the software. Press fullscreen for the best experience.


Patient X

The WES and RNA sequencing data of patient X is available at the NCI patient derived models repository. Here you can download the raw FASTQ data as well as a processed VCF file. Demographic details about the patient are available as well as test results like HLA typing or growth curves. This data is publicly available and can be used without special permission. A limitation of this data is that no healthy DNA control data is available.

As a result of our analysis we identified some inconsistencies in the VCF files that were available. We notified the repository manager and they have since corrected the files.


We used the following files from patient X, identified by id 145666:

Patient Y

The data for patient Y can be found at the European Genome-Phenome Archive. This data is protected and can only be accessed by entering into a data transfer agreement with a research institution. We got limited access to this data which is owned by the GIS. We balanced security with the need for high computation resources by running the analyses on the Dutch SURF Research Cloud.
The advantage of this data is that it contains both tumor and healthy DNA data. However, a limitation for us was that any publication needed to be approved in advance by the GIS, which limited our time and thus how much we could analyse.


The following datasets contain files for patient Y. In this dataset the patient has the id A006:

Data access can be requested at the EGA portal. Instructions on how to download the data can be found in the software availability section.



The full neoantigen predictions for patient X can be downloaded here: results-patient-x.csv (direct download).

The full neoantigen predictions for patient Y can be downloaded here: results-patient-y-tumor.csv (direct download) and here: results-patient-y-matched.csv (direct download).



Images created with BioRender.com

Images created with OpenAI DALL-E