Dry Lab

At NTU Dry Lab, we set out to alter the wild-type T7 RNA polymerase (RNAP) sequence and generate alternate forms of the protein.

Project Overview

We recognized early on that our goal was ambitious and far too large to accomplish in a single program. To have any hope of success, we had to break the project down into simpler parts, so we structured our approach around three key components:

1. Build and Train a Model


  • Ancestral sequence data was used as the training set for the model we built.
  • We evaluated this model using Huber loss, reaching a minimum loss of 0.1542. While this value is encouraging for an alpha version, it falls short of our target of 0.1000. To extend our assessment, we adapted our code to generate alternate RNAP sequences, enabling comparative analyses against the ancestral data.

  • The model showed promise and would likely improve with additional time and a larger data pool. To keep the project robust, however, the team opted not to rely on an alpha version of our own model, and turned to ProGen2 instead.
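For readers unfamiliar with the metric, the Huber loss mentioned above can be sketched in a few lines. This is only an illustration of the standard definition, not our training code: the function name `huber_loss`, the threshold `delta=1.0` and the toy numbers are assumptions chosen for the example.

```python
def huber_loss(predictions, targets, delta=1.0):
    """Mean Huber loss over paired predictions and targets.

    Errors up to `delta` are penalised quadratically, larger errors
    linearly, which makes the loss less sensitive to outliers than
    mean squared error. `delta=1.0` is an assumed value.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")

    total = 0.0
    for predicted, target in zip(predictions, targets):
        error = abs(predicted - target)
        if error <= delta:
            total += 0.5 * error * error            # quadratic region
        else:
            total += delta * (error - 0.5 * delta)  # linear region
    return total / len(predictions)


# Toy values only: one small error (0.5) and one large error (2.0).
example_loss = huber_loss([0.0, 2.0], [0.5, 0.0])
print(example_loss)  # 0.8125
```

A lower value means the predictions sit closer to the targets, which is why a result of 0.1542 was still above our 0.1000 target.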
2. Generate Alternate RNAP Sequences


  • ProGen2 is a far more robust and capable model. Using ProGen2 through HuggingFace, we brought the second component of the project to life: generating alternative RNAP sequences.
  • The five alternate RNAP sequences produced are as follows:
        1. MNTINIAKNDFSDIELAAIPFNTLADHYGERLAREQLALEHESYEMGEARFRKMFERQLKAGEVADNAAAKPLITTLLPKMIA
        SGKTTWFEEVKAKRGKRPTAFQFLQEIKPEAVAYITIKTTLACLTSADNTTVQAVASAIGRAIEDEARFGRIRDLEAKHFKKNV
        EEQLRLIKEHVYKKAFMQVVEADMLSKGLLGGEAWSSWHKEDSIHVGVRCIEMLIESTGMVSLHRQNAGVVGQDSETIELA
        PEYAEAIATRAESLLDISPMFQPCVVPPKPWTGITGGGYWANGRRPLALVRTHSKKALMRYEDVYMPEVYKAINIAQNTAW
        KINKKVLAVANVITKWKHSSFKAIPAIEREELPMKPEDIDMNPEALTAWKRAAAAVYRKDKARKSRRISLEFMLEQANKFANH
        KAIWFPYNMDWRGRVYAVSMFNPDAPKTTKGLLTLAKGKPIGKEGYYWLKIHGANCAGVDKVPFPERIKFIEENHENIMAC
        AKSPLENTWWAEQDSPFCFLAFCFEYAGVQSFVKSYNCSLPLAFDGSCSGIQHFSAMLRDEVGGRAVNLLPSETVQDIYGI
        VAKKVNEILQADAINGTDNEVVTVTDENTGEISEKVKKLLVKLAGQWLAYGVTRSVTKRSVMTLAYGSKEFGFRQQVLEDTI
        QPAIDSGKGLMFTQPNQAAGYMAKLIWESVSVTVVAAVEAMNWARRRGKLLAAEVKDKKTGEILRKRCAVHWVTPDGFPV
        WQEYKKPIQTRLNLMFLGQFRLQPTINTNKDSEIDAHKQESGIAPNFVHSQIIEKSRKTVVWAHEKYGIESFALIHDSFGTIPA
        DAANLFKAVRETMVDTYESCDVLADFYDQFADQLHESQLDKMPALPAKGNLNLRDILESDSEAVE
    
        2. MNTINIAKNDFSDIELAAIPFNTLADHYGERLAREQLALEHESYEMGEARFRKMFERQLKAGEVADNAAAKPLITTLLPKMIE
        AGKTPWFEEVKAKRGKRPTAFQFLQEIKPEAVAYITIKTTLACLTSADNTTVQAVASAIGRAIEDEARFGRIRDLEAKHFKKNV
        EEQLAKLEKHVYKKAFMQVVEADMLSKGLLGGEAWSSWHKEDSIHVGVRCIEMLIESTGMVSLHRQNAGVVGQDSETIEL
        APEYAEAIATRAADVLAISPMFQPCVVPPKPWTGITGGGYWANGRRPLALVRTHSKKALMRYEDVYMPEVYKAINIAQNTA
        WKINKKVLAVANVITKWKHKEGLSIPAIEREELPMKPEDIDMNPEALTAWKRAAAAVYRKDKARKSRRISLEFMLEQANKFAN
        HKAIWFPYNMDWRGRVYAVSMFNPSELKETKGLLTLAKGKPIGKEGYYWLKIHGANCAGVDKVPFPERIKFIEENHENIMAC
        AKSPLENTWWAEQDSPFCFLAFCFEYAGVQLVKKGYNCSLPLAFDGSCSGIQHFSAMLRDEVGGRAVNLLPSETVQDIYGI
        VAKKVNEILQADAINGTDNEVVTVTDENTGEISEKVKEVAVDLAGQWLAYGVTRSVTKRSVMTLAYGSKEFGFRQQVLEDTI
        QPAIDSGKGLMFTQPNQAAGYMAKLIWESVSVTVVAAVEAMNWRTAAAKLLAAEVKDKKTGEILRKRCAVHWVTPDGFPV
        WQEYKKPIQTRLNLMFLGQFRLQPTINTNKDSEIDAHKQESGIAPNFVHSQSSSGSRKTVVWAHEKYGIESFALIHDSFGTIP
        ADAANLFKAVRETMVDTYESCDVLADFYDQFADQLHESQLDKMPALPAKGNLNLRDILESDSGDLY
    
        3. MNTINIAKNDFSDIELAAIPFNTLADHYGERLAREQLALEHESYEMGEARFRKMFERQLKAGEVADNAAAKPLITTLLPKMIR
        RLEDGWFEEVKAKRGKRPTAFQFLQEIKPEAVAYITIKTTLACLTSADNTTVQAVASAIGRAIEDEARFGRIRDLEAKHFKKNV
        EEQLKKRLKHVYKKAFMQVVEADMLSKGLLGGEAWSSWHKEDSIHVGVRCIEMLIESTGMVSLHRQNAGVVGQDSETIEL
        APEYAEAIATRAASLVRISPMFQPCVVPPKPWTGITGGGYWANGRRPLALVRTHSKKALMRYEDVYMPEVYKAINIAQNTA
        WKINKKVLAVANVITKWKHGEKKTIPAIEREELPMKPEDIDMNPEALTAWKRAAAAVYRKDKARKSRRISLEFMLEQANKFAN
        HKAIWFPYNMDWRGRVYAVSMFNPDSPATTKGLLTLAKGKPIGKEGYYWLKIHGANCAGVDKVPFPERIKFIEENHENIMAC
        AKSPLENTWWAEQDSPFCFLAFCFEYAGVQFGASGYNCSLPLAFDGSCSGIQHFSAMLRDEVGGRAVNLLPSETVQDIYG
        IVAKKVNEILQADAINGTDNEVVTVTDENTGEISEKVKELIDKLAGQWLAYGVTRSVTKRSVMTLAYGSKEFGFRQQVLEDTI
        QPAIDSGKGLMFTQPNQAAGYMAKLIWESVSVTVVAAVEAMNWSSLRAKLLAAEVKDKKTGEILRKRCAVHWVTPDGFPV
        WQEYKKPIQTRLNLMFLGQFRLQPTINTNKDSEIDAHKQESGIAPNFVHSQLVALARKTVVWAHEKYGIESFALIHDSFGTIP
        ADAANLFKAVRETMVDTYESCDVLADFYDQFADQLHESQLDKMPALPAKGNLNLRDILESDVSEVV
    
        4. MNTINIAKNDFSDIELAAIPFNTLADHYGERLAREQLALEHESYEMGEARFRKMFERQLKAGEVADNAAAKPLITTLLPKMID
        AGIVEWFEEVKAKRGKRPTAFQFLQEIKPEAVAYITIKTTLACLTSADNTTVQAVASAIGRAIEDEARFGRIRDLEAKHFKKNVE
        EQLKEAQKHVYKKAFMQVVEADMLSKGLLGGEAWSSWHKEDSIHVGVRCIEMLIESTGMVSLHRQNAGVVGQDSETIELA
        PEYAEAIATRALLLRAISPMFQPCVVPPKPWTGITGGGYWANGRRPLALVRTHSKKALMRYEDVYMPEVYKAINIAQNTAWK
        INKKVLAVANVITKWKHGLLKGIPAIEREELPMKPEDIDMNPEALTAWKRAAAAVYRKDKARKSRRISLEFMLEQANKFANHK
        AIWFPYNMDWRGRVYAVSMFNPRTREFTKGLLTLAKGKPIGKEGYYWLKIHGANCAGVDKVPFPERIKFIEENHENIMACAK
        SPLENTWWAEQDSPFCFLAFCFEYAGVQIPIPKYNCSLPLAFDGSCSGIQHFSAMLRDEVGGRAVNLLPSETVQDIYGIVAK
        KVNEILQADAINGTDNEVVTVTDENTGEISEKVKEAKDALAGQWLAYGVTRSVTKRSVMTLAYGSKEFGFRQQVLEDTIQPA
        IDSGKGLMFTQPNQAAGYMAKLIWESVSVTVVAAVEAMNWRKSKAKLLAAEVKDKKTGEILRKRCAVHWVTPDGFPVWQE
        YKKPIQTRLNLMFLGQFRLQPTINTNKDSEIDAHKQESGIAPNFVHSQAAALARKTVVWAHEKYGIESFALIHDSFGTIPADAA
        NLFKAVRETMVDTYESCDVLADFYDQFADQLHESQLDKMPALPAKGNLNLRDILESDGADGL
    
        5. MNTINIAKNDFSDIELAAIPFNTLADHYGERLAREQLALEHESYEMGEARFRKMFERQLKAGEVADNAAAKPLITTLLPKMIK
        ELSKKWFEEVKAKRGKRPTAFQFLQEIKPEAVAYITIKTTLACLTSADNTTVQAVASAIGRAIEDEARFGRIRDLEAKHFKKNV
        EEQLRALGAHVYKKAFMQVVEADMLSKGLLGGEAWSSWHKEDSIHVGVRCIEMLIESTGMVSLHRQNAGVVGQDSETIEL
        APEYAEAIATRAAAIVRISPMFQPCVVPPKPWTGITGGGYWANGRRPLALVRTHSKKALMRYEDVYMPEVYKAINIAQNTAW
        KINKKVLAVANVITKWKHLSPLIIPAIEREELPMKPEDIDMNPEALTAWKRAAAAVYRKDKARKSRRISLEFMLEQANKFANHK
        AIWFPYNMDWRGRVYAVSMFNPSPELRTKGLLTLAKGKPIGKEGYYWLKIHGANCAGVDKVPFPERIKFIEENHENIMACAK
        SPLENTWWAEQDSPFCFLAFCFEYAGVQGALKAYNCSLPLAFDGSCSGIQHFSAMLRDEVGGRAVNLLPSETVQDIYGIVA
        KKVNEILQADAINGTDNEVVTVTDENTGEISEKVKDADIILAGQWLAYGVTRSVTKRSVMTLAYGSKEFGFRQQVLEDTIQPA
        IDSGKGLMFTQPNQAAGYMAKLIWESVSVTVVAAVEAMNWLSLRSKLLAAEVKDKKTGEILRKRCAVHWVTPDGFPVWQE
        YKKPIQTRLNLMFLGQFRLQPTINTNKDSEIDAHKQESGIAPNFVHSQSLSSARKTVVWAHEKYGIESFALIHDSFGTIPADAA
        NLFKAVRETMVDTYESCDVLADFYDQFADQLHESQLDKMPALPAKGNLNLRDILESDDELAA
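The generated sequences differ from one another only in short windows, so a position-by-position comparison is a quick way to see where they diverge. The sketch below is illustrative rather than the analysis we ran: the helper names are hypothetical, and it assumes the two sequences have the same length (sequences of different lengths would need a proper alignment first). The two fragments are taken from the start of sequences 1 and 2 above.

```python
def substitutions(reference, variant):
    """List (position, reference residue, variant residue) for each mismatch.

    Positions are 1-based. Assumes both sequences have equal length,
    so no alignment is performed.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [
        (index + 1, ref_residue, var_residue)
        for index, (ref_residue, var_residue) in enumerate(zip(reference, variant))
        if ref_residue != var_residue
    ]


def percent_identity(reference, variant):
    """Percentage of positions at which the two sequences agree."""
    if not reference:
        raise ValueError("sequences must not be empty")
    mismatches = len(substitutions(reference, variant))
    return 100.0 * (len(reference) - mismatches) / len(reference)


# Short fragments spanning the first variable window of sequences 1 and 2.
fragment_1 = "KMIASGKTTWFEEV"
fragment_2 = "KMIEAGKTPWFEEV"

print(substitutions(fragment_1, fragment_2))
# [(4, 'A', 'E'), (5, 'S', 'A'), (9, 'T', 'P')]
print(round(percent_identity(fragment_1, fragment_2), 1))  # 78.6
```

Run over the full sequences, the same comparison would list every residue at which a generated variant departs from a chosen reference.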
                

3. Produce Alternate Protein Models


Using these alternate sequences, we could carry out the final component of our plan. We built a program that takes the generated sequences and predicts the protein structure of each. The code could also run a docking step and a molecular dynamics (MD) simulation, modelling 5 ns of interaction between the protein and the targeted DNA sequence in a 1-nm water cube.
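To give a sense of scale for a 5 ns run, the number of integration steps follows directly from the timestep. The short sketch below assumes a 2 fs timestep, a common choice in MD but not a value stated for our simulation, and the function name is hypothetical.

```python
FEMTOSECONDS_PER_NANOSECOND = 1_000_000


def md_step_count(duration_ns, timestep_fs):
    """Number of integration steps needed to cover `duration_ns`."""
    if timestep_fs <= 0:
        raise ValueError("timestep must be positive")
    return round(duration_ns * FEMTOSECONDS_PER_NANOSECOND / timestep_fs)


# 5 ns of simulated time at an assumed 2 fs timestep.
print(md_step_count(5, 2))  # 2500000
```

Halving the timestep doubles the step count, which is one reason the simulated duration has to be chosen with compute time in mind.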

Fig 1: End results of the protein models for the 5 alternate RNAP sequences generated.