- Overview -
In the digital age, data security and privacy protection have become increasingly important. Our project has developed an innovative online software tool that enables the encoding of Chinese text into DNA codon sequences for concealed storage and secure transmission of information. This tool not only supports encoding Chinese characters into DNA sequences but also includes encoding rules that disguise the information at the genetic level, making it difficult to detect and decode. Additionally, our software provides functionality for analyzing the readability of mutated information and a hash value verification feature to ensure the integrity and accuracy of the data during transmission and storage.
- Highlights -
  • Innovative Chinese Encoding Scheme: Utilizes the logic of the Wubi input method to encode Chinese characters into DNA codon sequences, employing a method where every 12 base pairs encode a single Chinese character, ensuring efficient information conversion and storage.
  • Structural Disguise Functionality: Uses codons that form alpha-helices and beta-sheets to disguise encoded DNA sequences as natural proteins, making them difficult to recognize at the biological level.
  • Mutation Readability Analysis: An integrated model within the software simulates the effects of random mutations and predicts the readability of information under different mutation levels, helping users understand the mutational robustness of the data.
  • Hash Value Verification: Adds a hash value at the end of the encoded information to detect potential data distortion due to mutations, ensuring the accuracy and integrity of the information.
  • User-Friendly Interface: An intuitive web-based interface that supports multiple input and decoding modes, allowing users to easily operate and manage their information.


- Background and Technical Approach -
In today's digital age, the security and privacy of information have become increasingly critical. Traditional methods of information storage and transmission face growing threats and challenges. The rapid development of bioinformatics provides a novel approach to information security, allowing us to hide and transmit data within DNA sequences. However, existing genetic encoding methods face significant challenges when handling Chinese text, particularly in terms of efficient encoding and maintaining data integrity.

To address these challenges, we developed a Chinese encoding scheme based on DNA codons inspired by the Wubi input method. The Wubi method demonstrates an efficient character input system by decomposing Chinese characters into fundamental components (radicals) and inputting them based on these radicals. Similarly, our encoding scheme encodes each Chinese character into a set of DNA codons, utilizing the principles of the Wubi method to achieve efficient encoding of Chinese text.

In our encoding scheme, each Chinese character is encoded using 12 base pairs, ensuring efficient storage space utilization and high-density data storage. Additionally, our scheme classifies codons into two main categories: those that form alpha-helices and those that form beta-sheets, enabling the information to be disguised at the protein level. This disguise not only enhances information concealment but also makes it difficult to detect and decode at the genetic level.

Moreover, to ensure the integrity of the information during transmission and storage, we introduced a hash value verification mechanism. By adding a hash value (recording the counts of A, T, C, and G) at the end of the encoded information, we can detect if the data has been distorted due to mutations. This method effectively improves the accuracy and reliability of information retrieval.

Through these innovations, our project not only achieves efficient Chinese text encoding but also enhances information security and concealment, providing new solutions in the field of information security.
- Key Features -
Our software tool offers several unique features that make it an innovative solution for information security and concealed storage.
· 1. Chinese to Codon Encoding
Our software utilizes a novel encoding scheme based on the logic of the Wubi input method to convert Chinese characters into DNA codon sequences.

  • Encoding Rules and Process: In traditional text encoding methods, handling Chinese characters often involves a significant amount of data storage. To optimize this process, our scheme encodes each Chinese character into 12 base pairs (bp) and uses 3 bp to represent a single letter. Each letter corresponds to a radical in the Wubi input method, allowing efficient conversion of Chinese text into DNA sequences.
  • Algorithm Implementation: The system does not separate the front end from the back end. The front-end interface is built with HTML5 and CSS3, with jQuery assisting page interaction; the back end implements the server logic in Java and stores its data in MySQL. Encoding relies on a data-table projection query: associated queries across the database tables derive the required output from the stored data, using SQL JOIN statements and subqueries to improve query efficiency, reduce data redundancy, and shorten response times.
  • Practical Application Case: In the field of synthetic biology, this feature can be used to store confidential information in laboratory bacteria in a biological encoding format, making it difficult to recover the original text without the correct decoding software, even if the information is stolen.
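The encoding rule described above (3 bp per Wubi letter, 12 bp per character) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's Java implementation: the letter-to-codon table, the filler codon used for codes shorter than four letters, and the two-entry Wubi dictionary are all hypothetical, since the text does not publish the real tables.

```python
from itertools import product

# Codons used for other purposes in the text: ATG/TAA frame the disguise and
# CAA delimits the hash value; TAG and TGA are the remaining stop codons.
RESERVED = {"ATG", "TAA", "TAG", "TGA", "CAA"}
CODONS = [c for c in ("".join(p) for p in product("ACGT", repeat=3))
          if c not in RESERVED]

# HYPOTHETICAL table: letters a-z take the first 26 unreserved codons.
LETTER_TO_CODON = dict(zip("abcdefghijklmnopqrstuvwxyz", CODONS))
# HYPOTHETICAL filler codon for Wubi codes shorter than four letters.
PAD_CODON = CODONS[26]

# Tiny sample Wubi dictionary; a real tool would load a complete one.
WUBI_CODE = {"你": "wqiy", "好": "vbg"}


def encode_char(ch: str) -> str:
    """Encode one character as 4 codons (12 bp), padding short codes."""
    code = WUBI_CODE[ch]
    codons = [LETTER_TO_CODON[letter] for letter in code]
    codons += [PAD_CODON] * (4 - len(codons))
    return "".join(codons)


def encode_text(text: str) -> str:
    """Concatenate the 12 bp blocks of every character in the text."""
    return "".join(encode_char(ch) for ch in text)
```

Padding every character to a fixed 12 bp keeps character boundaries at known offsets, which is one plausible reason for the fixed-width rule; the actual padding scheme is not stated in the text.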

· 2. Codon to Chinese Decoding
  • Decoding Rules and Process: The decoding feature also follows the logic of the Wubi input method, converting DNA codon sequences back into the original Chinese text. The software uses a codon-to-radical table, employing a forward-matching algorithm to decode each input codon sequence step by step.
    For instance, if a user inputs the DNA sequence GCTGCC, the sequence will first be parsed into radicals “亻” and “尔,” which are then combined to form the Chinese character “你.”
  • Technical Implementation of Decoding: The decoding algorithm uses a reverse hash table lookup mechanism, allowing the mapping of codons to radicals with O(1) time complexity, greatly enhancing decoding speed. After decoding, the system automatically performs syntax and logical checks to ensure translation accuracy and text coherence.
  • Challenges and Solutions: During development, we discovered that some radicals have multiple possible codon encodings, adding complexity to decoding. To address this, we developed a priority algorithm that automatically selects the most likely radical combination, ensuring decoding accuracy.
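The decoding steps above can be illustrated with a small Python sketch. It assumes that "forward matching" means forward maximum matching over radical sequences, which the text does not confirm. The mappings GCT → 亻 and GCC → 尔 come from the example above; every other table entry is hypothetical.

```python
# Codon-to-radical lookup (a dict gives O(1) average lookup per codon).
# GCT and GCC follow the example in the text; GAA and GAG are HYPOTHETICAL.
CODON_TO_RADICAL = {"GCT": "亻", "GCC": "尔", "GAA": "女", "GAG": "子"}

# Radical sequences that compose a character (sample entries only).
RADICALS_TO_CHAR = {("亻", "尔"): "你", ("女", "子"): "好", ("女",): "女"}
MAX_RADICALS = 4  # 12 bp per character means at most 4 codons


def split_codons(dna: str) -> list:
    """Cut the sequence into consecutive 3 bp codons."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def decode(dna: str) -> str:
    """Map codons to radicals, then greedily match the longest known group."""
    radicals = [CODON_TO_RADICAL[c] for c in split_codons(dna)]
    chars = []
    i = 0
    while i < len(radicals):
        for n in range(min(MAX_RADICALS, len(radicals) - i), 0, -1):
            key = tuple(radicals[i:i + n])
            if key in RADICALS_TO_CHAR:
                chars.append(RADICALS_TO_CHAR[key])
                i += n
                break
        else:
            raise ValueError(f"no character matches radicals at position {i}")
    return "".join(chars)
```

With these sample tables, `decode("GCTGCC")` returns `"你"`, matching the example in the text. Preferring the longest match is one simple way to resolve ambiguity; the priority algorithm the text mentions may work differently.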
· 3. Structural Disguise Functionality
  • Disguise Strategy: Our software can disguise the encoded DNA sequences as natural proteins by selecting codons that tend to form alpha-helices and beta-sheets. This method makes the DNA sequences more difficult to detect and decipher when analyzed with biological tools, thereby enhancing the information's concealment and security.
    We use a Monte Carlo-based optimization strategy to select the best codon combination for maximizing disguise effects. The software also offers manual codon adjustment, allowing users to customize the disguise strategy according to specific needs.
  • AlphaFold and Simulation Results: By integrating the AlphaFold prediction tool, we can generate and visualize the three-dimensional protein structures corresponding to the DNA sequences. The results show that these disguised protein structures are often indistinguishable from natural ones.
  • Application Scenarios: This disguise function is particularly suitable for biological encryption and concealed storage of information, such as in cross-border transmission of sensitive information. By using DNA as a carrier, it avoids the scrutiny and interception associated with traditional electronic data transmission.
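As a rough illustration of the Monte Carlo codon selection described above, the Python sketch below samples random codon combinations and keeps the one with the highest share of structure-favoring codons. The scoring rule, the function names, and the choice of amino acids (Ala, Glu, Leu, Met as helix formers; Val, Ile, Tyr, Phe, Trp as sheet formers, following commonly cited propensity scales) are assumptions; the text does not specify the tool's actual objective.

```python
import random

# Codons of amino acids commonly regarded as strong helix formers
# (Ala, Glu, Leu, Met) and strong sheet formers (Val, Ile, Tyr, Phe, Trp).
HELIX_CODONS = {"GCT", "GCC", "GCA", "GCG", "GAA", "GAG",
                "CTT", "CTC", "CTA", "CTG", "TTA", "TTG", "ATG"}
SHEET_CODONS = {"GTT", "GTC", "GTA", "GTG", "ATT", "ATC", "ATA",
                "TAT", "TAC", "TTT", "TTC", "TGG"}


def score(codons, favored):
    """Fraction of codons that belong to the favored set."""
    return sum(c in favored for c in codons) / len(codons)


def monte_carlo_pick(options, favored, trials=200, seed=0):
    """Randomly sample one codon per position and keep the best scorer.

    `options` is a list with one list of candidate codons per encoded
    symbol (HYPOTHETICAL input format).
    """
    rng = random.Random(seed)
    best = [opts[0] for opts in options]
    best_score = score(best, favored)
    for _ in range(trials):
        candidate = [rng.choice(opts) for opts in options]
        s = score(candidate, favored)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score
```

For example, `monte_carlo_pick([["GTT", "GCT"], ["GAA", "TAT"]], HELIX_CODONS)` searches for the helix-leaning choice at each position. Random sampling scales to long sequences where exhaustive search over all combinations would be impractical.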


· 4. Mutation Readability Analysis
  • Mutation Simulation and Survey Method: In our project, we employed a straightforward approach to study the impact of mutations on information readability. Specifically, we encoded Chinese information into DNA sequences, introduced random mutations (ranging from a single nucleotide to multiple nucleotides), and then decoded these mutated DNA sequences back into Chinese text. This method allowed us to assess how mutations affected the integrity of the information.
  • Questionnaire Survey and Results Analysis: To evaluate whether the mutated information remained readable, we designed a questionnaire survey. In this survey, we presented the mutated information to the public and recorded whether participants could correctly understand or guess the original meaning. The results showed that as the number of mutations increased, the readability of the information significantly decreased. Additionally, we observed that mutations at certain key positions (such as those involving numbers) could render the entire message unreadable.
  • Results Charts: We collected extensive questionnaire data and conducted statistical analyses. The relationship between the number of mutations and readability was not linear, but readability declined progressively as mutations accumulated. This finding suggests that in applications related to information security and covert transmission, mutations can significantly impact the integrity and accuracy of the information.
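The mutation step of this procedure (encode, mutate, decode) can be sketched in Python as follows. It assumes that mutations are single-base substitutions at distinct random positions; the text does not say whether insertions or deletions were also simulated, and the function names are illustrative.

```python
import random


def mutate(dna: str, n_mutations: int, seed=None) -> str:
    """Substitute `n_mutations` distinct random positions with another base."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(dna)), n_mutations)
    seq = list(dna)
    for p in positions:
        # Pick any base other than the current one, so every hit is a change.
        seq[p] = rng.choice([b for b in "ACGT" if b != seq[p]])
    return "".join(seq)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))
```

Calling `mutate(sequence, k, seed=...)` for increasing `k` and decoding each result would produce the kind of mutated messages shown to survey participants; fixing the seed makes a given test message reproducible.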


· 5. Hash Value Verification
  • Verification Mechanism: To ensure the integrity of the information, the software appends a hash value at the end of each DNA sequence, recording the counts of each base (A, T, C, G). This verification mechanism allows users to quickly detect any potential mutations or data errors when decoding information.
  • Implementation Method: Generating and verifying the hash value requires only a single counting pass over the sequence, so it remains efficient for large data volumes. The system automatically generates and verifies the hash value, and issues a warning if a mismatch is detected, indicating possible mutations or data corruption.
  • Enhanced Security: This verification method significantly improves the accuracy and reliability of information retrieval, especially in cases of multiple transmissions and copies.
- User Interface -
    Our software tool is designed with a straightforward and intuitive user interface to make the encryption and decryption process as simple and efficient as possible. Here are the specific usage methods and steps:

    1. Encryption Process:
      • Enter Chinese Text: The user types the Chinese text to be encrypted into the "Chinese Input Part."
      • Complete Input: After entering the text, click the "Input Complete" button to confirm that the text is correctly inputted.
      • Select Structure Type: Depending on the desired secondary structure, the user can choose the type of encoding sequence. Clicking the "Alpha Helix" button will generate a sequence that forms alpha-helices easily (the text will turn red). Clicking the "Beta-sheet" button will generate a sequence that forms beta-sheets easily (the text will turn blue).
      • Start Conversion: Once all selections are made, click the "Start Compiling" button. The software will automatically convert the Chinese text into the corresponding DNA codon sequence, completing the encryption process.
    2. Decryption Process:
      • Enter DNA Sequence: Input the DNA sequence to be decrypted into the "Gene Encoding Input Part," removing the 5' ATG and 3' TAA to strip away the disguise.
      • Start Conversion: Click the "Start Compiling" button, and the software will automatically decode the DNA sequence back into Chinese text, completing the decryption process.
    3. Cross-Device and Browser Compatibility:
      • Cross-Device and Multi-Browser Support: Our software tool supports multiple versions of browsers, including Chrome, Firefox, Safari, Opera, etc., ensuring a consistent user experience across different platforms and devices such as iPads and smartphones. This allows users to conveniently perform encryption and decryption operations regardless of location.



    · Instructions for gene compilation
    - Technical Implementation -
    In our software, beyond the core encoding and decoding functionalities, we have integrated protein structure simulation and data integrity verification mechanisms to enhance information concealment and accuracy.
    · 1. Technical Background: Languages, Frameworks, and Algorithms
    The software uses an architecture that does not separate the front end from the back end. The front end is built with HTML5 and CSS3, which provide a modern web interface with rich multimedia support. jQuery, a lightweight JavaScript library, simplifies HTML document traversal, event handling, and animation, and makes the pages more interactive. The back end is written in Java, a widely used object-oriented language with good cross-platform support and a strong ecosystem for building complex applications. MySQL, a relational database management system, provides reliable data storage and management, with support for transactions and complex queries. Using SQL JOIN statements and subqueries, the lookup algorithm derives the required results from existing data, which improves query efficiency, reduces data redundancy, and improves overall response speed. This combination of technologies supports efficient development and provides good maintainability and extensibility.
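To make the JOIN-based lookup concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module in place of the Java and MySQL stack described above. The table and column names are hypothetical; the two codon rows follow the GCTGCC example for "你" given earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- HYPOTHETICAL schema: one row per radical of a character, in order.
    CREATE TABLE char_radical (ch TEXT, pos INTEGER, radical TEXT);
    -- HYPOTHETICAL schema: one codon per radical.
    CREATE TABLE radical_codon (radical TEXT PRIMARY KEY, codon TEXT);
    INSERT INTO char_radical VALUES ('你', 1, '亻'), ('你', 2, '尔');
    INSERT INTO radical_codon VALUES ('亻', 'GCT'), ('尔', 'GCC');
""")


def codons_for(ch: str) -> str:
    """Join the two tables to turn a character into its codon string."""
    rows = conn.execute(
        "SELECT rc.codon FROM char_radical AS cr "
        "JOIN radical_codon AS rc ON rc.radical = cr.radical "
        "WHERE cr.ch = ? ORDER BY cr.pos",
        (ch,),
    ).fetchall()
    return "".join(row[0] for row in rows)
```

Keeping the character-to-radical and radical-to-codon mappings in separate tables means each codon assignment is stored once, which is one way to read the text's point about reducing data redundancy.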
    · 2. AlphaFold for Protein Structure Prediction
    AlphaFold is a deep learning-based protein structure prediction tool developed by DeepMind. Our software utilizes AlphaFold to simulate the protein structures corresponding to the encrypted DNA sequences to enhance information concealment. The specific steps are as follows:

    • Sequence Generation: After the user selects the alpha-helix or beta-sheet option and completes the encoding of Chinese text into a DNA sequence, the generated DNA sequence is further processed to predict its potential protein structure.
    • Disguise Strategy: By adding an ATG start codon at the 5' end and a TAA stop codon at the 3' end of the encoded sequence, we disguise the DNA sequence as a natural coding sequence. This sequence is then submitted to AlphaFold as input.
    • Structure Prediction: AlphaFold uses its trained neural network model to predict the three-dimensional structure of the protein corresponding to the sequence. The model is trained on a vast amount of known protein structure data and can provide highly accurate structure predictions.
    • Result Analysis: After the simulation is completed, the generated protein structure can be visualized to check if it meets the expected disguise effect. This functionality greatly enhances the concealment of information, making it more challenging to detect and decipher on a biological level.
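The disguise and submission steps above can be sketched in Python. Structure predictors such as AlphaFold take an amino-acid sequence as input, so this sketch assumes the disguised DNA is translated with the standard genetic code before submission; the function names are illustrative and the AlphaFold call itself is not shown.

```python
# Standard genetic code, built from the conventional TCAG-ordered string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def disguise(payload: str) -> str:
    """Frame the encoded payload with a start codon and a stop codon."""
    return "ATG" + payload + "TAA"


def strip_disguise(sequence: str) -> str:
    """Remove the 5' ATG and 3' TAA before decoding."""
    if not (sequence.startswith("ATG") and sequence.endswith("TAA")):
        raise ValueError("sequence is not framed by ATG ... TAA")
    return sequence[3:-3]


def translate(sequence: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For the payload `GCTGCC` from the decoding example, `translate(disguise("GCTGCC"))` gives `"MAA"`, and `strip_disguise` recovers the payload, mirroring the manual removal of ATG and TAA in the decryption instructions.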
    · 3. Hash Value Verification Mechanism
    To ensure the integrity of information during transmission and storage, we have implemented a hash value verification mechanism at the end of the DNA sequence. The specific implementation steps are as follows:

    • Hash Value Generation: After encoding is completed, the software calculates the number of each nucleotide (A, T, C, G) in the DNA sequence and records this information in binary form at the end of the sequence. Specifically, the hash value added to the end follows the format CAA + binary count (representing the quantities of A, T, C, G) + CAA.
    • Data Verification: During the decoding process, the software first extracts and checks the hash value at the end of the sequence. It recalculates the number of A, T, C, and G in the sequence and compares these with the hash value.
    • Distortion Detection: If the verification process finds a mismatch between the actual nucleotide counts and the hash value, the software alerts the user that the information may have been distorted and recommends re-extraction or using alternative methods for verification. This mechanism effectively enhances the accuracy and reliability of information during transmission.
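A minimal Python sketch of this mechanism follows. The CAA + counts + CAA layout and the A, T, C, G order come from the description above; the 16-bit width per count, the mapping of bits to bases (0 as A, 1 as C), and counting only the payload are assumptions made for illustration, because the text does not say how the binary digits are written into DNA.

```python
from collections import Counter

MARK = "CAA"                         # delimiter named in the text
BITS = 16                            # HYPOTHETICAL fixed width per count
BIT_TO_BASE = {"0": "A", "1": "C"}   # HYPOTHETICAL bit-to-base mapping


def checksum(payload: str) -> str:
    """Build CAA + binary counts of A, T, C, G + CAA for the payload."""
    counts = Counter(payload)
    bits = "".join(format(counts[base], f"0{BITS}b") for base in "ATCG")
    return MARK + "".join(BIT_TO_BASE[bit] for bit in bits) + MARK


def append_checksum(payload: str) -> str:
    """Return the payload followed by its count-based hash value."""
    return payload + checksum(payload)


def verify(sequence: str) -> bool:
    """Recount the payload and compare with the stored hash value."""
    tail_length = 2 * len(MARK) + 4 * BITS
    payload, tail = sequence[:-tail_length], sequence[-tail_length:]
    if not (tail.startswith(MARK) and tail.endswith(MARK)):
        return False
    return tail == checksum(payload)
```

A substitution such as G to A changes two counts and is flagged by `verify`. A check based only on base counts cannot notice changes that leave the counts unchanged, for example two bases swapping places, so it works as a lightweight checksum for catching likely mutations, not as proof that the payload is intact.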


    Through these technical implementations, our software tool not only effectively encrypts and disguises information but also ensures its integrity and accuracy, providing users with a higher level of information security.
    - Future Work and Improvements -
    While our software tool has effectively achieved the functionalities of information encryption, disguise, and decryption, there are still several areas that can be further optimized and expanded. Here are the directions and potential improvements we plan to explore in future work:

    · 1. Algorithm Optimization and Performance Enhancement:
    • Improving Encoding and Decoding Algorithms: Although the current encoding and decoding processes are relatively efficient, we plan to further optimize these algorithms to enhance processing speed, especially when handling large-scale text or DNA sequences. By employing parallel computing and efficient data structures, we aim to reduce computation time and resource consumption.
    • Automated Optimization of Disguise Effectiveness: During the encoding disguise process, we aim to develop an automated optimization algorithm that can select the optimal codon combinations and structural disguise strategies based on different application needs, further enhancing information concealment.
    · 2. Diversified Input and Output Options:
    • Support for Encoding in Multiple Languages: Currently, our software primarily focuses on encoding and decoding Chinese. In the future, we plan to expand support for encoding texts in multiple languages, including Japanese, Korean, and other languages using non-Latin scripts, to cater to a global user base.
    • Integration of More Output Formats: Beyond the current DNA sequence output, we aim to add support for other biomolecular formats, such as RNA sequences or protein sequences, providing a wider range of output options.
    · 3. Continuous Improvement of User Experience:
    • Enhancing Interactivity of the User Interface (UI): We plan to further improve the software's user interface to make it more intuitive and interactive. For example, adding drag-and-drop text input and real-time previews of encoding results will allow users to operate and view the encoding and decoding process more conveniently.
    • Offering Personalized Settings Options: In the future, we hope to add personalized settings options, allowing users to customize the interface theme, encoding methods, output formats, and more, thereby increasing the software's flexibility and user experience.
    · 4. Enhancement of Security and Privacy Protection:
    • Developing Advanced Encryption Options: To further enhance information security, we plan to develop advanced encryption options, including multi-layer encryption and disguise strategies, ensuring that information remains well-protected in highly sensitive application scenarios.
    • Strengthening Privacy Protection Mechanisms: During information storage and transmission, we will explore new privacy protection technologies, such as steganography and zero-knowledge proofs, to enhance user privacy protection.
    · 5. Community Feedback and Collaboration:
    • Collecting User Feedback: We plan to establish a user feedback system to actively gather user opinions and suggestions during their use of the software, enabling continuous improvement of software features and user experience.
    • Collaborating with Academic and Industry Partners: We aim to collaborate with more academic institutions and industry partners to explore additional real-world application scenarios, enhancing the practicality and broad applicability of the software.
    Through this future work, we aim to keep improving the functionality and performance of the software, meet diverse user needs, and create more opportunities for innovation and application in the field of information security.
    - Conclusion -
    In this project, we developed an innovative software tool capable of encoding Chinese information into DNA sequences, while also supporting decoding and disguise functionalities. By integrating modern bioinformatics techniques and deep learning models such as AlphaFold, we have achieved effective information encryption and concealed storage. The development of this tool not only demonstrates the potential of combining biology and information science but also provides a novel solution for information security.

    Our software tool features an easy-to-use interface and efficient encoding algorithms, making the process of encryption and decryption simple and fast. Additionally, by using AlphaFold for protein structure simulation, we can disguise DNA sequences as natural proteins, enhancing information concealment. The introduction of a hash value verification mechanism further improves data integrity and accuracy during information transmission and storage.

    Looking ahead, we plan to continue optimizing and expanding the software's functionalities, including improving algorithm performance, supporting multi-language encoding, enhancing user experience, and developing advanced security options. We also aim to continuously refine our tool by gathering user feedback and collaborating with academic and industry partners, promoting its broader application in the field of information security.