Project Description | USTC-Software

Many universities in China have made substantial progress in bioinformatics and offer cutting-edge undergraduate education in the field. By comparison, USTC has room for improvement in both respects. Prot-DAP was initially developed by the 2023 iGEM-Software team as a web-based platform for protein design. On this platform, users can complete every computational modeling step of the protein sequence design workflow in a "one-stop" manner. Prot-DAP also provides intuitive visualization of key information about proteins, such as mutation-prone sites and critical residues.

In 2024, with the support of the new iGEM-Software team, we will build our new platform Mo-BASE upon the existing functionalities of Prot-DAP and introduce new features, including but not limited to a bioinformatics learning forum, undergraduate course learning assistance tools, laboratory academic interviews, and laboratory introductions. Our goal is to transform Mo-BASE into a comprehensive platform that caters to the needs of undergraduate students.

Our Website: Mo-BASE

To choose a name for this year's project, we held in-depth discussions with several professors and students specializing in multi-omics research. For many iGEMers and researchers, handling large volumes of bioinformatics data is often unavoidable. Multi-omics data analysis and integration involves numerous steps, each requiring different tools that can be time-consuming to learn and use. Our research showed that while many websites offer relevant services, it is hard to find a platform that provides a one-stop solution. Therefore, after reviewing a substantial body of literature, we decided to integrate several well-validated models into Mo-BASE, a platform rich in design links and resources that aims to help undergraduates understand and apply synthetic biology more effectively.

This section provides an overview of our optimized single-cell RNA sequencing (scRNA-seq) analysis workflow, highlighting the key steps and algorithms used to ensure high-quality results.

Step 1: Data Acquisition and Preprocessing. We begin with the acquisition and reading of scRNA-seq datasets. This analysis workflow incorporates the scDEAL model for data transfer learning, ensuring that data from various sources is unified in format. The transfer learning model scDEAL enhances the compatibility and comparability of different datasets, making them ready for downstream analysis. Additionally, the workflow preprocesses raw sequencing data by normalizing gene expression values and log-transforming them to stabilize variance.
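
As a rough illustration of the normalization and log-transformation described above, the sketch below scales each cell to a common total count and then applies log(1 + x). It is a minimal standard-library example, not the platform's actual code: the function name `normalize_and_log`, the `target_sum` of 10,000, and the toy count matrix are all assumptions made for illustration. The scDEAL transfer-learning step is not shown.

```python
import math

def normalize_and_log(counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # A cell with zero total counts stays all zeros (avoids division by zero).
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized

# Toy matrix with made-up counts: rows are cells, columns are genes.
raw_counts = [
    [10, 0, 90],
    [1, 1, 2],
    [0, 0, 0],
]
log_norm = normalize_and_log(raw_counts)
```

Scaling to a shared total removes differences in sequencing depth between cells, and the log transform compresses the range of highly expressed genes. In a Scanpy-based pipeline the same two operations are typically done with `scanpy.pp.normalize_total` and `scanpy.pp.log1p`.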

Batch Distribution

Step 2: Quality Control. Quality control (QC) is crucial to the reliability of the analysis results. We identify low-quality cells, such as those with low sequencing depth or a high proportion of mitochondrial gene reads, and filter them out. We also use Scrublet for doublet detection and removal, tuning its parameters to improve detection performance. The result is a dataset consisting predominantly of single, high-quality cells.
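
The cell-level filtering described above could be sketched as follows. This is an illustrative example only: the function name `qc_filter`, the thresholds (`min_counts=500`, `max_mito_fraction=0.2`), and the toy cells are assumptions, and the `MT-` prefix follows the human gene symbol convention for mitochondrial genes. Doublet detection is not covered here.

```python
def qc_filter(cells, min_counts=500, max_mito_fraction=0.2):
    """Return barcodes of cells with enough counts and a low mitochondrial fraction."""
    kept = []
    for cell in cells:
        total = sum(cell["counts"].values())
        if total == 0 or total < min_counts:
            continue  # low sequencing depth
        mito = sum(n for gene, n in cell["counts"].items() if gene.startswith("MT-"))
        if mito / total > max_mito_fraction:
            continue  # high mitochondrial proportion, often a sign of a damaged cell
        kept.append(cell["barcode"])
    return kept

# Toy cells with made-up barcodes and counts.
cells = [
    {"barcode": "AAAC", "counts": {"MT-CO1": 50, "ACTB": 900, "GAPDH": 300}},
    {"barcode": "CCCG", "counts": {"MT-CO1": 10, "ACTB": 40}},
    {"barcode": "GGGT", "counts": {"MT-ND1": 600, "ACTB": 400}},
]
passing = qc_filter(cells)
```

Suitable thresholds depend on the tissue and sequencing protocol, so in practice they are usually chosen after inspecting the distributions of total counts and mitochondrial fraction.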

Quality Evaluation

Step 3: Batch Effect Correction. To address batch effects introduced by different experimental conditions, we apply a trained scDML model. This method reduces systematic biases across datasets so that the data remain consistent when multiple batches are combined, yielding a cleaner and better-integrated dataset.
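
To make the idea of a batch effect concrete, the sketch below applies a much simpler technique than scDML: per-batch mean-centering, in which each gene's batch mean is replaced with its global mean. It is not the scDML algorithm and is offered only as a toy illustration of removing a systematic per-batch shift. The function name `center_by_batch` and the toy data are assumptions.

```python
def center_by_batch(expression, batches):
    """Subtract each batch's per-gene mean, then add back the global per-gene mean."""
    n_cells, n_genes = len(expression), len(expression[0])
    global_mean = [sum(row[g] for row in expression) / n_cells for g in range(n_genes)]

    # Accumulate per-batch sums and cell counts.
    sums, sizes = {}, {}
    for row, batch in zip(expression, batches):
        acc = sums.setdefault(batch, [0.0] * n_genes)
        for g, value in enumerate(row):
            acc[g] += value
        sizes[batch] = sizes.get(batch, 0) + 1

    corrected = []
    for row, batch in zip(expression, batches):
        corrected.append([
            value - sums[batch][g] / sizes[batch] + global_mean[g]
            for g, value in enumerate(row)
        ])
    return corrected

# Toy data: batch "b" is shifted by +10 on every gene relative to batch "a".
expression = [[1.0, 2.0], [3.0, 4.0], [11.0, 12.0], [13.0, 14.0]]
batches = ["a", "a", "b", "b"]
corrected = center_by_batch(expression, batches)
```

Mean-centering assumes every batch contains the same mix of cell types. When that is not true it also removes real biological differences, which is one reason dedicated integration methods are used instead.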

Batch normalization with different methods (panels: ComBat, Harmony, Scanorama)

Step 4: Dimensionality Reduction. Dimensionality reduction is performed using principal component analysis (PCA) to identify the most informative features, followed by t-SNE and UMAP to visualize the high-dimensional data in a 2D or 3D space. This step helps reveal the underlying structure of the data and facilitates clustering.
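
The PCA step above can be illustrated with a small standard-library sketch that finds only the first principal component by power iteration on the covariance matrix. The function name `first_principal_component`, the iteration count, and the toy data are assumptions. A real pipeline would use a library PCA that returns many components, and t-SNE and UMAP are not shown.

```python
import math

def first_principal_component(data, iterations=200):
    """Return (direction, scores) of the first principal component via power iteration."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Deterministic start; power iteration fails only if this vector is
    # exactly orthogonal to the leading eigenvector.
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # no variance along any reachable direction
        vec = [x / norm for x in nxt]

    # Project each centered cell onto the component.
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores

# Toy data: all variation lies along the direction (1, 2, 0).
toy = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
direction, scores = first_principal_component(toy)
```

The returned direction is the axis of greatest variance, and the scores are each cell's coordinates along it. Keeping the top few dozen such components is what makes the later t-SNE or UMAP embedding tractable.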

Dimensionality reduction (PCA first, then t-SNE or UMAP); panels: PCA, UMAP, t-SNE

Step 5: Clustering. Clustering is a core step in scRNA-seq analysis. We use the SIMLR algorithm to group cells with similar expression profiles and identify distinct subpopulations. SIMLR is well suited to the high dimensionality of single-cell data and supports accurate cell type classification.

Clustering with Leiden

Step 6: Differential Gene Expression Analysis. After clustering, we perform differential gene expression (DGE) analysis to identify genes that are differentially expressed across clusters. This step is critical for understanding the biological functions and pathways that define each cell population. The results of DGE help us further characterize cell types and their functional roles in biological processes.
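
One simple quantity behind differential expression is the per-gene log2 fold change between one cluster and all other cells, sketched below. This is an illustration only: the function name `log2_fold_changes`, the pseudocount of 1, and the toy expression values are assumptions, and no statistical test is performed. A full analysis would typically add a test such as Wilcoxon rank-sum with multiple-testing correction, for example through `scanpy.tl.rank_genes_groups`.

```python
import math

def log2_fold_changes(expression, labels, cluster, pseudocount=1.0):
    """Per-gene log2 fold change of mean expression in `cluster` versus all other cells."""
    inside = [row for row, lab in zip(expression, labels) if lab == cluster]
    outside = [row for row, lab in zip(expression, labels) if lab != cluster]
    if not inside or not outside:
        raise ValueError("both groups need at least one cell")

    result = []
    for g in range(len(expression[0])):
        mean_in = sum(row[g] for row in inside) / len(inside)
        mean_out = sum(row[g] for row in outside) / len(outside)
        # The pseudocount keeps the ratio finite when a gene is absent in one group.
        result.append(math.log2((mean_in + pseudocount) / (mean_out + pseudocount)))
    return result

# Toy normalized expression (made-up values): rows are cells, columns are genes.
genes = ["CD3E", "MS4A1", "ACTB"]
expression = [[7.0, 0.0, 5.0], [7.0, 0.0, 5.0], [1.0, 3.0, 5.0], [1.0, 3.0, 5.0]]
labels = [0, 0, 1, 1]

lfc = log2_fold_changes(expression, labels, cluster=0)
# Rank genes from most up-regulated to most down-regulated in cluster 0.
ranked = sorted(zip(genes, lfc), key=lambda pair: pair[1], reverse=True)
```

Genes with large positive values are candidate markers for the cluster, while values near zero (here the housekeeping-like third gene) indicate no difference between groups.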

Step 7: Gene Annotation and Cell Type Identification. Finally, we apply GPT-celltype for automated gene function annotation and cell type identification. This model enables detailed classification of cell types, enhancing the biological interpretation of the results. Using this tool, we can provide deeper insights into the cellular composition of the sample, contributing to the understanding of complex biological systems.

Gene Annotation and Cell Type Identification

Integrated Full Process & User-friendly Operation:

Comprehensive scRNA-seq Data Processing: The platform provides end-to-end support for processing and analyzing single-cell RNA sequencing (scRNA-seq) data, enabling users to perform tasks from data preprocessing to advanced analysis with ease.
Optimized Single-cell Analysis Pipeline: The platform is designed to guide users through a complete workflow for scRNA-seq analysis, including quality control, dimensionality reduction, clustering, and visualization.
User-Friendly Communication Platform: A built-in academic community allows users to share insights, access learning materials, and communicate with peers for collaborative learning and problem-solving.
Batch Effect Correction Tools: The platform offers various methods for correcting batch effects in scRNA-seq data, ensuring accurate and reliable results across different experimental conditions.
Interactive Visualization: Users can easily explore and visualize data with interactive plots, allowing for dynamic exploration of single-cell populations and clusters.