Engineering Success | USTC-Software

Preliminary Design

After recognizing the diversity and complexity of the platforms involved in the scRNA-seq workflow, along with the lack of a one-stop service and the steep learning curve for beginners, we set out to integrate these platforms into a single service. Our initial idea was as follows:

Step 1: Users upload raw scRNA-seq data (we also provide sample data).
Step 2: The backend processes the raw data using Cell Ranger, including quality control, normalization, and initial clustering.
Step 3: Display a summary of the processed data, highlighting key metrics and identified cell types.
Step 4: Provide an interactive visualization of the scRNA-seq data, allowing users to explore gene expression patterns across different cell types using tools like Plotly and Bokeh.
Step 5: Perform differential expression analysis and identify marker genes for specific cell clusters.
Step 6: Conduct gene set enrichment analysis using GSEA or Enrichr to identify overrepresented biological pathways and processes.
Step 7: Use PAGA for trajectory inference and pseudotime analysis to understand the developmental trajectories and transitions between different cell states.
Step 8: Generate comprehensive reports that include all analyses, visualizations, and customizable options, ready for download or further exploration.
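
The eight steps above form a linear pipeline in which each stage consumes the output of the previous one. A minimal sketch of that control flow follows. The step names, the `Step` class, and the `run_pipeline` helper are hypothetical and chosen only for illustration; each step body is a placeholder that records that it ran, not the platform's actual processing code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One stage of the workflow (hypothetical structure)."""
    name: str
    run: Callable[[Dict], Dict]


def make_step(name: str) -> Step:
    """Create a placeholder step that only records that it ran."""
    def run(state: Dict) -> Dict:
        new_state = dict(state)  # copy so earlier states are not mutated
        new_state["completed"] = state.get("completed", []) + [name]
        return new_state
    return Step(name, run)


# Hypothetical names mirroring Steps 1-8 of the preliminary design.
STEP_NAMES = [
    "upload", "preprocess", "summary", "visualize",
    "differential_expression", "enrichment", "trajectory", "report",
]


def run_pipeline(steps: List[Step], state: Dict) -> Dict:
    """Run the steps in order, passing each step's output to the next."""
    for step in steps:
        state = step.run(state)
    return state


pipeline = [make_step(name) for name in STEP_NAMES]
result = run_pipeline(pipeline, {"input": "sample_data"})
print(result["completed"])
```

Keeping every stage behind the same small interface is one way to let individual tools be swapped without touching the rest of the chain.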

The advantages of this design compared to other software are:

Higher integration and user-friendliness. Unlike other bioinformatics tools that focus on singular aspects of scRNA-seq analysis, this design covers the entire workflow from data processing to final reporting, making it accessible even to those with limited bioinformatics experience.
More valuable information is provided to users. The platform not only processes the scRNA-seq data but also offers in-depth analysis and interactive visualizations. This enables users to gain deeper insights into their data and make informed decisions based on the biological context.

Build

Tools and Software Selection

Tools and Software for scRNA-seq Analysis

Scanpy

Scanpy is our primary toolkit for analyzing single-cell RNA sequencing data. It provides efficient algorithms for large-scale data processing, including neighborhood graph construction, clustering, and embedding. Scanpy's integration with other Python libraries and its scalability make it an ideal choice for handling the high-dimensional nature of scRNA-seq datasets.
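
To make "neighborhood graph construction" concrete, here is a small sketch of a k-nearest-neighbor graph built with the Python standard library. In Scanpy this step is performed by `scanpy.pp.neighbors`. The brute-force `knn_graph` function below is a hypothetical helper, not Scanpy's implementation, and the toy coordinates are invented for illustration.

```python
import math
from typing import Dict, List, Sequence


def knn_graph(points: Sequence[Sequence[float]], k: int) -> Dict[int, List[int]]:
    """Return, for each point index, the indices of its k nearest neighbors."""
    graph = {}
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point (self excluded).
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Toy "cells" in a 2-D reduced space: two well-separated groups of three.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]

graph = knn_graph(cells, k=2)
for cell, neighbors in graph.items():
    print(cell, neighbors)
```

In this toy example every cell's neighbors come from its own group, and graph-based clustering relies on that kind of structure. Brute-force search is quadratic in the number of cells, so it only serves to show the idea on small inputs.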

Cell Ranger

Cell Ranger, developed by 10x Genomics, is used for the initial processing of raw scRNA-seq data. It aligns reads, generates feature-barcode matrices, and performs preliminary clustering and gene expression analysis, providing a solid foundation for downstream analysis.

STAR (Spliced Transcripts Alignment to a Reference)

STAR is a fast, splice-aware tool for aligning RNA-seq reads to a reference genome. Accurate alignment underpins the quality of all downstream quantification, especially for reads that span splice junctions.

UMAP (Uniform Manifold Approximation and Projection)

UMAP is a dimensionality reduction technique that aims to preserve the local structure of the data while retaining much of its global structure. In Scanpy, UMAP is frequently used for visualizing clusters and identifying cell types, making it easier to interpret high-dimensional data.

Gene Set Enrichment Analysis (GSEA) and Enrichr

For gene set enrichment analysis, we use tools like GSEA and Enrichr. These tools help identify overrepresented biological pathways and processes in differentially expressed genes, providing deeper insights into the functional significance of the observed changes.
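
The idea of an "overrepresented" pathway can be illustrated with a hypergeometric over-representation test. The sketch below is a generic textbook test written with the standard library. It is not the GSEA algorithm (which is rank-based), and it is not necessarily what Enrichr computes. The gene names, pathway names, and background size are made up for illustration.

```python
from math import comb
from typing import Dict, Iterable, List, Tuple


def hypergeom_pvalue(overlap: int, set_size: int, list_size: int, background: int) -> float:
    """P(X >= overlap) when list_size genes are drawn from `background` genes,
    of which set_size belong to the gene set."""
    max_overlap = min(set_size, list_size)
    total = comb(background, list_size)
    tail = sum(
        comb(set_size, k) * comb(background - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def enrich(de_genes: Iterable[str], gene_sets: Dict[str, List[str]],
           background_size: int) -> List[Tuple[str, int, float]]:
    """Test every gene set and return (name, overlap, p-value), smallest p first."""
    de = set(de_genes)
    results = []
    for name, members in gene_sets.items():
        member_set = set(members)
        overlap = len(de & member_set)
        p = hypergeom_pvalue(overlap, len(member_set), len(de), background_size)
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical inputs: 4 differentially expressed genes out of 20 measured genes.
de_genes = ["G1", "G2", "G3", "G4"]
gene_sets = {
    "pathway_A": ["G1", "G2", "G3", "G10"],
    "pathway_B": ["G4", "G11", "G12", "G13", "G14"],
}
results = enrich(de_genes, gene_sets, background_size=20)
for name, overlap, p in results:
    print(f"{name}: overlap={overlap}, p={p:.4f}")
```

A real analysis would also correct the p-values for multiple testing across all gene sets, which this sketch omits.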

PAGA (Partition-based Graph Abstraction)

PAGA is a method for trajectory inference and pseudotime analysis. It helps to understand the developmental trajectories and transitions between different cell states, which is particularly useful in studies of cellular differentiation and development.

Interactive Visualization Tools: Plotly, Bokeh, and Dash

To enhance user-friendliness and provide interactive visualizations, we integrate tools such as Plotly, Bokeh, and Dash. These tools allow users to explore their data interactively, zoom in on specific clusters, and customize visualizations to suit their needs.

Customization and Flexibility

Our platform is designed to be highly customizable. Users can choose from a variety of parameters and settings to tailor the analysis to their specific research questions. This flexibility ensures that the platform can adapt to a wide range of experimental designs and biological contexts.

Tools for Website Development

Back-end

Alibaba Cloud

Alibaba Cloud is a comprehensive cloud computing platform that provides a wide range of services, including cloud servers, databases, and machine learning. We use Alibaba Cloud to host our back-end applications, ensuring high availability and scalability.

Django

Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design. It takes care of much of the routine work of web development, so we can focus on application logic instead of reinventing the wheel. We use Django to build and manage our local website, providing a robust and scalable solution for our web application.

FastAPI

FastAPI is a modern, high-performance web framework for building APIs with Python 3.7+, based on standard Python type hints. Its key strengths are runtime speed, quick development, and fewer bugs. We use FastAPI to create and manage our API endpoints, which interact with our deep learning models and provide real-time results to the front-end.

Gunicorn

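Gunicorn (Green Unicorn) is a Python WSGI HTTP server for UNIX. It uses a pre-fork worker model: a master process spawns multiple worker processes, which lets it handle several requests simultaneously. We use Gunicorn to run both our Django and FastAPI applications, ensuring efficient and reliable handling of incoming requests. Because FastAPI is an ASGI rather than a WSGI application, Gunicorn has to be started with an ASGI-capable worker class (for example, the one provided by Uvicorn) to serve it.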
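
Gunicorn reads its settings from a Python configuration file. The fragment below is a sketch of what such a file might look like for an ASGI application such as FastAPI. Every value in it (address, port, timeout, worker class) is an assumption for illustration, not our deployed configuration, and the worker class requires the Uvicorn package to be installed.

```python
# gunicorn.conf.py: illustrative settings only; every value is an assumption.
import multiprocessing

# Listen on localhost only, so that a reverse proxy in front of Gunicorn
# is the sole public entry point.
bind = "127.0.0.1:8000"

# Starting point suggested in the Gunicorn documentation: (2 x cores) + 1.
workers = multiprocessing.cpu_count() * 2 + 1

# An ASGI application needs an ASGI-capable worker class.
# A WSGI application such as Django can use the default "sync" worker instead.
worker_class = "uvicorn.workers.UvicornWorker"

# Seconds a worker may stay silent before it is killed and restarted.
# Long-running analysis requests may need a larger value.
timeout = 120
```

The file is passed to Gunicorn with the `-c` option, for example `gunicorn -c gunicorn.conf.py app.main:app`, where `app.main:app` is a hypothetical module path.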

Nginx

Nginx is an open-source reverse proxy server for HTTP, HTTPS, SMTP, POP3, and IMAP protocols, as well as a load balancer, HTTP cache, and a web server (origin server). It is known for its high concurrency, high performance, and low memory usage. We use Nginx as a reverse proxy to handle incoming requests and forward them to the Gunicorn server, providing an additional layer of security and performance optimization.

scRNA-seq Testing

During the initial testing of our scRNA-seq pipeline, we found that the data preprocessing step using Cell Ranger was robust and efficient. However, the large size of the raw sequencing data required significant computational resources, which posed a challenge for users with limited hardware. To address this, we optimized the pipeline to use more efficient parameters and provided detailed documentation on how to handle large datasets.
When performing quality control (QC) checks, we noticed that some users were having difficulty interpreting the QC metrics and deciding on appropriate filtering criteria. To improve user experience, we integrated an interactive QC dashboard using Plotly, which allows users to visually explore the distribution of QC metrics and make informed decisions about data filtering.
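
The QC metrics that users most often filter on can be computed per cell from the raw count matrix. The following sketch shows three common ones (total counts, number of detected genes, and percentage of mitochondrial counts) together with a simple threshold filter. The thresholds and toy counts are invented for illustration and are far smaller than real values. The `MT-` prefix is the usual naming convention for human mitochondrial genes. The function names are hypothetical and this is not the dashboard's actual code.

```python
from typing import Dict


def qc_metrics(counts: Dict[str, int], mito_prefix: str = "MT-") -> Dict[str, float]:
    """Compute basic QC metrics for one cell from its gene -> count mapping."""
    total = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    pct_mito = 100.0 * mito / total if total else 0.0
    return {"total_counts": total, "n_genes": n_genes, "pct_mito": pct_mito}


def passes_qc(m: Dict[str, float], min_genes: int = 3, min_counts: int = 10,
              max_pct_mito: float = 20.0) -> bool:
    """Apply illustrative thresholds; real cut-offs depend on the dataset."""
    return (m["n_genes"] >= min_genes
            and m["total_counts"] >= min_counts
            and m["pct_mito"] <= max_pct_mito)


# Toy data: three cells with counts for four genes.
cells = {
    "cell_1": {"ACTB": 12, "GAPDH": 8, "CD3E": 5, "MT-CO1": 2},   # looks healthy
    "cell_2": {"ACTB": 1, "GAPDH": 0, "CD3E": 0, "MT-CO1": 1},    # nearly empty
    "cell_3": {"ACTB": 4, "GAPDH": 3, "CD3E": 3, "MT-CO1": 15},   # high mito share
}

metrics = {name: qc_metrics(c) for name, c in cells.items()}
kept = [name for name, m in metrics.items() if passes_qc(m)]
print(kept)
```

Plotting the distribution of each metric before choosing cut-offs, as an interactive dashboard allows, avoids hard-coding thresholds that may not suit a given tissue or protocol.
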
In the clustering and dimensionality reduction steps, we initially used UMAP and t-SNE. While these methods worked well, some users reported that the visualization results were not always intuitive, especially for complex datasets. To enhance interpretability, we added support for additional visualization techniques such as PHATE and TriMap, which provide alternative perspectives on the data.
During gene set enrichment analysis, we observed that the output from GSEA and Enrichr was sometimes difficult to interpret, particularly for users without a strong background in bioinformatics. To make the results more accessible, we developed a custom visualization tool that generates interactive plots and tables, highlighting the most significant pathways and providing context for the enriched gene sets.
For trajectory inference, we tested several tools including PAGA and Monocle. Users reported that while PAGA provided good results, the learning curve was steep for new users. To simplify the process, we created a user-friendly interface that guides users through the steps of trajectory inference, with pre-defined settings and clear explanations of each parameter.
When integrating multiple datasets, we encountered issues with batch effects, which can confound downstream analyses. To mitigate this, we implemented a batch correction step using tools like Harmony and MNN (Mutual Nearest Neighbors). We also provided detailed guidelines on how to assess and correct for batch effects, ensuring that the integrated data is reliable and consistent.
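
The core idea behind MNN-based correction is to find pairs of cells, one from each batch, that are each other's nearest neighbors. Such pairs are assumed to represent the same cell state, so the differences between them estimate the batch effect. The sketch below covers only this pair-finding step, using brute-force search on invented 2-D coordinates. The function names are hypothetical, and the full method then derives locally weighted correction vectors from the pairs, which is omitted here.

```python
import math
from typing import List, Sequence, Set, Tuple

Point = Sequence[float]


def nearest(query: Sequence[Point], reference: Sequence[Point], k: int) -> List[Set[int]]:
    """For each query point, the indices of its k nearest reference points."""
    result = []
    for q in query:
        order = sorted(range(len(reference)), key=lambda j: math.dist(q, reference[j]))
        result.append(set(order[:k]))
    return result


def mnn_pairs(batch_a: Sequence[Point], batch_b: Sequence[Point], k: int = 1) -> List[Tuple[int, int]]:
    """Return (i, j) where a[i] and b[j] are within each other's k nearest neighbors."""
    a_to_b = nearest(batch_a, batch_b, k)
    b_to_a = nearest(batch_b, batch_a, k)
    return [(i, j)
            for i, neighbors in enumerate(a_to_b)
            for j in sorted(neighbors)
            if i in b_to_a[j]]


# Toy data: batch B contains two of batch A's three cell states, slightly shifted.
batch_a = [(0.0, 0.0), (4.0, 4.0), (10.0, 0.0)]
batch_b = [(0.5, 0.3), (4.4, 4.2)]

pairs = mnn_pairs(batch_a, batch_b, k=1)
print(pairs)
```

The third cell of batch A has no counterpart in batch B and therefore forms no mutual pair, which is how the approach tolerates cell populations that are present in only one batch.
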
Finally, we conducted a usability test with a group of wet lab researchers. They provided valuable feedback on the overall user experience, suggesting improvements in the documentation, error messages, and the clarity of the visualizations. Based on their input, we made several enhancements, including more detailed tutorials, clearer error messages, and improved visualizations that are more intuitive and informative.

Learn

Iterations

Iterations in scRNA-seq Data Processing: Enhance Efficiency and Accuracy

Initially, our scRNA-seq pipeline relied heavily on Cell Ranger for data preprocessing. While robust, this approach was resource-intensive and not always user-friendly for those with limited computational resources. To address these issues, we implemented several iterations:

- Optimized Parameters: We fine-tuned the parameters of Cell Ranger to balance between computational efficiency and data quality, making it more accessible for users with varying hardware capabilities.
- Alternative Preprocessing Tools: We integrated support for alternative tools like Alevin and STARsolo, which offer more flexibility and can handle different types of sequencing data more efficiently.
- Automated QC Checks: We developed an automated quality control (QC) module that provides detailed reports and recommendations, simplifying the process of identifying and filtering low-quality cells and genes.

Iterations in Integration: Enhance Data Integration and Accessibility

To improve the integration of multiple datasets and make the platform more user-friendly, we introduced the following enhancements:

- Batch Effect Correction: We incorporated advanced batch effect correction methods such as Harmony and MNN (Mutual Nearest Neighbors), ensuring that integrated datasets are consistent and reliable.
- Public Dataset Integration: We added a feature to directly import and integrate public datasets from repositories like GEO and ArrayExpress, streamlining the process of combining external data with user-generated data.
- Interactive Data Exploration: We developed an interactive data exploration dashboard using Plotly and Dash, allowing users to visualize and explore their data in real-time, without the need for extensive programming knowledge.

Iterations in User-Friendliness: Enhance Output Information and Visualization

To provide users with more intuitive and informative outputs, we made the following improvements:

- Enhanced Visualizations: We introduced a variety of visualization techniques, including UMAP, t-SNE, PHATE, and TriMap, each with customizable settings, to help users better understand the structure and relationships within their data.
- Interactive Heatmaps and Plots: We implemented interactive heatmaps and plots for gene expression, differential expression, and pathway enrichment analysis, allowing users to easily identify and interpret key biological insights.
- Customizable Reports: We developed a feature to generate customizable reports, enabling users to export their results in various formats (PDF, HTML, etc.) with tailored content and visualizations.

Improvement of Build

Build a User-Friendly GUI Interface

Our interface design aims to minimize cognitive load for the user. Through a clear and straightforward layout, consistent color coding, and intuitive icons, users can quickly become familiar with the platform upon their first visit. We also provide detailed hints and documentation to ensure users never feel lost or confused during their experience.

Suitable for a Wide Range of Users, from Beginners to Seasoned Researchers

Recognizing the diversity of user backgrounds, skills, and experiences, our interface was rebuilt to offer a high degree of customization. Whether a beginner or an expert, everyone can adjust the tool's parameters and display according to their needs and preferences. For beginners, we provide guided workflows and default settings, while advanced users have access to a wide range of configurable options.

Integrated Documentation and Tutorials

We have integrated comprehensive documentation and step-by-step tutorials directly into the platform. These resources cover everything from basic data upload and processing to advanced analysis and interpretation, ensuring that users can maximize the utility of the platform regardless of their prior experience.

Community and Support

To foster a community of users and provide ongoing support, we have established a forum and a dedicated support team. Users can share their experiences, ask questions, and receive timely assistance, creating a collaborative environment that enhances the overall user experience.
