Preliminary Design
After recognizing the diversity and complexity of the platforms involved in the scRNA-seq workflow, the lack of a one-stop service, and how unfriendly existing tools are to beginners, we set out to integrate these platforms into a single service. Our initial idea was as follows:
- Step 1: Users upload raw scRNA-seq data (we also provide sample data).
- Step 2: The backend processes the raw data using Cell Ranger, including quality control, normalization, and initial clustering.
- Step 3: Display a summary of the processed data, highlighting key metrics and identified cell types.
- Step 4: Provide an interactive visualization of the scRNA-seq data, allowing users to explore gene expression patterns across different cell types using tools like Plotly and Bokeh.
- Step 5: Perform differential expression analysis and identify marker genes for specific cell clusters.
- Step 6: Conduct gene set enrichment analysis using GSEA or Enrichr to identify overrepresented biological pathways and processes.
- Step 7: Use PAGA for trajectory inference and pseudotime analysis to understand the developmental trajectories and transitions between different cell states.
- Step 8: Generate comprehensive reports that include all analyses, visualizations, and customizable options, ready for download or further exploration.
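The step sequence above can be pictured as an ordered list of stages that pass a shared state along. The following is only a minimal sketch of that idea: `run_pipeline` and the three placeholder stages are hypothetical names introduced for illustration, not the platform's actual code, and real stages would call the tools named in each step.

```python
from typing import Callable, Dict, List, Tuple

# A stage receives the shared state and returns the (possibly extended) state.
Stage = Callable[[Dict], Dict]


def run_pipeline(stages: List[Tuple[str, Stage]], context: Dict) -> Dict:
    """Run the stages in order and record which ones completed."""
    state = dict(context)  # copy so the caller's dict is not modified
    completed = []
    for name, stage in stages:
        state = stage(state)
        completed.append(name)
    state["completed"] = completed
    return state


# Placeholder stages (hypothetical); each one only annotates the state.
def upload(state: Dict) -> Dict:
    state["raw"] = state.get("input_path", "sample_data")
    return state


def preprocess(state: Dict) -> Dict:
    state["matrix"] = f"counts({state['raw']})"
    return state


def summarize(state: Dict) -> Dict:
    state["summary"] = f"summary of {state['matrix']}"
    return state


STAGES = [("upload", upload), ("preprocess", preprocess), ("summarize", summarize)]
result = run_pipeline(STAGES, {"input_path": "demo.fastq"})
print(result["completed"])
```

Keeping the stages in a plain list makes it easy to add, skip, or reorder steps without touching the runner.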
The advantages of this design compared to other software are:
- Higher integration and user-friendliness. Unlike other bioinformatics tools that focus on singular aspects of scRNA-seq analysis, this design covers the entire workflow from data processing to final reporting, making it accessible even to those with limited bioinformatics experience.
- More valuable information is provided to users. The platform not only processes the scRNA-seq data but also offers in-depth analysis and interactive visualizations. This enables users to gain deeper insights into their data and make informed decisions based on the biological context.
Build
Tools and Software Selection
Tools and Software for scRNA-seq Analysis
Scanpy
Scanpy is our primary toolkit for analyzing single-cell RNA sequencing data. It provides efficient algorithms for large-scale data processing, including neighborhood graph construction, clustering, and embedding. Scanpy's integration with other Python libraries and its scalability make it an ideal choice for handling the high-dimensional nature of scRNA-seq datasets.
Cell Ranger
Cell Ranger, developed by 10x Genomics, is used for the initial processing of raw scRNA-seq data. It aligns reads, generates feature-barcode matrices, and performs preliminary clustering and gene expression analysis, providing a solid foundation for downstream analysis.
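For orientation, a typical `cellranger count` invocation can be assembled as shown below. This is a minimal sketch under stated assumptions: the helper `cellranger_count_args` and all paths are hypothetical, and the exact set of required flags differs between Cell Ranger releases, so the 10x Genomics documentation for the installed version should be checked. `--localcores` and `--localmem` (in GB) cap the resources a run may use, which matters on modest hardware.

```python
from typing import List, Optional


def cellranger_count_args(
    run_id: str,
    transcriptome: str,
    fastqs: str,
    sample: Optional[str] = None,
    localcores: Optional[int] = None,
    localmem: Optional[int] = None,
) -> List[str]:
    """Build (but do not run) an argument list for `cellranger count`."""
    args = [
        "cellranger",
        "count",
        f"--id={run_id}",                # name of the output directory
        f"--transcriptome={transcriptome}",  # path to the reference package
        f"--fastqs={fastqs}",            # directory containing the FASTQ files
    ]
    if sample is not None:
        args.append(f"--sample={sample}")
    if localcores is not None:
        args.append(f"--localcores={localcores}")  # limit CPU cores
    if localmem is not None:
        args.append(f"--localmem={localmem}")      # limit memory, in GB
    return args


# Hypothetical paths, for illustration only.
cmd = cellranger_count_args(
    "demo_run", "/refs/GRCh38", "/data/fastqs", sample="pbmc", localcores=8, localmem=32
)
print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where Cell Ranger is installed.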
STAR (Spliced Transcripts Alignment to a Reference)
STAR is a fast, accurate, splice-aware aligner for mapping RNA-seq reads to a reference genome. Correctly aligning reads that span splice junctions is essential for reliable gene-level quantification in every downstream step.
UMAP (Uniform Manifold Approximation and Projection)
UMAP is a dimensionality reduction technique that preserves local neighborhood structure while retaining much of the global structure of the data. In Scanpy, UMAP is frequently used to visualize clusters and support cell type identification, making high-dimensional data easier to interpret.
Gene Set Enrichment Analysis (GSEA) and Enrichr
For gene set enrichment analysis, we use tools like GSEA and Enrichr. These tools help identify overrepresented biological pathways and processes in differentially expressed genes, providing deeper insights into the functional significance of the observed changes.
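To make "overrepresented" concrete, the sketch below computes an over-representation p-value with the hypergeometric test, the same family of test (Fisher's exact test) that Enrichr-style analyses are built on. It is an illustration only: the function names, gene sets, and background size are made up, every query gene is assumed to belong to the background, and no multiple-testing correction is applied. GSEA proper uses a rank-based running-sum statistic instead, which is not shown here.

```python
from math import comb
from typing import Dict, Iterable, List, Tuple


def hypergeom_sf(k: int, M: int, n: int, N: int) -> float:
    """P(X >= k) when drawing N genes from M, of which n belong to the gene set."""
    upper = min(n, N)
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / comb(M, N)


def enrich(
    query: Iterable[str], gene_sets: Dict[str, Iterable[str]], background_size: int
) -> List[Tuple[str, int, float]]:
    """Return (set name, overlap size, p-value) sorted by ascending p-value."""
    query_set = set(query)
    results = []
    for name, members in gene_sets.items():
        member_set = set(members)
        overlap = len(query_set & member_set)
        p_value = hypergeom_sf(overlap, background_size, len(member_set), len(query_set))
        results.append((name, overlap, p_value))
    return sorted(results, key=lambda r: r[2])


# Toy example: 20 background genes, two gene sets of 5 genes each.
gene_sets = {
    "pathway_A": ["g1", "g2", "g3", "g4", "g5"],
    "pathway_B": ["g6", "g7", "g8", "g9", "g10"],
}
marker_genes = ["g1", "g2", "g3", "g15"]  # 3 of 4 fall in pathway_A
ranked = enrich(marker_genes, gene_sets, background_size=20)
for name, overlap, p_value in ranked:
    print(f"{name}: overlap={overlap}, p={p_value:.4f}")
```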
PAGA (Partition-based Graph Abstraction)
PAGA is a method for trajectory inference and pseudotime analysis. It helps to understand the developmental trajectories and transitions between different cell states, which is particularly useful in studies of cellular differentiation and development.
Interactive Visualization Tools: Plotly, Bokeh, and Dash
To enhance user-friendliness and provide interactive visualizations, we integrate tools such as Plotly, Bokeh, and Dash. These tools allow users to explore their data interactively, zoom in on specific clusters, and customize visualizations to suit their needs.
Customization and Flexibility
Our platform is designed to be highly customizable. Users can choose from a variety of parameters and settings to tailor the analysis to their specific research questions. This flexibility ensures that the platform can adapt to a wide range of experimental designs and biological contexts.
Tools for Website Development
Back-end
scRNA-seq Testing
- During the initial testing of our scRNA-seq pipeline, we found that the data preprocessing step using Cell Ranger was robust and efficient. However, the large size of the raw sequencing data required significant computational resources, which posed a challenge for users with limited hardware. To address this, we optimized the pipeline to use more efficient parameters and provided detailed documentation on how to handle large datasets.
- When performing quality control (QC) checks, we noticed that some users were having difficulty interpreting the QC metrics and deciding on appropriate filtering criteria. To improve user experience, we integrated an interactive QC dashboard using Plotly, which allows users to visually explore the distribution of QC metrics and make informed decisions about data filtering.
- In the clustering and dimensionality reduction steps, we initially used UMAP and t-SNE. While these methods worked well, some users reported that the visualization results were not always intuitive, especially for complex datasets. To enhance interpretability, we added support for additional visualization techniques such as PHATE and TriMap, which provide alternative perspectives on the data.
- During gene set enrichment analysis, we observed that the output from GSEA and Enrichr was sometimes difficult to interpret, particularly for users without a strong background in bioinformatics. To make the results more accessible, we developed a custom visualization tool that generates interactive plots and tables, highlighting the most significant pathways and providing context for the enriched gene sets.
- For trajectory inference, we tested several tools including PAGA and Monocle. Users reported that while PAGA provided good results, the learning curve was steep for new users. To simplify the process, we created a user-friendly interface that guides users through the steps of trajectory inference, with pre-defined settings and clear explanations of each parameter.
- When integrating multiple datasets, we encountered issues with batch effects, which can confound downstream analyses. To mitigate this, we implemented a batch correction step using tools like Harmony and MNN (Mutual Nearest Neighbors). We also provided detailed guidelines on how to assess and correct for batch effects, ensuring that the integrated data is reliable and consistent.
- Finally, we conducted a usability test with a group of wet lab researchers. They provided valuable feedback on the overall user experience, suggesting improvements in the documentation, error messages, and the clarity of the visualizations. Based on their input, we made several enhancements, including more detailed tutorials, clearer error messages, and improved visualizations that are more intuitive and informative.
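The filtering decision that the QC dashboard mentioned above is meant to support can be sketched in a few lines. This is a simplified illustration, not the platform's implementation: the function names are hypothetical, a cell is represented as a plain gene-to-count dictionary, mitochondrial genes are assumed to carry the human `MT-` prefix, and the thresholds in the example are toy values chosen to fit the tiny input.

```python
from typing import Dict


def qc_metrics(counts: Dict[str, int], mito_prefix: str = "MT-") -> Dict[str, float]:
    """Compute three common per-cell QC metrics from a gene -> count mapping."""
    total = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)  # genes with nonzero counts
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    pct_mito = 100.0 * mito / total if total else 0.0
    return {"total_counts": total, "n_genes": n_genes, "pct_mito": pct_mito}


def passes_qc(metrics: Dict[str, float], min_genes: int, max_pct_mito: float) -> bool:
    """Keep cells with enough detected genes and a low mitochondrial fraction."""
    return metrics["n_genes"] >= min_genes and metrics["pct_mito"] <= max_pct_mito


# Toy cells; real cells have thousands of genes.
cells = {
    "cell_a": {"ACTB": 5, "GAPDH": 3, "MT-CO1": 2},             # 20% mitochondrial
    "cell_b": {"ACTB": 8, "GAPDH": 0, "CD3E": 1, "MT-CO1": 1},  # 10% mitochondrial
}
kept = [
    name
    for name, counts in cells.items()
    if passes_qc(qc_metrics(counts), min_genes=2, max_pct_mito=15.0)
]
print(kept)
```

Plotting the distribution of each metric before choosing `min_genes` and `max_pct_mito` is what turns these cutoffs from guesses into informed choices.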
Learn
Iterations
Iterations in scRNA-seq Data Processing: Enhance Efficiency and Accuracy
Initially, our scRNA-seq pipeline relied heavily on Cell Ranger for data preprocessing. While robust, this approach was resource-intensive and not always user-friendly for those with limited computational resources. To address these issues, we implemented several iterations:
- Optimized Parameters: We fine-tuned the parameters of Cell Ranger to balance between computational efficiency and data quality, making it more accessible for users with varying hardware capabilities.
- Alternative Preprocessing Tools: We integrated support for alternative tools like Alevin and STARsolo, which offer more flexibility and can handle different types of sequencing data more efficiently.
- Automated QC Checks: We developed an automated quality control (QC) module that provides detailed reports and recommendations, simplifying the process of identifying and filtering low-quality cells and genes.
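One common heuristic for recommending QC cutoffs automatically is to flag values that lie more than a few median absolute deviations (MADs) from the median. The source does not state which rule the QC module uses, so the sketch below is an assumption offered for illustration; `mad_bounds` and the example values are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple


def mad_bounds(values: Sequence[float], n_mads: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) bounds at n_mads median absolute deviations."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med - n_mads * mad, med + n_mads * mad


def flag_outliers(values: Sequence[float], n_mads: float = 3.0) -> List[int]:
    """Indices of values that fall outside the MAD-based bounds."""
    lower, upper = mad_bounds(values, n_mads)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Total counts per cell; the last cell looks like a doublet or an artifact.
total_counts = [1000, 1100, 900, 1050, 950, 5000]
lower, upper = mad_bounds(total_counts)
print(lower, upper)
print(flag_outliers(total_counts))
```

Because the median and MAD are robust to extreme values, a single aberrant cell does not drag the recommended thresholds with it, unlike a mean and standard deviation rule.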
Iterations in Integration: Enhance Data Integration and Accessibility
To improve the integration of multiple datasets and make the platform more user-friendly, we introduced the following enhancements:
- Batch Effect Correction: We incorporated advanced batch effect correction methods such as Harmony and MNN (Mutual Nearest Neighbors), ensuring that integrated datasets are consistent and reliable.
- Public Dataset Integration: We added a feature to directly import and integrate public datasets from repositories like GEO and ArrayExpress, streamlining the process of combining external data with user-generated data.
- Interactive Data Exploration: We developed an interactive data exploration dashboard using Plotly and Dash, allowing users to visualize and explore their data in real-time, without the need for extensive programming knowledge.
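The core idea behind MNN-based batch correction is to find pairs of cells, one from each batch, that are each other's nearest neighbors; such pairs are taken to represent the same biological state. The sketch below shows only this pairing step on toy 2-D points with Euclidean distance and a brute-force search. The function names are hypothetical, and the subsequent computation of correction vectors, as well as Harmony's different clustering-based approach, is not shown.

```python
from math import dist
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, ...]


def nearest(source: Sequence[Point], target: Sequence[Point], k: int) -> List[Set[int]]:
    """For each source point, the indices of its k nearest target points."""
    return [
        set(sorted(range(len(target)), key=lambda j: dist(p, target[j]))[:k])
        for p in source
    ]


def mnn_pairs(batch_a: Sequence[Point], batch_b: Sequence[Point], k: int = 1) -> List[Tuple[int, int]]:
    """Pairs (i, j) where a[i] and b[j] are in each other's k-nearest lists."""
    a_to_b = nearest(batch_a, batch_b, k)
    b_to_a = nearest(batch_b, batch_a, k)
    return [
        (i, j)
        for i in range(len(batch_a))
        for j in sorted(a_to_b[i])
        if i in b_to_a[j]
    ]


# Batch B is batch A shifted by a small offset, plus one cell type absent from A.
batch_a = [(0.0, 0.0), (5.0, 5.0)]
batch_b = [(0.5, 0.5), (5.5, 5.5), (20.0, 20.0)]
pairs = mnn_pairs(batch_a, batch_b, k=1)
print(pairs)
```

The third cell in `batch_b` finds a nearest neighbor in `batch_a`, but that neighbor does not choose it back, so it stays unpaired. This mutuality requirement is what keeps batch-specific populations from being forced onto unrelated cells.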
Iterations in User-Friendliness: Enhance Output Information and Visualization
To provide users with more intuitive and informative outputs, we made the following improvements:
- Enhanced Visualizations: We introduced a variety of visualization techniques, including UMAP, t-SNE, PHATE, and TriMap, each with customizable settings, to help users better understand the structure and relationships within their data.
- Interactive Heatmaps and Plots: We implemented interactive heatmaps and plots for gene expression, differential expression, and pathway enrichment analysis, allowing users to easily identify and interpret key biological insights.
- Customizable Reports: We developed a feature to generate customizable reports, enabling users to export their results in various formats (PDF, HTML, etc.) with tailored content and visualizations.
Improvement of Build
Build a User-Friendly GUI Interface
Our interface design aims to minimize cognitive load for the user. Through a clear and straightforward layout, consistent color coding, and intuitive icons, users can quickly become familiar with the platform upon their first visit. We also provide detailed hints and documentation to ensure users never feel lost or confused during their experience.
Suitable for a Wide Range of Users, from Beginners to Seasoned Researchers
Recognizing the diversity of user backgrounds, skills, and experiences, our interface was rebuilt to offer a high degree of customization. Whether a beginner or an expert, everyone can adjust the tool's parameters and display according to their needs and preferences. For beginners, we provide guided workflows and default settings, while advanced users have access to a wide range of configurable options.
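One simple way to serve both groups is to keep a single table of defaults and let advanced users override individual entries. The sketch below illustrates that pattern only: `DEFAULTS`, `resolve_settings`, and every parameter name and value in it are hypothetical placeholders, not the platform's actual configuration.

```python
from typing import Any, Dict, Optional

# Illustrative defaults a beginner would run with unchanged (placeholder values).
DEFAULTS: Dict[str, Any] = {
    "min_genes": 200,
    "max_pct_mito": 10.0,
    "n_neighbors": 15,
    "resolution": 1.0,
}


def resolve_settings(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user overrides into the defaults, rejecting unknown keys."""
    settings = dict(DEFAULTS)  # never mutate the shared defaults
    for key, value in (overrides or {}).items():
        if key not in DEFAULTS:
            raise KeyError(f"unknown setting: {key}")
        settings[key] = value
    return settings


beginner = resolve_settings()                      # guided workflow: all defaults
advanced = resolve_settings({"resolution": 0.5})   # expert changes one parameter
print(beginner["resolution"], advanced["resolution"])
```

Rejecting unknown keys turns a typo in a parameter name into an immediate, readable error instead of a silently ignored setting.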
Integrated Documentation and Tutorials
We have integrated comprehensive documentation and step-by-step tutorials directly into the platform. These resources cover everything from basic data upload and processing to advanced analysis and interpretation, ensuring that users can maximize the utility of the platform regardless of their prior experience.
Community and Support
To foster a community of users and provide ongoing support, we have established a forum and a dedicated support team. Users can share their experiences, ask questions, and receive timely assistance, creating a collaborative environment that enhances the overall user experience.