Omics study report | Squirrel-Beijing 2024

Project Introduction

Transcriptomics study is based on the Illumina high throughput sequencing platform, aiming to analyze all mRNA transcripts produced by a specific tissue or cell under certain conditions. As one of the foundations for studying gene function and structure, transcriptomics plays a crucial role in understanding organismal development and the onset of diseases. The company's transcriptomics projects primarily utilize the RNA-seq workflow, which includes library preparation, sequencing, and bioinformatics analysis. This method has advantages such as high throughput and sensitivity, as well as broad applicability, making it the main approach for current transcriptome research.

To ensure the quality and reliability of samples and data, the company strictly controls every step of the process, including sample testing, library preparation, and sequencing.

library preparation and sequencing workflow

2.1 sample testing

The company uses standard extraction methods to isolate RNA from tissues or cells. Subsequently, the company perform strict quality control on the RNA samples with Agilent 2100 Bioanalyzer to ensure RNA integrity.

2.2 Library Construction and Sequencing

The company obtains mRNA samples for library construction through two methods. The first one is to enrich mRNA with a polyA tail, which most eukaryotic mRNAs have, using Oligo(dT) magnetic beads. The second one is to removes ribosomal RNA from Total RNA to obtain mRNA. Subsequently, in Fragmentation Buffer, mRNA is randomly fragmented using divalent cations, and library construction is carried out using either the regular library construction method or the strand-specific library construction method (Parkhomchuk et al., 2009). The library construction process for both methods is shown in the figure 1.

Figure 1. Schematic Diagram of Library Construction Principle

Figure 2. Flowchart of Conventional Library Preparation (Left) and Flowchart of Strand-Specific Library Preparation (Right)

After the library construction is completed, it is preliminarily quantified using a Qubit2.0 Fluorometer, and the library is diluted to 1.5ng/ul. Then, the insert size of the library is checked using an Agilent 2100 bioanalyzer. Once the insert size meets the expectations, qRT-PCR is used to accurately quantify the effective concentration of the library (the effective concentration of the library is higher than 1.5nM) to ensure the quality of the library. The library construction process is shown in the figure 2.
After the library inspection is qualified, different libraries are pooled according to the effective concentration and the target next-generation sequencing data volume requirements, and then Illumina sequencing is carried out to obtain sequence information.

Data Analysis Process

In the study, the company compared the gene expression levels of cell samples cultured under different conditions and determined the specific genes that responded to the conditions. After obtaining the target gene samples and raw data, the company aggregated the data, conducted quality control, and filtered to obtain clean reads. They also analyzed its biological significance through a variety of methods such as comparison, quantification, significance of difference analysis, and functional enrichment. Finally, gene functions are analyzed using data processing methods such as GO enrichment, KEGG enrichment and protein interaction network analysis.

Analysis Results

4.1 Sequencing Data Quality Control

The company utilized the Illumina high-throughput sequencing platform to complete sequencing of nucleic acid samples and conducted data quality control. Various methods were employed to filter clean reads, resulting in the retention of approximately 96.62% usable sequencing data. The unusable data were primarily caused by the following factors: 1. For longer sequencing fragments, the consumption of chemical reagents in later stages may lead to a decline in sequencing quality. 2. The first six bases may contain errors due to incomplete binding between random primers and the RNA template. To obtain clean reads for downstream analysis, the company screened the sequencing results based on indicators such as single-base sequencing quality (Qphred scores) and A/T/G/C content distribution, and removed the following reads: 1. Reads containing adapters. 2. Reads with undetermined base information (N). 3. Reads with low sequencing quality (where more than 50% of bases had a Qphred score >5). 4. The resulting clean reads were used for subsequent analysis, as the figure 3.

Figure 3. Filtering Status of Sample Sequencing Data

4.2 Reference Genome Alignment

After obtaining clean reads through quality control filtering, the company used HISAT2 software for comparison, located the mRNA fragment locations obtained from the sample on the reference genome, converted the results into bam files with samtools, and conducted visualization analysis with IGV viewer. The results are shown in figure.4

Figure 4. Visualization of Alignment Results using IGV Browser

Figure 5. The distribution of sequencing reads across genomic regions

Then the company Analyzed and statistically evaluated the distribution of differentially expressed regions (introns, exons, intergenic regions, etc.). The results are shown in the figure 5. It is speculated that reads aligned to intronic regions may originate from precursor mRNA or retained introns due to alternative splicing events. Reads aligned to intergenic regions may originate from ncRNAs or minor DNA fragment contamination, or it could indicate that gene annotations are not yet complete.

4.3 New Gene Prediction

The company used StringTie, a fast and accurate tool for assembling transcripts and predicting their expression levels, for new transcript assembly. Then, the company annotated the new transcript with Pfam, SUPERFAMILY, GO, KEGG, and other databases after the new transcript is assembled.

4.4 Quantitative Analysis

With the featureCounts tool in the Subread software package (Liao et al., 2014), the company counted the number of reads covering each gene, including newly predicted genes, from start to end. Some reads were removed, including reads with alignment quality values below 10, non-paired alignments, and reads mapping to multiple regions of the genome.

After completing the counting, the company performed quantitative analysis of gene expression levels, gene expression distribution, inter-sample correlation, and principal component analysis, generated the following charts. Finally, the company generated a Venn diagram to illustrate the co-expression of genes among samples, showing the number of genes uniquely expressed in each group and sample. The overlapping regions indicate the number of genes that are co-expressed across two or more groups and samples.

Figure 6. Presentation of gene expression quantification results

The company performed quantitative analysis of gene expression levels for each sample separately, then combined the results to obtain the expression matrix for all samples.

Figure 7. Boxplot of Sample Gene Expression Levels

After normalizing for sequencing depth and gene length, the company calculated the expression values (FPKM) for all genes across the samples and plotted boxplots. In the plot, the x-axis represents the sample names, and the y-axis represents log2(FPKM + 1). Each boxplot region displays five statistics (from top to bottom: maximum, upper quartile, median, lower quartile, and minimum).

Figure 8. Heatmap of Correlation between Samples

To ensure and verify the biological replicates, the company calculated the Pearson correlation coefficients of FPKM values within and between groups of samples and plotted them as a heatmap. The numbers in the heatmap represent the square of the correlation coefficients between each pair of samples. Correlation coefficients closer to 1 indicate a higher similarity in expression patterns between samples.

Figure 9. Principal Component Analysis (PCA) Result Plot

The company also performed Principal Component Analysis (PCA), using linear algebra computational methods to reduce the dimensionality and extract principal components from tens of thousands of gene variables. A PCA plot was drawn (with the first principal component on the x-axis and the second principal component on the y-axis) to evaluate inter-group differences and intra-group sample replication. Under ideal conditions, samples from different groups should be dispersed, while samples within the same group should cluster together.

Finally, the company generated a Venn diagram to illustrate the co-expression of genes among samples, showing the number of genes uniquely expressed in each group and sample. The overlapping regions indicate the number of genes that are co-expressed across two or more groups and samples.

Figure 10. Venn Diagram of Co-expressed Genes

4.5 Differential Expression Analysis

After completing the quantitative analysis of gene expression, the company performed differential gene expression analysis using two different software and standardization methods, depending on the biological replication of the experiments. For experiments with biological replicates, the company utilized DESeq2 (Anders et al., 2014) software and performed standardization using the DESeq method. For experiments without biological replicates, the company used edgeR (Robinson et al., 2010) software and standardized the data using the TMM method. After standardization, the company calculated p-values with a negative binomial distribution and the FDR (often represented as padj, which is introduced to adjust the hypothesis test p-values, thereby controlling the accumulation of false positives due to the large number of transcriptome samples) with BH method. Subsequently, the company screened for differentially expressed genes based on the empirical criteria of |log2(FoldChange)| ≥ 1 & padj ≤ 0.05, and adjusted the screening thresholds as needed to ensure that the number of selected genes met the requirements for subsequent analyses. The top five differentially significant results identified were ENSBTAG00000014534, ENSBTAG00000013103, ENSBTAG00000013472, ENSBTAG00000014824, and ENSBTAG00000004258.

After completing the statistical analysis of all differentially expressed genes, the company conducted data analysis and visualized the results through graphs and tables, as shown below.

Figure 11. Bar Chart of Differential Gene Counts Across Comparison Groups

The company created a bar chart using the number of differentially expressed genes and the quantities of upregulated and downregulated genes after comparison and analysis. The blue and gray bars represent upregulated and downregulated differentially expressed genes, respectively, and the numbers on the bars indicate the count of differentially expressed genes.

Figure 12. Volcano Plot of Differential Genes

A volcano plot was used to analyze the distribution of differentially expressed genes for each comparison combination. The x-axis represents the fold change in gene expression between the treatment and control groups, while the y-axis indicates the significance level of the expression differences between the two groups. Red and green points represent upregulated and downregulated genes, respectively.

Figure 13. Venn Diagram of Differential Genes

The company used a Venn diagram to illustrate the overlap of differentially expressed genes between different comparison combinations, with different colors representing different combinations. The sum of all numbers within each circle represents the total number of differentially expressed genes for that comparison combination, while the overlapping areas indicate the number of common differentially expressed genes between the combinations.

Figure 14. Heatmap of Clustered Differentially Expressed Genes (DEGs)

The company also obtained a set of differentially expressed genes and performed hierarchical clustering analysis on the FPKM values of the genes. The rows were normalized using Z-scores, and the resulting values were displayed. The comparison can only be made horizontally, not vertically. The colors range from red, indicating higher expression levels, to green, indicating lower expression levels.

4.6 Enrichment Analysis

To facilitate subsequent gene function analysis, the company uses enrichment analysis to segment the list of differentially differentiated genes into multiple parts, expecting to discover key signaling pathways. During the operation, the company firstly built a gene set (such as GO and KEGG databases, etc.), which refers to genome annotation information for classification. The target gene set (differential gene set or other gene set) was then mapped to the background gene set. In this study, the differential gene set was the gene set of the differentially significant genes analyzed and annotated to the GO or KEGG database, and the background gene set was the gene set of all the genes that were significantly analyzed and annotated to the GO or KEGG database.
The company uses ClusterProfiler software to perform GO functional enrichment analysis and KEGG pathway enrichment analysis on differential gene sets based on the principle of hypergeometric distribution (as shown in figure 15). The results of enrichment analysis were the enrichment of all differential gene sets, up-regulated differential gene sets, and down-regulated differential gene sets for each differential comparison combination.

Figure 15. Schematic Diagram of Gene Enrichment Analysis

4.6.1 GO enrichment analysis

Gene Ontology (GO) is a comprehensive database that describes gene function, which can be divided into three parts: biological process, cellular component, and molecular function. In this analysis, the Company used a less than 0.05 padj value as the threshold for significant enrichment and performed a visual analysis of the results, as shown below.

Figure 16. Up- and Down-Regulated Bar Chart of GO Enrichment Analysis

From the results of GO enrichment analysis, the most significant 30 terms were selected, and if there were less than 30, all terms were selected, and the histogram was drawn according to the three categories of biological process, cell composition and molecular function, as well as the upward and downward regulation of genes. In the figure, the abscissa is GO Term, and the ordinate is the number of gene upward regulation and downward regulation annotated to GO Term.

Figure 17. Bar Chart of GO Enrichment Analysis

In the figure, the abscissa is GO Term, the selection criteria are the same as above, and the different colors represent the three GO subclasses of BP, CC, MF. The ordinate is the significance level of GO Term enrichment, represented by log10(padj), indicates number of differential genes enriched for the term.

Figure 18. Scatter Plot of GO Enrichment Analysis

The abscissa is the ratio of the number of differential genes annotated to the GO Term to the total number of differential genes, the ordinate is the GO Term, the size of the dot represents the number of genes annotated to the GO Term, and the color from red to purple represents the significance of the enrichment. The selection criteria for GO Term are the same as above.

Figure 19. Directed Acyclic Graph (DAG) of GO Enrichment Analysis

Finally, the Directed Acyclic Graph (DAG) was used to graphically display the results of GO enrichment analysis of differential genes. In the figure, the branches represent inclusion relationships. The functional range narrows from top to bottom. For each comparison, the top 5 most significantly enriched GO Terms are selected as main nodes in the directed acyclic graph (shown in boxes). Associated GO Terms are displayed via these inclusion relationships, with color shades indicating the degree of enrichment. In our project, DAG maps of biological processes, molecular functions, and cellular components are mapped separately.

4.6.2 KEGG pathway enrichment analysis

Kyoto Encyclopedia of Genes and Genomes (KEGG) is a comprehensive database that integrates genomic, chemical, and system functional information. In this part of the analysis, the company used a padj value of less than 0.05 as the threshold for significant enrichment and performs a visual analysis of the results as figures shown below.

Figure 20. Bar Chart of KEGG Enrichment Analysis

From the KEGG enrichment results, the 20 most significant KEGG pathways were selected to draw a histogram for display, and if there were less than 20, all pathways were plotted. The abscissa is the KEGG pathway, the ordinate is the significance level of pathway enrichment. The higher the value, the more significant level of pathway enrichment, and the value on the column represents the number of differential genes enriched to the pathway.

Figure 21. Scatter Plot of KEGG Enrichment Analysis

The abscissa is the ratio of the number of differential genes annotated to the KEGG pathway, the ordinate is the KEGG pathway. The size of the dot represents the number of genes annotated to the KEGG pathway, and the color from red to purple represents the significance of enrichment. The KEGG enrichment criteria are the same as those on the left.

4.6.3 protein interaction network analysis

The company used the STRING protein interaction database to analyze the protein interaction network of differentially related genes and extracted the interaction relationships of target gene sets from the database to build the network. Subsequently, the company used Cytoscape software to visually edit the interaction grid, resulting in the following network.

Figure 22. Protein-Protein Interaction (PPI) Network Graph

Each node represents a protein, and each line represents the interaction between the linked proteins This diagram has the following properties and characteristics: The size of a node in an interaction network diagram is proportional to the degree of the node (the more connections you have with other nodes, the larger the size of the node, and possibly at the core of the interaction grid). The color of the node is related to the clustering coefficient of the node, and the color gradient from green to red corresponds to the value of the aggregation coefficient from low to high (the higher the aggregation coefficient, the better the connectivity between the adjacencies of the node).