This year, our detection system is focused on a single target. Therefore, we consider the Pearson correlation coefficient as the most important evaluation metric. As shown in Figure 1, we first use the correlation coefficient to screen the top 5 targets, and then apply methods such as machine learning modeling and database analysis to evaluate and validate the rationality of these targets.
Figure 1: Target screening and validation process.
Choosing which tool to use depends on your specific needs.
If we need precise control over Python versions and package dependencies, conda is the best choice.
- Installing packages directly with pip can clutter the root environment and make dependency management difficult.
- Python venv environments isolate packages but don't allow fine-grained control over Python or package versions.
- In contrast, conda lets you create separate environments with specific versions of Python and packages. This improves repeatability and avoids dependency issues.
Through some experimentation, we ultimately chose Micromamba as the Python environment management tool over Miniconda and Anaconda.
- Anaconda includes many preinstalled packages and tools we don't need for this project. At over 3GB, it takes up considerable disk space compared to the more lightweight options.
- In contrast, Micromamba contains only the bare minimum needed – the core conda components and Python. At around 30MB, it has a much smaller footprint.
- Beyond size, Micromamba also provides faster dependency resolution and environment creation than Miniconda.
The biggest advantage of Micromamba over Miniconda is speed. Using Miniconda to install dependencies took 2 hours on a teammate's computer, whereas with Micromamba it took just 10-15 minutes.
- Micromamba is faster at installing and updating packages due to its more performant package manager and parser. It also supports parallel installs for more flexible multi-environment management, which helps us a lot.
So for our purposes, Micromamba provides the reliable environment management of conda while being sufficiently lightweight. The speed advantage compared to Miniconda is the primary reason we selected it.
First, we download the corresponding miRNA expression data from TCGA using the TCGA identifier. Then we decompress the resulting TSV files and parse them into Pandas dataframes. Next, we standardize the expression data; since the original dataset already applies a log2 transformation to reduce skew, we normalize the log2-transformed expression values before further analysis.
To facilitate subsequent model training, we also add the sample status (cancer 1 / normal 0) to the normalized dataframe.
Through these preprocessing steps, we transform the data into a form suitable for machine learning and statistical analysis. These preprocessed and curated miRNA expression data will be used in subsequent analyses, such as differential expression identification, feature selection, and classification model training.
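As a minimal sketch of these preprocessing steps (assuming the TCGA files are gzip-compressed TSVs with miRNAs as rows and samples as columns, and using min-max scaling as a stand-in for the normalization step, since the exact formula is not reproduced here; file and function names are illustrative):

import pandas as pd

def preprocess(tsv_path, status_label):
    """Load a TCGA miRNA expression TSV and prepare it for model training.

    Assumptions (illustrative, not the original implementation): the file is
    gzip-compressed, rows are miRNAs, columns are samples, and the values are
    already log2-transformed. Min-max scaling stands in for the normalization
    step described above.
    """
    # pandas decompresses .gz files transparently based on the file extension
    expr = pd.read_csv(tsv_path, sep="\t", index_col=0, compression="infer")

    # Transpose so that samples are rows and miRNAs are columns
    expr = expr.T

    # Min-max normalize each miRNA column to [0, 1]
    expr = (expr - expr.min()) / (expr.max() - expr.min())

    # Prepend the sample status column (cancer 1 / normal 0)
    expr.insert(0, "status", status_label)
    return expr

# Example: combine a cancer cohort and a normal cohort into one training table
# (file names are placeholders)
train_data = pd.concat([
    preprocess("TCGA-LUAD.mirna.tsv.gz", status_label=1),
    preprocess("normal.mirna.tsv.gz", status_label=0),
])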
It is worth noting that last year we found that the performance of our model was limited by an imbalance in the dataset (with far more cancer samples than healthy controls). Therefore, we applied a balance function that draws healthy control data from all cancer datasets to supplement the target cancer dataset. This adjustment significantly improved specificity. Below are the confusion matrices for the target hsa-mir-21 in non-small cell lung cancer before and after the modification.
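A minimal sketch of such a balancing step, assuming each cancer dataset is a DataFrame whose first column is the status label (cancer 1 / normal 0) and whose remaining columns are shared miRNA features (the function and variable names are ours, not the original implementation):

import pandas as pd

def balance_with_external_controls(target_df, other_dfs, random_state=0):
    """Supplement the target cancer dataset with healthy controls (status == 0)
    drawn from the other cancer datasets until the two classes are balanced.
    Illustrative sketch, not the team's exact balance function."""
    n_cancer = (target_df.iloc[:, 0] == 1).sum()
    n_normal = (target_df.iloc[:, 0] == 0).sum()
    needed = n_cancer - n_normal
    if needed <= 0:
        return target_df

    # Pool healthy controls from every other cancer dataset
    pool = pd.concat([df[df.iloc[:, 0] == 0] for df in other_dfs])
    extra = pool.sample(n=min(needed, len(pool)), random_state=random_state)

    return pd.concat([target_df, extra], ignore_index=True)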
Correlation is one of the most commonly used statistical measures. It describes the degree of association between two variables using a single value. The correlation coefficient ranges from −1 to 1. A negative value indicates that as one variable increases, the other decreases; a positive value indicates that as one variable increases, the other also increases; and a value of 0 indicates that changes in one variable have no effect on the other. There are currently three commonly used types of correlation coefficients: the Pearson correlation coefficient, the Spearman correlation coefficient, and the Kendall correlation coefficient.
- Definition: The Pearson correlation coefficient is the most commonly used measure of correlation. It assesses the strength and direction of the linear relationship between two variables.
- Calculation: It is computed based on the mean, standard deviation, and covariance of the variables. The formula is:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$
- Applicable Scenarios: It is suitable for continuous variables where the data is assumed to follow a normal distribution and the relationship is linear.
- Features: It only measures linear relationships and is sensitive to outliers.
Where $x_i$ and $y_i$ are the values of the two variables, and $\bar{x}$ and $\bar{y}$ are their means.
- Definition: The Spearman correlation coefficient is a non-parametric measure based on rankings. It is used to evaluate the monotonic relationship between two variables, regardless of whether the relationship is linear.
- Calculation: Data is first ranked, and then the Pearson correlation coefficient is calculated based on the ranks. The formula is:

$$ \rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$
- Applicable Scenarios: It is suitable for data that does not meet the normality assumption, or when the relationship is non-linear but monotonic.
- Features: It is less sensitive to outliers and can handle ordinal categorical data.
Where $d_i$ is the difference between the ranks of each pair of samples, and $n$ is the sample size.
- Definition: The Kendall correlation coefficient is a non-parametric statistic based on rankings, measuring the consistency of rank orders between two variables.
- Calculation: It is computed by comparing the order of all pairs of samples, determining whether each pair is concordant or discordant. The formula is:

$$ \tau_b = \frac{n_c - n_d}{\sqrt{(n_c + n_d + T_x)(n_c + n_d + T_y)}} $$
- Applicable Scenarios: It is suitable for small sample sizes or when there are many tied ranks in the data.
- Features: Similar to Spearman, but more effective in handling tied ranks.
Where $n_c$ is the number of concordant pairs, $n_d$ is the number of discordant pairs, and $T_x$ and $T_y$ are the numbers of tied pairs in each variable.
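As a quick illustration (toy data, not from our dataset), all three coefficients can be computed with SciPy:

from scipy.stats import pearsonr, spearmanr, kendalltau

# A monotonic but non-linear toy relationship
x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 36]

print(pearsonr(x, y)[0])    # close to 1, but below it, because the relationship is not linear
print(spearmanr(x, y)[0])   # exactly 1: the relationship is perfectly monotonic
print(kendalltau(x, y)[0])  # exactly 1: every pair of samples is concordant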
Kendall’s correlation coefficient becomes computationally complex and inefficient with larger datasets, so we excluded it. From a wet lab perspective, we aim to accurately analyze patients’ disease states based on expression levels, meaning we seek targets whose expression levels have a linear relationship with the disease state. Therefore, we excluded Spearman’s correlation coefficient and selected Pearson’s correlation coefficient.
We calculated the Pearson correlation coefficients between the microRNA expression levels and disease states for all cancers in the database and plotted a heatmap. In the heatmap, we show targets with a Pearson correlation coefficient greater than 0.6 for at least one type of cancer. We also calculated the average Pearson coefficient for each target and the number of cancers where the Pearson coefficient exceeds 0.6. We hope this result will provide insights into pan-cancer research and target selection.
import pandas as pd
from scipy.stats import pearsonr

def calculate_correlation(train_data, datasets):
    # First column holds the disease status (cancer 1 / normal 0)
    status = train_data.iloc[:, 0]
    # Remaining columns hold the miRNA expression levels
    other_columns = train_data.iloc[:, 1:]
    correlations = {}
    for column in other_columns.columns:
        data = other_columns[column]
        # Pearson correlation between this miRNA's expression and disease status
        corr, _ = pearsonr(status, data)
        if not pd.isnull(corr):
            correlations[column] = corr
    # Name the series after the dataset (cancer type) it was computed from
    correlation_series = pd.Series(correlations, name=datasets)
    return correlation_series
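The per-cancer correlation series returned by this function can then be assembled into the heatmap described above. A sketch of that step, assuming a dictionary mapping cancer names to their training dataframes (the variable name train_data_by_cancer is ours) and using seaborn for plotting:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# train_data_by_cancer: {cancer name -> preprocessed training DataFrame}
all_corrs = pd.concat(
    [calculate_correlation(df, name) for name, df in train_data_by_cancer.items()],
    axis=1,
)

# Keep miRNAs whose Pearson coefficient exceeds 0.6 in at least one cancer
strong = all_corrs[(all_corrs > 0.6).any(axis=1)]

sns.heatmap(strong, cmap="coolwarm", center=0)
plt.show()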
Based on the above analysis, we selected the top 5 targets with the highest Pearson correlation coefficients for ovarian cancer and non-small cell lung cancer, respectively.
For the targets selected using the Pearson correlation coefficient, we performed expression analysis to display their expression levels, aiming to identify microRNAs that are highly expressed in patients.
Due to the limitations of dataset size and to better display the linear relationship between the targets and disease status, we selected relatively simple models: linear regression and logistic regression.
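A minimal sketch of this modeling step for a single candidate miRNA, using scikit-learn's logistic regression (the DataFrame name train_data and the choice of hsa-mir-21 as the feature column are illustrative, not the exact pipeline):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# train_data: balanced DataFrame with the 'status' label first, miRNA columns after
X = train_data[["hsa-mir-21"]]
y = train_data["status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Score each test case with the predicted probability of being a cancer sample
scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))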
We were surprised by the high AUC value for certain single targets. To verify whether this result is accurate, we generated a score distribution plot (with the x-axis representing case numbers in the test set, the y-axis representing the scores assigned by the machine learning model, and the color representing the true disease status of each case):
After performing machine learning analysis, we reviewed and discussed all the results. Based on relevant literature, we ultimately selected hsa-mir-200c and hsa-mir-99b as targets for ovarian cancer, and hsa-mir-21 and hsa-mir-141 as targets for non-small cell lung cancer.
Establishing a molecular kinetic model holds immense significance:
- Validating the rationality and feasibility of wet experiments.
- Optimizing the protocol for wet experiments.
- Conserving human and material resources.
- Investigating reaction mechanisms and offering guidance for their future analysis and design.
To validate the feasibility of our designed reaction system, we developed a chemical reaction dynamics model to investigate its reaction kinetics and mechanism. In our detection system, the central reaction in the wet experiment is the polymerase-mediated strand displacement reaction. Thus, during the modeling process, we first adjusted the rate constants associated with each reaction step based on various experimental data and generated simulation curves. By comparing these curves with the experimental data, we identified the most appropriate parameters and derived a simulation curve that closely aligns with the experimental results. The method we used is the same as last year (Simulation (Modeling) — POMELO (igem.wiki)), through which we obtained the simulated fluorescence curves for our two systems:
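The sketch below shows the general form of such a kinetics model: a toy two-step mass-action scheme integrated with SciPy. The species, rate constants, and initial concentrations are placeholders, not our actual reaction network or fitted parameters (those follow last year's model linked above).

import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt

# Placeholder two-step mass-action scheme (not the actual reaction network):
#   target + probe  -> intermediate         (rate constant k1)
#   intermediate    -> fluorescent product  (rate constant k2, polymerase-mediated step)
k1, k2 = 1e4, 0.05   # illustrative rate constants (1/(M*s), 1/s)

def odes(y, t):
    target, probe, inter, product = y
    v1 = k1 * target * probe
    v2 = k2 * inter
    return [-v1, -v1, v1 - v2, v2]

y0 = [10e-9, 50e-9, 0.0, 0.0]   # illustrative initial concentrations in M
t = np.linspace(0, 3600, 600)   # one hour, in seconds
sol = odeint(odes, y0, t)

# Fluorescence is taken as proportional to the product concentration
plt.plot(t / 60, sol[:, 3])
plt.xlabel("Time (min)")
plt.ylabel("Simulated fluorescence (a.u.)")
plt.show()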