
Software

ChromoBlend

Our Aim

Accurately replicating colours using chromoproteins can be challenging. The process of mixing them often relies on trial and error, which is time-consuming and inefficient and leads to inconsistent results. Recognising these challenges, we reached out to Professor Oliver Ebenhöh, a leading expert in mathematical modelling. During our discussions, he highlighted the potential of mathematical modelling in dye applications, particularly in the context of the Cyan, Magenta, Yellow and Key (CMYK) colour model, a subtractive model widely used in colour printing (Westland and Cheung, 2023). Inspired by this concept, we aimed to develop a software tool capable of predicting the precise amounts of chromoproteins needed to create any desired shade.

Our approach focuses on the CMYK colour model, using aeCP597 for Cyan (Shkrob et al., 2005), mScarlet for Magenta (Alieva et al., 2008), and sfGFP for Yellow, intentionally leaving out the K component due to the lack of a matching chromoprotein. By mixing these three chromoproteins, we can achieve a wide spectrum of colours, allowing the reproduction of any desired shade. Our software streamlines this process by allowing users to input a HEX code, a widely used digital colour format, and then calculate the precise volumes of each chromoprotein needed to replicate that colour. This allows labs and industries to consistently replicate specific colours without the need for manual adjustments or repeated trial-and-error.
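The first step in that pipeline is converting the HEX input to RGB. A minimal Python sketch (the function name is ours, for illustration, and not necessarily what the released software uses):

```python
def hex_to_rgb(hex_code: str) -> tuple[int, int, int]:
    """Convert a HEX colour code such as '#1A6FB0' to an (R, G, B) tuple."""
    hex_code = hex_code.lstrip("#")
    # Each pair of hex digits encodes one 8-bit channel (R, G, B).
    return tuple(int(hex_code[i:i + 2], 16) for i in range(0, 6, 2))

print(hex_to_rgb("#FF8800"))  # (255, 136, 0)
```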

Evaluation

In order to create reliable software for predicting dye concentrations, we needed a dataset that accurately reflected the relationship between dye proportions and the final colour outcome. Rather than relying on literature values, we generated our own dataset by systematically mixing chromoproteins (aeCP597 for Cyan, mScarlet for Magenta, and sfGFP for Yellow) and then measuring the resulting colours in terms of their RGB values.

Figure 1: 96-well plate with mixed concentrations of aeCP597 (Cyan), mScarlet (Magenta), and sfGFP (Yellow). Each well displays a unique colour created by varying chromoprotein volumes.

Figure 2: Dataset visualised in the RGB colour spectrum, showing the range of colours generated through varying volumes of aeCP597, mScarlet, and sfGFP chromoproteins.

To predict the concentrations of chromoproteins for any given RGB input, we considered several approaches, selecting the most effective one based on performance metrics and model suitability. To evaluate the different approaches, we used Root Mean Squared Error (RMSE) as our performance indicator, a common metric that reflects how closely predicted values align with actual values by taking the square root of the average squared difference between them (Ajala et al., 2022). Lower RMSE values indicate better accuracy, with a smaller average error in prediction (Forrest et al., 2021).
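As a quick illustration, RMSE can be computed in a few lines of Python (toy values, not from our dataset):

```python
import numpy as np

def rmse(actual, predicted) -> float:
    """Root Mean Squared Error: square root of the mean squared difference."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Predictions off by 3 and 4 give an RMSE of sqrt((9 + 16) / 2) ~ 3.54.
print(rmse([10.0, 20.0], [13.0, 16.0]))
```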

One potential approach was Random Forest (RF), a method known for capturing complex, non-linear patterns in data. However, Random Forest generally requires larger datasets to perform reliably, as it can easily overfit, leading to unreliable results (Couronné et al., 2018). Given our limited dataset of 96 samples, this method was less practical for achieving promising results.

For this reason, we considered two more approaches for our software: Linear Regression (LR) and k-Nearest Neighbour (kNN). Linear Regression, although straightforward and suited to smaller datasets, assumes a linear relationship between RGB inputs and chromoprotein concentrations, which may be overly simplistic for potential non-linear patterns within our data (Schmidt and Finan, 2018). Despite these limitations, Linear Regression performed comparably to kNN, as shown in Table 1: the RMSE values for both methods were close. However, kNN achieved lower RMSE values for the Cyan and Magenta channels, suggesting that it captured the variation within our dataset better overall. In our case, kNN works by comparing the input RGB values to the closest matches in our dataset and averaging the corresponding CMY chromoprotein volumes of those nearest neighbours (Kaliappan et al., 2022).
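To make this averaging concrete, here is a minimal, self-contained sketch of kNN prediction in plain NumPy (not our actual implementation; the RGB values and volumes below are invented for illustration):

```python
import numpy as np

# Illustrative data only: each row of `rgb` is a measured colour, each row of
# `cmy_volumes` the chromoprotein volumes that produced it (values invented).
rgb = np.array([[200, 40, 40], [40, 200, 40], [40, 40, 200], [120, 120, 40]])
cmy_volumes = np.array([[5.0, 60.0, 50.0], [55.0, 5.0, 60.0],
                        [50.0, 55.0, 5.0], [20.0, 20.0, 45.0]])

def knn_predict(query_rgb, k=3):
    """Average the CMY volumes of the k nearest RGB matches (Euclidean)."""
    distances = np.linalg.norm(rgb - np.asarray(query_rgb), axis=1)
    nearest = np.argsort(distances)[:k]
    return cmy_volumes[nearest].mean(axis=0)

print(knn_predict([180, 60, 50]))  # average volumes of the 3 closest colours
```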

In evaluating our models, we also examined how the RGB and CMY channels relate to each other, given that RGB is an additive model while CMY is subtractive. We found that Red negatively correlates with Cyan and Blue with Yellow, whereas the correlation between Green and Magenta is weak (Table 2). Although the complementary pairing suggests the possibility of simplifying the model by using only complementary colour channels, the results indicate otherwise. As shown in Table 3, restricting the model to the complementary colour channels increased the RMSE values, while considering all RGB channels led to the most accurate predictions.
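Correlation coefficients such as those in Table 2 can be computed channel by channel; below is a toy NumPy sketch with invented values (a descending Red channel paired with a rising Cyan volume, giving a strongly negative coefficient):

```python
import numpy as np

# Invented example values: Red falls as the Cyan volume rises.
red = np.array([250, 200, 150, 100, 50])
cyan_volume = np.array([2.0, 15.0, 30.0, 42.0, 55.0])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the coefficient.
r = np.corrcoef(red, cyan_volume)[0, 1]
print(round(r, 3))  # strongly negative, close to -1
```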

Based on our evaluation, we decided to use k-Nearest Neighbour (kNN), a widely used supervised machine learning method, as our model due to its lower RMSE values and overall superior performance (Uddin et al., 2022). Another significant advantage of kNN is its intuitiveness, which makes it understandable and easy to use across various industries. Since our software also features a user interface, which will be discussed later in more detail, it was essential to choose a method that is straightforward. With kNN's logic of comparing an input to its closest neighbours, users can follow the software's output, adding to its practical appeal and accessibility.

Table 1: Comparison of RMSE values between Linear Regression and k-Nearest Neighbour (Z-score, k=3) across Cyan, Magenta, and Yellow chromoprotein volumes.

                 Linear Regression   kNN (Z-score, k = 3)
RMSE Cyan        10.901              9.558
RMSE Magenta     13.711              11.992
RMSE Yellow      7.623               10.772

Table 2: Correlation coefficients between RGB and CMY colour channels.

        Cyan     Magenta   Yellow
Red     -0.831   0.553     0.263
Green   -0.768   0.098     0.641
Blue    0.470    0.307     -0.903

Table 3: RMSE values for Linear Regression with complementary colour channels versus all colour channels.

               LR All Colour Channels   LR Complementary Colour Channels
RMSE Cyan      10.901                   14.082
RMSE Magenta   13.711                   23.822
RMSE Yellow    7.623                    10.812

To ensure that kNN works optimally with our dataset, several parameters needed to be set. One of the most critical parameters is k, which specifies how many neighbours the algorithm considers when making a prediction (Zhang, 2016). If k is set too low, the model risks overfitting, becoming overly sensitive to small variations in the data. On the other hand, if k is too large, the model may oversimplify and lose accuracy. The challenge, therefore, is to find the optimal k value that balances accuracy without overfitting the data (Zhang, 2016). Additionally, scaling the RGB input values was essential to improve the model’s accuracy. Since kNN is sensitive to the scale of input data, we had to standardise the RGB values to ensure that one colour channel, for example Red, does not dominate the predictions. We explored two scaling methods to determine the most suitable one. Z-score normalisation scales the data based on its mean and standard deviation, making it effective when the dataset contains outliers (Andrade, 2021). Min-max normalisation, on the other hand, scales all data points between 0 and 1, which is useful when the dataset has a limited range of values (Lambert et al., 2024).
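Both scaling methods are simple to express; the sketch below applies them to a few invented RGB rows (not our measured data):

```python
import numpy as np

rgb = np.array([[250.0, 40.0, 40.0], [40.0, 250.0, 40.0], [120.0, 120.0, 200.0]])

# Z-score normalisation: centre each channel and divide by its standard deviation.
z_scored = (rgb - rgb.mean(axis=0)) / rgb.std(axis=0)

# Min-max normalisation: rescale each channel to the range [0, 1].
min_max = (rgb - rgb.min(axis=0)) / (rgb.max(axis=0) - rgb.min(axis=0))

print(z_scored.round(3))
print(min_max.round(3))
```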

To determine the best scaling method and k value combination, we used Root Mean Squared Error (RMSE) again as our performance indicator (Ajala et al., 2022). To ensure that this measure is reliable, RMSE must be calculated on unseen data that the model has not been trained on (Ajala et al., 2022). Therefore, we split the dataset into training and testing sets using an 80/20 ratio, where 80% of the data trains the model, and 20% is reserved for testing. This approach helps achieve a balance between overfitting, where the model is too closely tuned to the training data, and underfitting, where it fails to capture important patterns (Rácz et al., 2021). By computing RMSE on the test set, we can assess how well the model generalises to unseen data. Ultimately, the lower the RMSE value, the more accurate the model (Forrest et al., 2021). For our dataset, we compared the RMSE values for different scaling methods and k values and, though the values are quite similar, decided that Z-score normalisation and k=5 gives us the most accurate prediction.

Table 4: Overview of different scaling methods (Z-score, Min-Max) and k values (3, 5 and 7) for the respective colours Cyan, Magenta and Yellow.

Scaling Method   k Value   RMSE Cyan   RMSE Magenta   RMSE Yellow
Z-score          3         9.5581      11.9928        10.7726
Z-score          5         8.7178      13.0554        10.4243
Z-score          7         10.0283     14.3213        10.8326
Min-Max          3         8.8889      11.3038        10.8012
Min-Max          5         9.0431      13.4825        10.0443
Min-Max          7         9.8343      14.1861        10.8588
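The evaluation procedure described above (80/20 split, kNN prediction, per-channel RMSE for each k) can be sketched as follows. Note that the dataset here is a random stand-in for our 96 samples, so the printed numbers are illustrative only and will not match Table 4:

```python
import numpy as np

rng = np.random.default_rng(42)

# Random stand-in for the 96-sample dataset (the real RGB/volume pairs come
# from the plate measurements).
rgb = rng.uniform(0, 255, size=(96, 3))
cmy = rng.uniform(0, 60, size=(96, 3))

# 80/20 split: shuffle the indices, train on the first 80%, test on the rest.
indices = rng.permutation(len(rgb))
cut = int(0.8 * len(rgb))
train_idx, test_idx = indices[:cut], indices[cut:]

def rmse(actual, predicted):
    """Per-channel Root Mean Squared Error over the test set."""
    return np.sqrt(np.mean((actual - predicted) ** 2, axis=0))

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k nearest training points (Euclidean)."""
    distances = np.linalg.norm(train_x - query, axis=1)
    return train_y[np.argsort(distances)[:k]].mean(axis=0)

for k in (3, 5, 7):
    preds = np.array([knn_predict(rgb[train_idx], cmy[train_idx], q, k)
                      for q in rgb[test_idx]])
    errors = rmse(cmy[test_idx], preds)
    print(k, errors.round(3))
```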

Software/UI

Our User Interface (UI) was designed to make determining chromoprotein volumes as intuitive and accessible as possible. From the very start, our goal was to simplify each step so that users with no background in programming or machine learning could confidently use the software. By streamlining each process, from colour input to visualising results, we created an interface that allows users to focus on the output without needing to understand the coding behind it.

The UI begins with an easy-to-use colour input, where users can simply enter a HEX code. This HEX code is automatically converted to its corresponding RGB values, which are then scaled to ensure balanced predictions. With the optimised k-value (k=5 in this case), the kNN algorithm calculates the nearest RGB matches in our dataset. The UI then displays these matches alongside the calculated averages for each chromoprotein, providing an accurate estimate of aeCP597, mScarlet, and sfGFP volumes to reproduce the desired colour. On top of that, we included a 3D visualisation of the dataset, which shows the relationship between the user's input colour and the closest matches in our dataset, letting users see precisely how their desired colour aligns with existing samples.
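The end-to-end flow behind the UI (HEX input, conversion to RGB, Z-score scaling, kNN averaging) can be sketched as follows; the dataset and function names are invented for illustration, and the released software may differ in detail:

```python
import numpy as np

# Illustrative stand-ins for our measured dataset: RGB colours and the
# aeCP597/mScarlet/sfGFP volumes that produced them (values invented).
rgb_data = np.array([[200, 40, 40], [40, 200, 40], [40, 40, 200],
                     [120, 120, 40], [60, 60, 60]], dtype=float)
volumes = np.array([[5.0, 60.0, 50.0], [55.0, 5.0, 60.0], [50.0, 55.0, 5.0],
                    [20.0, 20.0, 45.0], [35.0, 35.0, 35.0]])

mean, std = rgb_data.mean(axis=0), rgb_data.std(axis=0)

def predict_volumes(hex_code: str, k: int = 5):
    """HEX -> RGB -> Z-score scaling -> average volumes of k nearest matches."""
    h = hex_code.lstrip("#")
    query = np.array([int(h[i:i + 2], 16) for i in (0, 2, 4)], dtype=float)
    scaled_data = (rgb_data - mean) / std
    scaled_query = (query - mean) / std
    nearest = np.argsort(np.linalg.norm(scaled_data - scaled_query, axis=1))[:k]
    return volumes[nearest].mean(axis=0)

print(predict_volumes("#B43C32"))
```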

Ultimately, our UI combines technical robustness with an intuitive design, allowing users to easily determine chromoprotein volumes. This approach makes the tool highly accessible across various fields, allowing users to achieve precise colour-matching without trial-and-error.

Figure 3: User Interface (UI) of our Software ChromoBlend.

Limitations & Future Outlook

Our software has the potential to be used by various laboratories and industries for chromoprotein-based colour prediction. However, one limitation we currently face is the relatively small size of our dataset. While the kNN model has shown promising results, having a larger dataset would enhance its performance. With more data, we could also explore the use of Random Forest, which is well-suited for larger datasets due to its ability to capture non-linear relationships and patterns, ultimately leading to more accurate predictions (Couronné et al., 2018).

Additionally, our PI Dr. St. Elmo Wilken has suggested exploring neural networks as a promising alternative. Neural networks can be effective in handling smaller datasets by learning complex, non-linear relationships through iterative training, making them a valuable alternative alongside our current kNN model (Dou et al., 2023).

In addition to expanding the dataset and using different machine learning techniques, we also want to implement a feedback loop that allows users to provide insights on the accuracy of the predictions. This feature would enable the model to learn from user input and refine its performance over time, creating an adaptive system that evolves with new data points through an active learning approach.
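Such a feedback loop could be as simple as appending user-verified colour/volume pairs to the dataset so that future kNN queries draw on them; the sketch below is hypothetical and all names are ours:

```python
import numpy as np

# Small illustrative dataset (invented values).
rgb_data = np.array([[200.0, 40.0, 40.0], [40.0, 200.0, 40.0]])
volumes = np.array([[5.0, 60.0, 50.0], [55.0, 5.0, 60.0]])

def add_feedback(rgb_data, volumes, measured_rgb, used_volumes):
    """Append a user-verified (colour, volumes) pair to the dataset."""
    rgb_data = np.vstack([rgb_data, measured_rgb])
    volumes = np.vstack([volumes, used_volumes])
    return rgb_data, volumes

# A user reports that these volumes produced this colour; store the pair.
rgb_data, volumes = add_feedback(rgb_data, volumes,
                                 [120.0, 120.0, 40.0], [20.0, 20.0, 45.0])
print(len(rgb_data))  # one more sample than before
```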


References

Ajala, Sunday, et al. “Comparing Machine Learning and Deep Learning Regression Frameworks for Accurate Prediction of Dielectrophoretic Force.” Scientific Reports, vol. 12, no. 1, July 2022, https://doi.org/10.1038/s41598-022-16114-5. Accessed 20 Sept. 2022.

Alieva, Naila O., et al. “Diversity and Evolution of Coral Fluorescent Proteins.” PLoS ONE, edited by Hany A. El-Shemy, vol. 3, no. 7, July 2008, p. e2680, https://doi.org/10.1371/journal.pone.0002680.

Andrade, Chittaranjan. “Z Scores, Standard Scores, and Composite Test Scores Explained.” Indian Journal of Psychological Medicine, vol. 43, no. 6, Oct. 2021, p. 025371762110465, https://doi.org/10.1177/02537176211046525.

Couronné, Raphael, et al. “Random Forest versus Logistic Regression: A Large-Scale Benchmark Experiment.” BMC Bioinformatics, vol. 19, no. 1, July 2018, https://doi.org/10.1186/s12859-018-2264-5.

Dou, Bozheng, et al. “Machine Learning Methods for Small Data Challenges in Molecular Science.” Chemical Reviews, vol. 123, no. 13, American Chemical Society, June 2023, pp. 8736–80, https://doi.org/10.1021/acs.chemrev.3c00189. Accessed 24 Sept. 2023.

Forrest, Lauren N., et al. “Machine Learning v. Traditional Regression Models Predicting Treatment Outcomes for Binge-Eating Disorder from a Randomized Controlled Trial.” Psychological Medicine, vol. 53, no. 7, 2023, pp. 2777–88.

Kaliappan, Jayakumar, et al. “Performance Evaluation of Regression Models for the Prediction of the COVID-19 Reproduction Rate.” Frontiers in Public Health, vol. 9, Sept. 2021, https://doi.org/10.3389/fpubh.2021.729795.

Lambert, Tamara P., et al. “A Comparison of Normalization Techniques for Individual Baseline-Free Estimation of Absolute Hypovolemic Status Using a Porcine Model.” Biosensors, vol. 14, no. 2, Multidisciplinary Digital Publishing Institute, Jan. 2024, pp. 61–61, https://doi.org/10.3390/bios14020061.

Rácz, Anita, et al. “Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification.” Molecules, vol. 26, no. 4, Feb. 2021, p. 1111, https://doi.org/10.3390/molecules26041111.

Schmidt, Amand F., and Chris Finan. “Linear Regression and the Normality Assumption.” Journal of Clinical Epidemiology, vol. 98, June 2018, pp. 146–51, https://doi.org/10.1016/j.jclinepi.2017.12.006.

Shkrob, Maria A., et al. “Far-Red Fluorescent Proteins Evolved from a Blue Chromoprotein from Actinia Equina.” Biochemical Journal, vol. 392, no. 3, Dec. 2005, pp. 649–54, https://doi.org/10.1042/bj20051314.

Uddin, Shahadat, et al. “Comparative Performance Analysis of K-Nearest Neighbour (KNN) Algorithm and Its Different Variants for Disease Prediction.” Scientific Reports, vol. 12, no. 1, Apr. 2022, https://doi.org/10.1038/s41598-022-10358-x.

Westland, Stephen, and Vien Cheung. “CMYK Systems.” Springer EBooks, Springer Nature, Jan. 2023, pp. 1–7, https://doi.org/10.1007/978-3-642-35947-7_13-3.

Zhang, Zhongheng. “Introduction to Machine Learning: K-Nearest Neighbors.” Annals of Translational Medicine, vol. 4, no. 11, June 2016, pp. 218–18, https://doi.org/10.21037/atm.2016.03.37.

Property testing machine software

We wrote a program in the programming language Scratch, with modifications enabling implementation on the LEGO Mindstorms EV3 set, which served as the base for our property testing device (see Hardware). The code allows the winding mechanism of our property testing machine to be controlled via the EV3 Brick's control panel: the program winds the cable up or down for as long as the associated buttons are pressed. To program this, we used the freely available software “EV3 Classroom”[1]. We provide the file “property testing machine 02.lmsp” below; as an alternative, a picture of the code is attached (Picture 1).
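The control logic of that program (the motor runs only while the matching button is held) can be expressed as a small, hardware-free function; this Python illustration is ours, not the EV3 Classroom code itself, and the speed value is an assumption:

```python
def winding_speed(up_pressed: bool, down_pressed: bool, speed: int = 30) -> int:
    """Return the motor speed for the current button state (0 = stop)."""
    if up_pressed and not down_pressed:
        return speed       # wind the cable up while the up button is held
    if down_pressed and not up_pressed:
        return -speed      # wind the cable down while the down button is held
    return 0               # no button (or both at once) pressed: stop

print(winding_speed(True, False), winding_speed(False, True), winding_speed(False, False))
```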


Picture 1: Code for property testing machine “DAS MASCHIN” created with “EV3 Classroom”.


References

  1. https://education.lego.com/de-de/downloads/mindstorms-ev3/software/, last accessed 29 Sept. 2024, 19:42.