Natural Science 2021

Prediction of Protein Expression and Growth Rates by Supervised Machine Learning

DOI: 10.4236/ns.2021.138025, pp. 301-330

Simiao Zhao

Keywords: DNA Sequences, Protein Production, Growth Rate, Supervised Machine Learning

Abstract:

The DNA sequences of an organism strongly influence its transcription and translation processes, and thus its protein production and growth rate. Because of the complexity of DNA, it has been extremely difficult to predict the macroscopic characteristics of organisms. However, with the rapid development of machine learning in recent years, it has become possible to use powerful machine learning algorithms to process and analyze biological data. Based on the synthetic DNA sequences of a specific microbe, E. coli, I designed a process to predict its protein production and growth rate. After observing the properties of a data set constructed in previous work, I chose supervised learning regressors with encoded DNA sequences as input features to perform the predictions. After comparing different encoders and algorithms, I selected three encoders to encode the DNA sequences as inputs and trained seven different regressors to predict the outputs. The hyperparameters were optimized for the three regressors with the best potential prediction performance. Finally, I predicted the protein production and growth rates with best R² scores of 0.55 and 0.77, respectively, by using encoders to capture the latent features of the DNA sequences.
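
The workflow summarized in the abstract (encode each sequence, fit a regressor, score it with R²) can be illustrated with a small standard-library sketch. This is not the paper's implementation: the abstract does not name the three encoders or the seven regressors, so the one-hot encoding, the k-nearest-neighbour regressor, and the GC-content toy target below are all assumptions chosen only to make the steps concrete.

```python
import math
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a flat list of 0/1 floats, four per base.

    One-hot encoding is an assumed stand-in for the paper's encoders.
    """
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training vectors."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def toy_dataset(n, length=20, seed=0):
    """Random sequences with a made-up target (GC fraction plus noise).

    Purely synthetic: it has no connection to the E. coli data set.
    """
    rng = random.Random(seed)
    seqs = ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(n)]
    targets = [
        (s.count("G") + s.count("C")) / length + rng.gauss(0, 0.02)
        for s in seqs
    ]
    return seqs, targets


def evaluate_toy(n_train=200, n_test=50, k=5):
    """Fit on one toy split, predict the held-out split, and return R²."""
    seqs, targets = toy_dataset(n_train + n_test)
    x = [one_hot(s) for s in seqs]
    train_x, test_x = x[:n_train], x[n_train:]
    train_y, test_y = targets[:n_train], targets[n_train:]
    preds = [knn_predict(train_x, train_y, q, k=k) for q in test_x]
    return r2_score(test_y, preds)
```

Calling `evaluate_toy()` returns a single R² value for the held-out toy sequences. A score of 1 means perfect prediction and a score of 0 means the model does no better than predicting the mean, which gives a scale for reading the reported 0.55 and 0.77.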
