Prediction of Protein Expression and Growth Rates by Supervised Machine Learning

DOI: 10.4236/ns.2021.138025, PP. 301-330

Keywords: DNA Sequences, Protein Production, Growth Rate, Supervised Machine Learning


Abstract:

The DNA sequences of an organism have an important influence on its transcription and translation processes, and thus affect its protein production and growth rate. Due to the complexity of DNA, it has been extremely difficult to predict the macroscopic characteristics of organisms from their sequences. However, with the rapid development of machine learning in recent years, it has become possible to use powerful machine learning algorithms to process and analyze biological data. Based on synthetic DNA sequences of a specific microbe, E. coli, I designed a process to predict its protein production and growth rate. After examining the properties of a data set constructed in previous work, I chose supervised learning regressors, with encoded DNA sequences as input features, to perform the predictions. After comparing different encoders and algorithms, I selected three encoders to encode the DNA sequences as inputs and trained seven different regressors to predict the outputs. The hyper-parameters were optimized for the three regressors with the best potential prediction performance. Finally, I predicted protein production and growth rate with best R² scores of 0.55 and 0.77, respectively, using encoders to capture latent features of the DNA sequences.
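The encode-then-regress workflow summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name its three encoders or seven regressors, so one-hot encoding and a plain k-nearest-neighbor regressor are assumed here purely for illustration, and the sequences and target values are made-up toy data. The function names (one_hot_encode, knn_predict, r2_score) are hypothetical helpers written with the Python standard library only.

```python
from typing import List, Sequence

BASES = "ACGT"  # assumed alphabet; ambiguous bases are not handled


def one_hot_encode(seq: str) -> List[int]:
    """Encode a DNA string as a flat one-hot vector (4 entries per base)."""
    vec: List[int] = []
    for base in seq.upper():
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def knn_predict(train_x: Sequence[Sequence[int]],
                train_y: Sequence[float],
                query: Sequence[int],
                k: int = 2) -> float:
    """Average the targets of the k training vectors closest to `query`
    (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data (invented for this sketch, not taken from the paper's data set).
train_seqs = ["ACGT", "ACGA", "TTTT", "TTTA"]
train_y = [1.0, 1.2, 3.0, 2.8]
test_seqs = ["ACGC", "TTTC"]
test_y = [1.0, 3.0]

train_x = [one_hot_encode(s) for s in train_seqs]
preds = [knn_predict(train_x, train_y, one_hot_encode(s)) for s in test_seqs]
print("predictions:", preds)
print("R^2:", r2_score(test_y, preds))
```

In practice, each encoder would be swapped in at the one_hot_encode step and each regressor at the knn_predict step, with the R² score computed on held-out sequences used to compare the combinations.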

