DNAVec: Pre-Trained Word Vector Representation of Genomic DNA Sequences

DOI: 10.12677/HJBM.2021.113016, PP. 121-128

Keywords: BERT, DNA Sequence, Pre-Training, Natural Language Processing


Abstract:

Deciphering the information encoded in DNA sequences is one of the fundamental problems of genomic research. Gene regulatory coding is complicated by the presence of polysemous relationships, and previous bioinformatics methods often fail to capture the implicit information in DNA sequences, especially when data are scarce. Predicting the structure and function of DNA sequences from sequence information is therefore an important challenge in computational biology. To address this challenge, we introduce a new approach that represents DNA sequences as continuous word vectors using BERT, a language model from the field of natural language processing. By modelling DNA sequences, BERT effectively captures their sequence properties from large amounts of unlabelled data. We refer to this new embedding representation of DNA sequences as DNAVec (DNA-to-Vector). In addition, the pre-trained word vectors can be extracted from the model to represent DNA sequences for other sequence-level classification tasks.
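As a rough illustration of the "DNA sequence as text" view described in the abstract, the sketch below tokenizes a DNA sequence into overlapping k-mer "words" and mean-pools per-word vectors into a single fixed-length, sequence-level embedding. This is a minimal sketch under stated assumptions: the k-mer length (k = 6), the embedding dimension (768), and the helper names are illustrative choices, and the random lookup table stands in for the actual DNAVec vectors, which the paper extracts from a pre-trained BERT model.

```python
import numpy as np

def dna_to_kmers(sequence: str, k: int = 6) -> list:
    """Split a DNA sequence into overlapping k-mer 'words'.

    k = 6 is an assumed value for illustration; the abstract does not
    fix a specific k-mer length.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy stand-in for pre-trained DNAVec word vectors: a random lookup
# table keyed by k-mer. A real pipeline would instead load the vectors
# extracted from the pre-trained BERT model.
rng = np.random.default_rng(0)
EMBED_DIM = 768  # assumed dimension, matching BERT-base

def embed_kmer(kmer: str, table: dict) -> np.ndarray:
    if kmer not in table:
        table[kmer] = rng.standard_normal(EMBED_DIM)
    return table[kmer]

def sequence_embedding(sequence: str, table: dict, k: int = 6) -> np.ndarray:
    """Mean-pool k-mer vectors into one sequence-level representation."""
    vectors = [embed_kmer(km, table) for km in dna_to_kmers(sequence, k)]
    return np.mean(vectors, axis=0)

if __name__ == "__main__":
    table = {}
    vec = sequence_embedding("ATGCGTACGTTAGC", table)
    print(vec.shape)  # (768,) -- one fixed-length vector per DNA sequence
```

A fixed-length vector of this kind could then be passed to any standard classifier for the sequence-level classification tasks mentioned in the abstract.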

