Combined Use of k-Mer Numerical Features and Position-Specific Categorical Features in Fixed-Length DNA Sequence Classification

DOI: 10.4236/jbise.2017.108030, pp. 390-401

Keywords: Sequence Classification, Numerical and Categorical Features, Feature Selection

Abstract:

k-mer frequency is widely used to classify DNA sequences because it converts variable-length sequences into fixed-length numerical feature vectors. However, when the sequences to be classified all have the same fixed length, the subsequence starting at each specific position of a sequence can also be used as a categorical feature. In a performance evaluation on six datasets of fixed-length DNA sequences, our algorithm based on this idea achieved performance comparable to or better than that of other state-of-the-art algorithms.
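The two kinds of features described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names (`kmer_frequencies`, `positional_kmers`, `combined_features`), the use of relative rather than raw frequencies, and the handling of ambiguous bases are all assumptions made for illustration. The abstract does not specify these details, nor how the features are selected or fed to a classifier.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet; windows containing other symbols are skipped


def kmer_frequencies(seq, k):
    """Numerical features: relative frequency of each of the 4**k possible k-mers.

    The output length depends only on k, not on len(seq), which is why this
    representation also works for variable-length sequences.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # ignore windows with ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def positional_kmers(seq, k):
    """Categorical features: the k-mer that starts at each position.

    Feature i has the same meaning across sequences only when all
    sequences have the same length.
    """
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def combined_features(seq, k):
    """Concatenate the numerical and the categorical features (hypothetical helper)."""
    return kmer_frequencies(seq, k) + positional_kmers(seq, k)
```

For a fixed sequence length n, `combined_features` yields 4**k numerical values followed by n - k + 1 categorical values, so every sequence in a dataset maps to a vector with the same layout. A classifier that accepts mixed numerical and categorical inputs, or an encoding step for the categorical part, would still be needed on top of this.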
