To classify DNA sequences, k-mer frequency is widely used because it converts variable-length sequences into fixed-length numerical feature vectors. However, when the sequences to be classified all have the same length, subsequences starting at a specific position of the given sequence can also be used as categorical features. In a performance evaluation on six datasets of fixed-length DNA sequences, our algorithm based on this idea achieved performance comparable to or better than that of other state-of-the-art algorithms.
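The two feature representations contrasted above can be illustrated with a minimal sketch. It is not the algorithm evaluated in this work: the function names `kmer_frequencies` and `positional_kmers`, the `pos{i}` feature labels, and the restriction to the alphabet A, C, G, T are assumptions made only for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Numerical features: relative frequency of each of the 4**k possible k-mers.

    The output length depends only on k, so sequences of any length
    map to vectors of the same size.
    """
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero for very short input
    # k-mers containing other symbols (e.g. N) are counted in `total`
    # but have no slot in the output vector.
    return [counts[kmer] / total for kmer in all_kmers]


def positional_kmers(seq, k):
    """Categorical features: the k-mer that starts at each position.

    The feature set is only consistent across samples when all
    sequences have the same length.
    """
    return {f"pos{i}": seq[i:i + k] for i in range(len(seq) - k + 1)}


if __name__ == "__main__":
    print(kmer_frequencies("ACGTAC", 1))
    print(positional_kmers("ACGTAC", 2))
```

In this sketch, the frequency vector discards where each k-mer occurs, whereas the positional features keep that information at the cost of requiring equal-length inputs.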