%0 Journal Article %T Plant MicroRNA Prediction by Supervised Machine Learning Using C5.0 Decision Trees %A Philip H. Williams %A Rod Eyles %A Georg Weiller %J Journal of Nucleic Acids %D 2012 %I Hindawi Publishing Corporation %R 10.1155/2012/652979 %X MicroRNAs (miRNAs) are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require "read count" to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read-count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA* duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation. 1. Introduction MicroRNAs (miRNAs) are nonprotein coding RNAs of between 20 and 22 nucleotides that attenuate protein production by cleavage, translational inhibition, or sequestering of mRNA in P bodies [1]. They are implicated in several different biological pathways, including plant and animal development, and cancer [2–4].
To better understand the role that miRNAs play in these pathways, large datasets containing RNA-seq, expressed sequence tags (ESTs), and genomic sequences are being investigated for new miRNAs [5, 6]. As these datasets grow at an ever-increasing rate, their rapid analysis has become critical. Understanding miRNA biogenesis is important when developing predictive models. The mature miRNA originates from an expressed RNA precursor. The precursor folds back to base pair with itself to form a characteristic stem-loop structure. However, not all stem-loop structures are miRNA precursors. The dicer protein cuts a short, double-stranded RNA (miRNA:miRNA* duplex) from the precursor. This double-stranded RNA associates with the RISC complex, where the mature miRNA is retained while the miRNA* is assumed to degrade [7]. The miRNA-loaded RISC complex is responsible for %U http://www.hindawi.com/journals/jna/2012/652979/