|
- 2017
Phylogeny reconstruction based on the length distribution of k-mismatch common substrings
DOI: 10.1186/s13015-017-0118-8
Keywords: Alignment-free, Phylogeny, Kmacs, Average common substring, Pattern matching
Abstract: k-mismatch common substrings with k = 2. For position i = 5 in S1, kmacs searches for the longest substring of S1 starting at i that exactly matches a substring of S2. This is the substring starting at position 2 in S2 (matching substrings shown in red). It then extends this match without gaps until the (k + 1)-th mismatch is reached. In this example, the k-mismatch common substring consists of the red, blue and green substrings, together with the two mismatches between them, and has length 12. In the paper, the lengths of these k-mismatch common substrings are modelled by the random variables $X_i^{(k)}$, defined in (1). The original version of kmacs uses the average length of these k-mismatch common substrings to assign a distance value to a pair of sequences. In our modified implementation of kmacs, we consider the k-mismatch extension of the longest common substring at i. That is, the program returns the length of the k-mismatch substring match that starts after the first mismatch following the longest common substring. In our example, for i = 5, this is the substring match starting with ‘T’ at position 11 in S1 and at position 8 in S2, consisting of the blue, green and orange matches; the length of this k-mismatch extension is 9. The lengths of these k-mismatch extensions are modelled by the random variables $\hat{X}_i^{(k)}$, defined in (16
|
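The two lengths described above can be illustrated with a minimal Python sketch. It is not the kmacs implementation: it finds the longest exact match with a naive scan over S2 rather than whatever index structure the program itself uses, it uses 0-based indices (so i = 5 in the text becomes index 4), and the function names and the two example sequences are hypothetical, constructed only so that the lengths 12 and 9 from the description are reproduced.

```python
def longest_exact_match(s1, s2, i):
    """Return (length, j): the longest substring of s1 starting at index i
    that exactly matches a substring of s2, and its start index j in s2.
    Naive scan over all start positions in s2 (assumption: for illustration only)."""
    best_len, best_j = 0, -1
    for j in range(len(s2)):
        length = 0
        while (i + length < len(s1) and j + length < len(s2)
               and s1[i + length] == s2[j + length]):
            length += 1
        if length > best_len:
            best_len, best_j = length, j
    return best_len, best_j


def extend_with_mismatches(s1, s2, i, j, k):
    """Length of the gap-free match starting at (i, j) that stops just
    before the (k+1)-th mismatch; the first k mismatches are counted."""
    length = 0
    mismatches = 0
    while i + length < len(s1) and j + length < len(s2):
        if s1[i + length] != s2[j + length]:
            if mismatches == k:
                break
            mismatches += 1
        length += 1
    return length


def kmismatch_length(s1, s2, i, k):
    """Length of the k-mismatch common substring at i (original variant)."""
    exact_len, j = longest_exact_match(s1, s2, i)
    if exact_len == 0:
        return 0
    return extend_with_mismatches(s1, s2, i, j, k)


def kmismatch_extension_length(s1, s2, i, k):
    """Length of the k-mismatch extension (modified variant): the match
    starts one position after the first mismatch that ends the exact match."""
    exact_len, j = longest_exact_match(s1, s2, i)
    if exact_len == 0:
        return 0
    return extend_with_mismatches(s1, s2, i + exact_len + 1, j + exact_len + 1, k)


# Hypothetical sequences, not the ones from the original figure.
S1 = "TTTTACGTACTGACCATGGA"
S2 = "GACGTAGTGTCCACGGT"

print(longest_exact_match(S1, S2, 4))            # (5, 1)
print(kmismatch_length(S1, S2, 4, 2))            # 12
print(kmismatch_extension_length(S1, S2, 4, 2))  # 9
```

In this toy example the exact match of length 5 starts at index 4 in S1 and index 1 in S2, so the extension begins at indices 10 and 7, which correspond to positions 11 and 8 in the 1-based numbering used in the text.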