%0 Journal Article
%T First Y-Short Tandem Repeat Categorical Dataset for Clustering Applications
%A Ali Seman
%A Zainab Abu Bakar
%A Mohamed Nizam Isa
%J Dataset Papers in Science
%D 2013
%R 10.7167/2013/364725
%X The Y-chromosome short tandem repeat (Y-STR) data were collected mainly for benchmarking the performance of clustering methods. Six Y-STR dataset items are presented here, divided into two categories: Y-STR surname data and Y-haplogroup data. The Y-STR data are categorical, unique, and different from other categorical data. They consist of many similar and almost similar objects, a characteristic that causes problems for existing clustering algorithms.
1. Introduction
Y-chromosome short tandem repeats (Y-STRs) are the tandem repeats on the Y-chromosome. A Y-STR value represents the number of times an STR motif repeats and is often called the allele value of the marker. Most marker names begin with the prefix D, which stands for DNA, then Y, which stands for Y-chromosome, and S, which stands for a single-copy sequence, followed by the location on the Y-chromosome, often known as the locus. This nomenclature follows the international standard set by the HUGO Gene Nomenclature Committee (HUGO; http://www.hugo-international.org/). For example, if the allele value of the DYS391 marker is eight, the STR consists of eight repeats of the motif: [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA]. The number of tandem repeats has been used effectively to characterize individuals and to differentiate between two people. Y-STR data are now actively used in genetic genealogy and anthropology studies, as in Hart [1], Smolenyak and Turner [2], Pomery [3], Sykes [4], Shawker [5], Fitzpatrick [6], and Fitzpatrick and Yeiser [7]. The method is used to trace similar groups in Y-surname projects so as to support traditional genealogical study.
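The relationship between a repeated motif and its allele value, as in the DYS391 example, can be illustrated with a minimal Python sketch. The function name `allele_value` and the flanking bases in the sample fragment are hypothetical and chosen only for illustration. The sketch counts the longest run of consecutive motif copies in a string and is not how laboratory genotyping is actually performed.

```python
def allele_value(sequence: str, motif: str) -> int:
    """Return the longest run of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    best = 0
    for start in range(len(sequence)):
        run = 0
        # Count consecutive motif copies beginning at this position.
        while sequence.startswith(motif, start + run * len(motif)):
            run += 1
        best = max(best, run)
    return best


# Hypothetical fragment: eight TCTA repeats with made-up flanking bases.
fragment = "GG" + "TCTA" * 8 + "CC"
print(allele_value(fragment, "TCTA"))
```

Under these assumptions, the fragment with eight [TCTA] repeats has an allele value of eight, matching the example in the text.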
Furthermore, in a wider perspective such as anthropological studies, the method is also used to establish groups of males, often called haplogroups, across geographical areas throughout the world. Haplogroups are studied with reference to mitochondrial DNA and Y-chromosomes [1]. As a consequence, a reputable reference known as the modal haplotype, used for defining groups of males all over the world, has been made available (see http://www.isogg.org/ for details). The modal haplotype is actually a haplotype diversity where the degree of relatedness has become spread out. Y-STR data have been used in clustering applications for Y-surnames and Y-haplogroups. Initial benchmarking results of clustering Y-STR data have been reported (see, e.g., [8–12]). Furthermore, the Y-STR data and their clustering
%U http://www.hindawi.com/journals/dpis/2013/364725/