The Y-chromosome short tandem repeat (Y-STR) data were collected mainly for benchmarking the performance of clustering methods. Six Y-STR datasets are presented here, divided into two categories: Y-STR surname data and Y-haplogroup data. The Y-STR data are categorical but unique, differing from other categorical data in that they are composed of many similar and almost similar objects. This characteristic has caused problems for existing clustering algorithms when they are applied to Y-STR data.

1. Introduction

Y-chromosome short tandem repeats (Y-STRs) are the tandem repeats on the Y-chromosome. A Y-STR value represents the number of times an STR motif repeats and is often called the allele value of the marker. Most marker names begin with the prefix D, which stands for DNA, followed by Y, which stands for Y-chromosome, and S, which stands for a single-copy sequence, and end with a number identifying the location on the Y-chromosome, often known as the locus. This nomenclature follows the international standard of the HUGO Gene Nomenclature Committee of the Human Genome Organisation (HUGO; http://www.hugo-international.org/). For example, if the allele value of the DYS391 marker is eight, the STR consists of eight repeats of the motif: [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA]. The number of tandem repeats has been used effectively to characterize individuals and to differentiate between two people.

Y-STR data are now actively used in genetic genealogy and anthropology studies, as in Hart [1], Smolenyak and Turner [2], Pomery [3], Sykes [4], Shawker [5], Fitzpatrick [6], and Fitzpatrick and Yeiser [7]. In genetic genealogy, the method is used to trace similar groups in Y-surname projects in support of traditional genealogical study. In the wider perspective of anthropological studies, the method is also used to establish groups of males, often called haplogroups, across geographical areas throughout the world.
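The relationship between a repeated motif and its allele value, as in the DYS391 example, can be illustrated with a minimal Python sketch. The function names `count_tandem_repeats` and `mismatching_markers` are hypothetical helpers introduced here for illustration, and the allele values given for DYS393 and DYS19 are made-up placeholders rather than values from the datasets. Counting mismatching markers is only one simple way of comparing two categorical haplotypes and is not presented as the method used in the cited studies.

```python
import re


def count_tandem_repeats(sequence: str, motif: str) -> int:
    """Return the longest uninterrupted run of `motif`, counted in motif units."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    # Accept the bracketed notation used in the text, e.g. "[TCTA] [TCTA]",
    # by dropping every character that is not a letter.
    cleaned = re.sub(r"[^A-Za-z]", "", sequence).upper()
    motif = motif.upper()
    # Each match is one maximal run of back-to-back copies of the motif.
    runs = re.findall(f"(?:{re.escape(motif)})+", cleaned)
    return max((len(run) // len(motif) for run in runs), default=0)


def mismatching_markers(a: dict, b: dict) -> int:
    """Count the markers present in both haplotypes whose allele values differ."""
    shared = a.keys() & b.keys()
    return sum(1 for marker in shared if a[marker] != b[marker])


# The DYS391 example from the text: eight repeats of the TCTA motif.
dys391_fragment = " ".join(["[TCTA]"] * 8)
allele = count_tandem_repeats(dys391_fragment, "TCTA")

# Two hypothetical haplotypes; DYS393 and DYS19 values are placeholders.
person_a = {"DYS391": allele, "DYS393": 13, "DYS19": 14}
person_b = {"DYS391": 10, "DYS393": 13, "DYS19": 14}
diff = mismatching_markers(person_a, person_b)

print(f"DYS391 allele value: {allele}")
print(f"Markers that differ: {diff}")
```

Because allele values are compared only for equality in this sketch, they are treated as categorical labels rather than as numbers, which matches the categorical view of the Y-STR data taken in this paper.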
Haplogroups are studied with reference to mitochondrial DNA and Y-chromosomes [1]. As a consequence, a reputable reference known as the modal haplotype, used for defining groups of males all over the world, has been made available (see http://www.isogg.org/ for details). The modal haplotype is actually a haplotype diversity where the degree of relatedness has become spread out. The Y-STR data have been applied in clustering Y-surname and Y-haplogroup applications. Initial benchmarking results of clustering Y-STR data have been reported (see, e.g., [8–12]). Furthermore, the Y-STR data and their clustering
References
[1]
A. Hart, How to Interpret Family History and Ancestry DNA Test Results for Beginners: The Geography and History of your Relatives, ASJA Press, New York, NY, USA, 2004.
[2]
M. S. Smolenyak and A. Turner, Trace your Roots with DNA: Using Genetic Tests to Explore your Family Tree, Rodale Inc., 2004.
[3]
C. Pomery, Family History in the Genes: Trace your Family Tree, The National Archives, Surrey, UK, 2007.
[4]
B. Sykes, The Seven Daughters of Eve, W. W. Norton and Company, New York, NY, USA, 2001.
[5]
T. H. Shawker, Unlocking your Genetic History: A Step-By-Step Guide to Discovering your Family Medical and Genetic Heritage, Rutledge Hill Press, 2004.
[6]
C. Fitzpatrick, Forensic Genealogy, Rice Book Press, Fountain Valley, Calif, USA, 2005.
[7]
C. Fitzpatrick and A. Yeiser, DNA and Genealogy, Rice Book Press, Fountain Valley, Calif, USA, 2005.
[8]
A. Seman, Z. Abu Bakar, and A. M. Sapawi, “Centre-based clustering for Y-Short Tandem Repeats (Y-STR) as numerical and categorical data,” in Proceedings of the International Conference on Information Retrieval and Knowledge Management (CAMP '10), pp. 28–33, Shah Alam, Malaysia, March 2010.
[9]
A. Seman, Z. A. Bakar, and A. M. Sapawi, “Attribute value weighting in K-modes clustering for Y-short tandem repeats (Y-STR) surname,” in Proceedings of the International Symposium on Information Technology (ITSim '10), pp. 1531–1536, Kuala Lumpur, Malaysia, June 2010.
[10]
A. Seman, Z. A. Bakar, and A. M. Sapawi, “Modeling centre-based hard and soft clustering for y chromosome short tandem repeats (YSTR) data,” in Proceedings of the International Conference on Science and Social Research (CSSR '10), pp. 68–73, Kuala Lumpur, Malaysia, December 2010.
[11]
A. Seman, Z. A. Bakar, and N. Daud, “Hard and soft updating centroids for clustering Y-Short tandem repeats (Y-STR) data,” in Proceedings of the IEEE Conference on Open Systems (ICOS '10), pp. 6–11, Kuala Lumpur, Malaysia, December 2010.
[12]
A. Seman, Z. Abu Bakar, and A. M. Sapawi, “Centre-based Hard Clustering Algorithm for Y-STR Data,” Malaysia Journal of Computing, vol. 1, pp. 62–73, 2010.
[13]
A. Seman, Z. Abu-Bakar, and A. M. Sapawi, “Centre-based hard and soft clustering approaches for Y-STR data,” Journal of Genetic Genealogy, vol. 6, no. 1, pp. 1–9, 2010.
[14]
A. Seman, Z. Abu Bakar, and M. N. Isa, “Evaluation of k-Mode-type algorithms for clustering Y-short tandem repeats,” Journal of Trends in Bioinformatics, vol. 5, no. 2, pp. 47–52, 2012.
[15]
A. Seman, Z. Abu Bakar, and M. N. Isa, “An efficient clustering algorithm for partitioning Y-short tandem repeats data,” BMC Research Notes, vol. 5, no. 1, article 557, 2012.
[16]
Z. Huang, “Extensions to the k-means algorithm for clustering large data sets with categorical values,” Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283–304, 1998.
[17]
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, Calif, USA, 2001.