Computer Science (计算机科学), 2005
CuMen: Clustering Sequences Based on Maximal Frequent Sequential Pattern and its Application in Genome Sequence Assembly
Abstract:
Genome sequencing is fundamental to biological research. A variety of assembly programs have been proposed and implemented. Because of their high computational complexity and increasingly large datasets, they incur substantial time and space overhead. In practical applications, the sequencing process can become unacceptably slow due to insufficient memory, even on a mainframe with a large amount of RAM. This paper offers a clustering algorithm based on maximal frequent sequential patterns, aiming to divide the whole dataset into several parts that can be processed independently and efficiently in limited memory. Several techniques are applied to optimize the mining and clustering procedures. The approach is also introduced into a grid environment, exploiting parallelism and distribution to further improve scalability.
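
The general idea of partitioning reads by shared maximal frequent patterns can be illustrated with a minimal sketch. This is not the CuMen algorithm itself, whose details the abstract does not give. It assumes that patterns are contiguous substrings, that a pattern is frequent when at least `min_support` reads contain it, and that reads sharing a maximal frequent pattern belong to the same cluster (merged with union-find). All function names and parameters are hypothetical.

```python
from collections import defaultdict


def frequent_substrings(reads, min_support, min_len, max_len):
    """Map each substring (length min_len..max_len) to the reads containing it,
    keeping only substrings found in at least min_support reads."""
    occurrences = defaultdict(set)
    for idx, read in enumerate(reads):
        for length in range(min_len, max_len + 1):
            for start in range(len(read) - length + 1):
                occurrences[read[start:start + length]].add(idx)
    return {p: ids for p, ids in occurrences.items() if len(ids) >= min_support}


def maximal_patterns(frequent):
    """Keep only frequent patterns not contained in a longer frequent pattern."""
    patterns = list(frequent)
    return {
        p: frequent[p]
        for p in patterns
        if not any(p != q and p in q for q in patterns)
    }


def cluster_reads(reads, min_support=2, min_len=4, max_len=8):
    """Group read indices so that reads sharing a maximal pattern end up together."""
    parent = list(range(len(reads)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    frequent = frequent_substrings(reads, min_support, min_len, max_len)
    for ids in maximal_patterns(frequent).values():
        ordered = sorted(ids)
        for other in ordered[1:]:
            parent[find(other)] = find(ordered[0])

    clusters = defaultdict(list)
    for i in range(len(reads)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


reads = ["ACGTACGT", "GTACGTTT", "CCCCGGGG", "CGGGGAAA"]
clusters = cluster_reads(reads)
print(clusters)
```

Each resulting cluster could then be handed to an assembler independently, which is the memory-saving property the abstract describes. The brute-force substring enumeration here is only suitable for toy inputs.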