|
Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining methodAbstract: The dual-mining method is based on automatically comparing the strength of patterns mined from a database with the strength of equivalent patterns mined from a relevant knowledgebase. When these two estimates of pattern strength do not match, a high "surprise score" is assigned to the pattern, identifying the pattern as potentially interesting. The surprise score captures the degree of novelty or interestingness of the mined pattern. In addition, we show how to compute p values for each surprise score, thus filtering out noise and attaching statistical significance.We have implemented the dual-mining method using scripts written in Perl and R. We applied the method to a large patient database and a biomedical literature citation knowledgebase. The system estimated association scores for 50,000 patterns, composed of disease entities and lab results, by querying the database and the knowledgebase. It then computed the surprise scores by comparing the pairs of association scores. Finally, the system estimated statistical significance of the scores.The dual-mining method eliminates more than 90% of patterns with strong associations, thus identifying them as uninteresting. We found that the pruning of patterns using the surprise score matched the biomedical evidence in the 100 cases that were examined by hand.The method automates the acquisition of knowledge, thus reducing dependence on the knowledge elicited from human expert, which is usually a rate-limiting step.In the twentieth century, computers began making it possible to store huge amounts of data through mass digital storage. It soon became clear that for stored data to be used, database management systems were needed. Database management systems (DBMS) help a user to locate and retrieve specific data/information in a timely manner [1]. During the past decade, the size of databases, however, has exceeded the capacity of scientists, engineers, and other experts to analyze them. This means most data are currently b
|