|
- 2016
SMASH: A Data-driven Informatics Method to Assist Experts in Characterizing Semantic Heterogeneity among Data Elements
Abstract: Semantic heterogeneity (SH) is detrimental to data interoperability and integration in healthcare. Assessing SH is difficult, yet fundamental to addressing the problem. Using expert-based and data-driven methods, we assessed SH among HIV-associated data elements (DEs). Using ClinicalTrials.gov, we identified and obtained eight data dictionaries and created a DE inventory. We vectorized DEs by study and developed a new method, String Metric-assisted Assessment of Semantic Heterogeneity (SMASH), to sort DEs into three groups: similar in An and Bn, unique to An, and unique to Bn. An HIV expert assessed pairs for semantic equivalence, taking context of usage into account. Heterogeneous DEs were either semantically-equivalent/syntactically-different (HIV-positive/HIV+/Seropositive) or syntactically-equivalent/semantically-different (“Partner” [sexual]/“Partner” [relationship]). SMASH aided identification of SH. Of 1,175 DEs from pairs, 1,048 (87%) were semantically heterogeneous and 127 (13%) were homogeneous. Most heterogeneous pairs (97%) were semantically-equivalent/syntactically-different. Expert-based and data-driven methods are complementary for assessing SH, especially among semantically-equivalent/syntactically-different DEs. Similar expert-based/data-driven solutions are recommended for resolving SH.
|
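The three-way split described in the abstract (DEs similar in both studies, unique to the first, unique to the second) can be illustrated with a minimal sketch. The abstract does not say which string metric or cutoff SMASH uses, so this example substitutes Python's standard-library `difflib.SequenceMatcher` ratio and an arbitrary threshold of 0.8. The function names, the threshold, and the sample DE lists are all hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (stand-in string metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_inventories(study_a, study_b, threshold=0.8):
    """Split two DE lists into similar pairs, unique-to-A, and unique-to-B.

    The threshold is an assumed value, not one reported in the source.
    """
    similar = []
    unique_a = []
    matched_b = set()
    for de_a in study_a:
        # Find the closest DE in study B for this DE from study A.
        best = max(study_b, key=lambda de_b: similarity(de_a, de_b), default=None)
        if best is not None and similarity(de_a, best) >= threshold:
            similar.append((de_a, best))
            matched_b.add(best)
        else:
            unique_a.append(de_a)
    # Anything in B that was never chosen as a match is unique to B.
    unique_b = [de_b for de_b in study_b if de_b not in matched_b]
    return similar, unique_a, unique_b


# Hypothetical DE names for two studies.
study_a = ["HIV-positive", "Partner", "CD4 count"]
study_b = ["HIV positive", "Partner", "Viral load"]

similar, unique_a, unique_b = compare_inventories(study_a, study_b)
print(similar)
print(unique_a)
print(unique_b)
```

A string metric only proposes candidate pairs. The identical "Partner" strings match perfectly here yet may mean different things in each study, and a term such as "Seropositive" would not be matched to "HIV-positive" at all, which is why the abstract pairs the data-driven step with expert review.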