A Survey of Visual Data Cleaning

DOI: 10.11834/jig.20150402

Keywords: data cleaning, visual cleaning, visual analytics, information visualization, data analysis

Abstract:

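Objective: Data cleaning is a long-standing and troublesome problem. As visualization technology develops, visual data cleaning is bound to become one of the important approaches to data cleaning. This paper describes the main data quality problems and the process of visual data cleaning, reviews the state of research on visual data cleaning (including the sources and classification of data quality problems and the methods of visual data cleaning), and summarizes, based on the existing literature, the main challenges and opportunities that visual data cleaning faces. Method: Because data cleaning methods and strategies depend on the specific data quality problem, this paper uses the different data quality problems as the organizing thread for summarizing and reviewing visual data cleaning methods and strategies. Results: According to the data quality problem addressed, visual cleaning methods are grouped into direct visual cleaning, visualization of missing data, visualization of uncertain data, visual data transformation, and sharing of data cleaning resources. For each data quality problem, the corresponding challenges and directions for further research are summarized. Conclusion: The paper summarizes visual data cleaning and its outlook, and points out that visual data cleaning will be one of the most promising future research directions in the field of data cleaning.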
