A Method for Identifying Japanese Shop and Company Names by Spatiotemporal Cleaning of Eccentrically Located Frequently Appearing Words

DOI: 10.1155/2012/562604

Abstract:

We have developed a method for spatiotemporally integrating databases of shop and company information, such as digital telephone directories, in order to monitor dynamic urban transformations in detail. To realize this, an additional method is necessary to verify whether different instances of Japanese shop and company names, whose descriptions may fluctuate, refer to the same entity. In this paper, we discuss a method that utilizes an N-gram model for comparing and identifying Japanese words. The processing accuracy was improved by developing various kinds of libraries of frequently appearing words and using these libraries to clean shop and company names. In addition, the accuracy was greatly improved, in a novel way, through the detection of those frequently appearing words that appear eccentrically across both space and time. By utilizing natural language processing (NLP), our method incorporates a novel technique for the advanced processing of spatial and temporal data.

1. Introduction

Spatiotemporal changes of shop and company locations have a major effect on the vitality and attraction of urban space. It is a significant challenge to monitor these changes quantitatively, and in as detailed a manner as possible, for use in various fields including urban engineering, geography, and economics. However, it is difficult to comprehensively monitor urban spaces, because much general regional and statistical information (e.g., the population census, commercial statistics) is compiled only per administrative unit or city block. On the other hand, detailed information on shop and company locations and names can be collected using telephone directories and web information. Fortunately, this is possible in Japan, because of the availability of digital telephone directories and detailed digital maps that can monitor almost all residents and tenants in a given building.
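The idea of cleaning names with a library of frequently appearing words and then comparing them with an N-gram model can be illustrated with a minimal Python sketch. It is not the authors' implementation: the similarity measure (a Dice coefficient over character bigrams), the function names, and the small `FREQUENT_WORDS` list are all assumptions chosen for illustration, since this excerpt does not give the paper's actual formula or libraries.

```python
from collections import Counter

# Hypothetical library of frequently appearing words (company-type
# designators). The paper's actual libraries are not shown in this excerpt.
FREQUENT_WORDS = ("株式会社", "有限会社", "(株)", "(有)")


def clean_name(name, frequent_words=FREQUENT_WORDS):
    """Remove every listed frequent word from a shop or company name."""
    for word in frequent_words:
        name = name.replace(word, "")
    return name.strip()


def char_ngrams(text, n=2):
    """Return a multiset of the character n-grams in text."""
    if len(text) < n:
        # Strings shorter than n are treated as a single gram.
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_similarity(a, b, n=2):
    """Dice coefficient over character n-grams (assumed measure), in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2.0 * shared / total


if __name__ == "__main__":
    name_a = "株式会社山田商店"
    name_b = "山田商店(株)"
    print(ngram_similarity(name_a, name_b))
    print(ngram_similarity(clean_name(name_a), clean_name(name_b)))
```

In this sketch, the two raw names differ only in how the company type is written, so their similarity rises once the designators are stripped. Character-level N-grams are used here because Japanese text has no spaces between words, which makes them usable without a morphological analyzer.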
The yearly continuations and changes in tenants or residents can be monitored for a certain location, and we can integrate these data across multiple years. The same can be done for shop and company locations over multiple years, by measuring changes in shop and company names. However, this measurement is not easy, because names fluctuate between two different years or between different kinds of data. Therefore, we have been developing a dataset that can monitor the time-series changes of each shop and company, and a system that can produce such data, in order to resolve this challenge [1, 2]. This paper focuses on a particular method of name identification, pertinent

References

[1]  Y. Akiyama and R. Shibasaki, “Development of detailed spatio-temporal urban data through the integration of digital maps and yellow page data and feasibility study as complementary data for existing statistical information,” in Proceedings of the Computers in Urban Planning and Urban Management (CUPUM '09), 187, 2009.
[2]  Y. Akiyama, T. Shibuki, and R. Shibasaki, “Development of three dimensional monitoring dataset for tenants variations in broad urban area by spatio-temporal integrating digital house maps and yellow page data,” in Proceedings of the 4th International Conference on Intelligent Environments (IE '08), 2008.
[3]  T. Ato, K. Omura, T. Arata, and S. Hujii, “The stagnation of commercial accumulation districts in front of the stations in the suburbs of the Tokyo metropolitan area: a study of Honatsugi and Odawara,” City Planning Review, vol. 41, no. 3, pp. 1037–1042, 2006.
[4]  H. Ai, Y. Sadahiro, and Y. Asami, “Spatio-temporal analysis of building location and building use in middle scale commercial accumulation districts,” City Planning Review, vol. 43, no. 3, pp. 103–108, 2008.
[5]  K. Ito and H. Magaribuchi, “Method for making spatio-temporal data from accumulated information: using the identification by resolving geometric and non-geometric ambiguity,” in Proceedings of the Geographic Information Systems Association, vol. 10, pp. 147–150, 2001.
[6]  R. Florian, H. Hassan, A. Ittycheriah et al., “A statistical model for multilingual entity detection and tracking,” in Proceedings of the Human Language Technologies Conference (HLT-NAACL '04), pp. 1–8, May 2004.
[7]  Q. Tri Tran, T. X. Thao Pham, Q. Hung Ngo, D. Dinh, and N. Collier, “Named entity recognition in Vietnamese documents,” Progress in Informatics, no. 4, pp. 5–13, 2007.
[8]  E. F. Tjong Kim Sang and F. D. Meulder, “Introduction to the CoNLL-2003 Shared task: language-independent named entity recognition,” in Proceedings of the 7th Conference on Natural Language Learning (HLT-NAACL '03), vol. 4, pp. 142–147, 2003.
[9]  R. Florian, A. Ittycheriah, H. Jing, and T. Zhang, “Named entity recognition through classifier combination,” in Proceedings of the 7th Conference on Natural Language Learning (HLT-NAACL '03), vol. 4, pp. 168–171, 2003.
[10]  H. L. Chieu and H. T. Ng, “Named entity recognition: a maximum entropy approach using global information,” in Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–7, 2002.
[11]  R. Steinberger and B. Pouliquen, “Cross-lingual named entity recognition,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 135–162, 2007.
[12]  T. Bogers, Dutch named entity recognition: optimizing features, algorithms, and output, Ph.D. thesis, University of Van Tilburg, 2004.
[13]  C. Sporleder, M. V. Erp, T. Porcelijn, A. V. Bosch, and P. Arntzen, “Identifying named entities in text databases from the natural history domain,” in Proceedings of the 5th International Conference on Language Resources and Evaluation, pp. 1742–1745, 2006.
[14]  S. Sato, M. Harada, and K. Kazama, “Measuring similarity among information sources by comparing string frequency distributions,” Information Processing Society of Japan Digital Document, vol. 2002, no. 28, pp. 119–126, 2002.
[15]  T. Kawakami and H. Suzuki, “A calculation of word similarity using decision list,” IPSJ SIG Technical Report, vol. 2006, no. 94, pp. 85–90, 2006.
[16]  K. Mishina, S. Tsuchita, S. Kurokawa, and R. Huji, “An emotion similarity calculation using N-gram frequency,” IEICE Technical Report, vol. 107, no. 158, pp. 37–42, 2007, NLC2007-7.
[17]  D. Cali, A. Condorelli, S. Papa, M. Rata, and L. Zagarella, “Improving intelligence through use of natural language processing. A comparison between NLP interfaces and traditional visual GIS interfaces,” Procedia Computer Science, vol. 5, pp. 920–925, 2011.
[18]  B. Bitters, “Geospatial reasoning in a natural language processing (NLP) environment,” in Proceedings of the 25th International Cartographic Conference, CO-253, July 2011.
[19]  S. Miyagawa, “The Japanese Language,” MIT JP NET, 2011, http://web.mit.edu/jpnet/articles/JapaneseLanguage.html.
[20]  D. Klein, J. Smarr, H. Nguyen, and C. D. Manning, “Named entity recognition with character-level models,” in Proceedings of the 7th Conference on Natural Language Learning (HLT-NAACL '03), vol. 4, pp. 180–183, 2003.
[21]  S. Kuno, The Structure of the Japanese Language. Current Studies in Linguistics, MIT Press, 1st edition, 1973.
[22]  C. E. Shannon, A Mathematical Theory of Communication, University of Illinois Press, 1948.
[23]  M. Kondo, An Analysis of Japanese Classical Literature Using Character-Based N-Gram Model, vol. 29, Chiba University, Zinbun Kenkyu, 2000.
[24]  T. Odaka, T. Murata, J. Gao et al., “A proposal on student report scoring system using N-gram text analysis method,” Journal of Institute of Electronics, Information, and Communication Engineers, vol. 86, no. 9, pp. 702–705, 2003.
[25]  J. B. Marino, R. E. Banchs, J. M. Crego et al., “N-gram-based machine translation,” Computational Linguistics, vol. 32, no. 4, pp. 527–549, 2006.
[26]  T. Sagara and M. Kitsuregawa, “Cleaning shop names by its location information for shop information retrieval from the web,” Journal of Institute of Electronics, Information, and Communication Engineers, vol. 91, no. 3, pp. 531–537, 2008.
[27]  Chasen legacy—an old morphological analyzer, http://chasen-legacy.sourceforge.jp/.
