Most of the information in the digital world is accessible only to the few who can read or understand a particular language. Speech corpus acquisition is an essential part of all spoken language technology systems. The quality and volume of the speech data in a corpus directly affect the accuracy of the system. There is, however, considerable scope to develop speech technology systems for Hindi, a language spoken primarily in India. To achieve such an ambitious goal, the collection of standard databases is a prerequisite. This paper summarizes the Hindi corpora and lexical resources being developed by various organizations across the country.