
Multimodal Semantics Extraction from User-Generated Videos

DOI: 10.1155/2012/292064


User-generated video content has grown so fast that it now outpaces professional content creation. In this work we develop methods that analyze contextual information from multiple user-generated videos in order to obtain semantic information about the public happenings (e.g., sport and live music events) recorded in these videos. One of the key contributions of this work is the joint use of different data modalities, including data captured by auxiliary sensors during each user's video recording. In particular, we analyze GPS data, magnetometer data, accelerometer data, and video- and audio-content data. We use these data modalities to infer information about the event being recorded, in terms of layout (e.g., stadium), genre, indoor versus outdoor scene, and the main area of interest of the event. Furthermore, we propose a method that automatically identifies the optimal set of cameras to be used in a multicamera video production. Finally, we detect the camera users who fall within the field of view of other cameras recording the same public happening. We show that the proposed multimodal analysis methods perform well on various recordings obtained at real sport events and live music performances.

1. Introduction

The widespread use of camera-enabled mobile devices has allowed people to record anything that they find interesting in their daily life. In particular, mobile phones are among the most popular means of recording videos because, thanks to their portability, they are available at any time of the day. The things that people consider worth capturing are very diverse; examples include funny moments with friends or family, music shows, and celebrations such as weddings. In some situations, a multitude of people happen to be recording the same scene at the same time. These situations are usually public happenings such as sport events or live music performances.
In this paper, we target this kind of scenario, in which videos of the same event are recorded by multiple people for their own personal archives using their handheld devices (we use the terms happening and event interchangeably). As also stated in [1, 2], user-generated videos are seldom watched afterwards, either by the people who shot them or by others. One of the main reasons is the lack of effective tools for automatically organizing video archives so that a user can easily retrieve a particular video. For example, it would be beneficial to automatically classify


[1]  R. Oami, A. B. Benitez, S. F. Chang, and N. Dimitrova, “Understanding and Modeling User Interests in Consumer Videos,” in IEEE International Conference on Multimedia and Expo, pp. 1475–1478, Taipei, Taiwan, 2004.
[2]  M. Sugano, T. Yamada, S. Sakazawa, and S. Hangai, “Genre Classification Method for Home Videos,” in IEEE International Workshop on Signal Processing, pp. 1–5, Rio de Janeiro, Brazil, 2009.
[3]  D. Brezeale and D. J. Cook, “Automatic video classification: a survey of the literature,” IEEE Transactions on Systems, Man and Cybernetics C, vol. 38, no. 3, pp. 416–430, 2008.
[4]  N. Serrano, A. Savakis, and J. Luo, “A computationally efficient approach to indoor/outdoor scene classification,” in 16th IEEE International Conference on Pattern Recognition, pp. 146–149, Quebec City, Canada, 2002.
[5]  U. Lipowezky and I. Vol, “Indoor-outdoor detector for mobile phone cameras using gentle boosting,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 31–38, San Francisco, Calif, USA, 2010.
[6]  M. Szummer and R. W. Picard, “Indoor-outdoor image classification,” in IEEE International Workshop on Content-Based Access of Image and Video Database, pp. 42–51, Bombay, India, 1998.
[7]  A. Payne and S. Singh, “Indoor vs. outdoor scene classification in digital photographs,” Pattern Recognition, vol. 38, no. 10, pp. 1533–1545, 2005.
[8]  N. Zhang and L. Guan, “An efficient framework on large-scale video genre classification,” in IEEE International Workshop on Multimedia Signal Processing, pp. 481–486, Saint-Malo, France, 2010.
[9]  X. Yuan, W. Lai, T. Mei, X. S. Hua, X. Q. Wu, and S. Li, “Automatic video genre categorization using hierarchical SVM,” in IEEE International Conference on Image Processing, pp. 2905–2908, Atlanta, Ga, USA, 2006.
[10]  J. Xinghao, S. Tanfeng, and C. Bin, “A novel video content classification algorithm based on combined visual features model,” in 2nd International Congress on Image and Signal Processing (CISP '09), October 2009.
[11]  M. Montagnuolo and A. Messina, “Multimodal genre analysis applied to digital television archives,” in 19th International Conference on Database and Expert Systems Applications (DEXA '08), pp. 130–134, Turin, Italy, September 2008.
[12]  A. Feryanto and I. Supriana, “Location recognition using detected objects in an image,” in International Conference on Electrical Engineering and Informatics, pp. 1–4, Bandung, Indonesia, 2011.
[13]  G. Schroth, R. Huitl, D. Chen, M. Abu-Alqumsan, A. Al-Nuaimi, and E. Steinbach, “Mobile visual location recognition,” IEEE Signal Processing Magazine, vol. 28, no. 4, pp. 77–89, 2011.
[14]  K. Tieu, G. Dalley, and W. E. L. Grimson, “Inference of non-overlapping camera network topology by measuring statistical dependence,” in 10th IEEE International Conference on Computer Vision, vol. 2, pp. 1842–1849, Beijing, China, 2005.
[15]  T. Thummanuntawat, W. Kumwilaisak, and J. Chinrungrueng, “Automatic region of interest detection in multi-view video,” in International Conference on Electrical Engineering/Electronics Computer Telecommunications and Information Technology (ECTI-CON '10), pp. 889–893, Chiang Mai, Thailand, May 2010.
[16]  J. B. Hayet, T. Mathes, J. Czyz, J. Piater, J. Verly, and B. Macq, “A modular multi-camera framework for team sports tracking,” in IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS '05), pp. 493–498, Como, Italy, September 2005.
[17]  A. Carlier, V. Charvillat, W. T. Ooi, R. Grigoras, and G. Morin, “Crowdsourced automatic zoom and scroll for video retargeting,” in 18th ACM International Conference on Multimedia (MM '10), pp. 201–210, Firenze, Italy, October 2010.
[18]  P. Doubek, I. Geys, T. Svoboda, and L. Van Gool, “Cinematographic rules applied to a camera network,” in 5th Workshop on Omnidirectional Vision, Camera Networks and Non-Classical Cameras, pp. 17–29, Prague, Czech Republic, 2004.
[19]  F. Chen and C. DeVleeschouwer, “Personalized production of basketball videos from multi-sensored data under limited display resolution,” Elsevier Journal of Computer Vision and Image Understanding, vol. 114, no. 6, pp. 667–680, 2010.
[20]  T. Ojala, M. Pietikainen, and D. Harwood, “Performance evaluation of texture measures with classification based on Kullback discrimination of distributions,” in 12th IAPR International Conference on Pattern Recognition, vol. 1, pp. 582–585, Jerusalem, Palestine, 1994.
[21]  MPEG-7, “ISO/IEC 15938, Multimedia Content Description Interface.”
[22]  D. G. Lowe, “Object recognition from local scale-invariant features,” in IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157, Corfu, Greece, 1999.
[23]  T. Lahti, On low complexity techniques for automatic speech recognition and automatic audio content analysis, Doctoral thesis, Tampere University of Technology, 2008.
[24]  F. Cricri, K. Dabov, I. D. D. Curcio, S. Mate, and M. Gabbouj, “Multimodal Event Detection in User Generated Videos,” in IEEE International Symposium on Multimedia, pp. 263–270, Dana Point, Calif, USA, December 2011.
[25]  V. Kobla, D. DeMenthon, and D. Doermann, “Identification of sports videos using replays, text, and camera motion features,” in Storage and Retrieval for Media Databases, vol. 3972 of Proceedings of SPIE, pp. 332–343, 2000.
[26]  C. G. M. Snoek, M. Worring, and A. W. M. Smeulders, “Early versus late fusion in semantic video analysis,” in ACM International Conference on Multimedia, pp. 399–402, Singapore, 2005.
[27]  V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[28]  B. Foss, Filmmaking: Narrative and Structural Techniques, Silman James Press, Los Angeles, Calif, USA.
[29]  Y. G. Jiang, G. Ye, S. F. Chang, D. Ellis, and A. C. Loui, “Consumer video understanding: A benchmark database and an evaluation of human and machine performance,” in 1st ACM International Conference on Multimedia Retrieval (ICMR '11), Trento, Italy, April 2011.
[30]  M. Cha, H. Kwak, P. Rodriguez, Y. Y. Ahn, and S. Moon, “Analyzing the video popularity characteristics of large-scale user generated content systems,” IEEE/ACM Transactions on Networking, vol. 17, no. 5, pp. 1357–1370, 2009.
[31]  Y. Odaka, S. Takano, Y. In, M. Higuchi, and H. Murakami, “The evaluation of the error characteristics of multiple GPS terminals,” in Recent Researches in Circuits, Systems, Control and Signals, pp. 13–21, 2011.
[32]  T. Menard, J. Miller, M. Nowak, and D. Norris, “Comparing the GPS capabilities of the Samsung Galaxy S, Motorola Droid X, and the Apple iPhone for Vehicle Tracking Using FreeSim_Mobile,” in 14th IEEE International Conference on Intelligent Transportation Systems, pp. 985–990, Washington, DC, USA, 2011.

