%0 Journal Article
%T Multimodal Semantics Extraction from User-Generated Videos
%A Francesco Cricri
%A Kostadin Dabov
%A Mikko J. Roininen
%A Sujeet Mate
%A Igor D. D. Curcio
%A Moncef Gabbouj
%J Advances in Multimedia
%D 2012
%I Hindawi Publishing Corporation
%R 10.1155/2012/292064
%X User-generated video content has grown tremendously, to the point of outpacing professional content creation. In this work we develop methods that analyze contextual information from multiple user-generated videos in order to obtain semantic information about the public happenings (e.g., sport and live music events) being recorded. A key contribution of this work is the joint utilization of different data modalities, including data captured by auxiliary sensors during each user's video recording. In particular, we analyze GPS data, magnetometer data, accelerometer data, and video and audio content. We use these modalities to infer information about the recorded event: its layout (e.g., stadium), genre, indoor versus outdoor setting, and its main area of interest. Furthermore, we propose a method that automatically identifies the optimal set of cameras to be used in a multicamera video production. Finally, we detect the camera users who fall within the field of view of other cameras recording the same public happening. We show that the proposed multimodal analysis methods perform well on various recordings obtained at real sport events and live music performances.
%U http://www.hindawi.com/journals/am/2012/292064/

1. Introduction

The widespread use of camera-enabled mobile devices has allowed people to record anything they find interesting in their daily life. In particular, one of the most popular means of recording video is the mobile phone, which, thanks to its portability, is available at any time of day. The things that people consider worth capturing are very diverse; examples include funny moments with friends or family, music shows, and celebrations such as weddings. There are some situations in which a multitude of people happen to be recording the same scene at the same time, usually public happenings such as sport events or live music performances. In this paper, we target this kind of scenario, in which videos of the same event are recorded by multiple people for their own personal archives using their handheld devices (we use the terms happening and event interchangeably). As also stated in [1, 2], user-generated videos are seldom watched afterwards, either by the people who shot them or by others. One of the main reasons is the lack of effective tools for automatically organizing video archives in such a way that a user can easily retrieve a particular video. For example, it would be beneficial to automatically classify
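The last contribution listed in the abstract, detecting camera users who fall within the field of view of other cameras, reduces geometrically to testing whether one user's GPS position lies inside the viewing cone defined by another camera's GPS position and its magnetometer (compass) heading. The following is a minimal Python sketch of that test under assumed inputs; the function names, the dict layout, and the 33-degree half field of view are illustrative assumptions, not the paper's implementation.

    import math

    def bearing_deg(lat1, lon1, lat2, lon2):
        # Initial great-circle bearing from point 1 to point 2,
        # in degrees clockwise from north, wrapped to [0, 360).
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dlon = math.radians(lon2 - lon1)
        y = math.sin(dlon) * math.cos(phi2)
        x = (math.cos(phi1) * math.sin(phi2)
             - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
        return math.degrees(math.atan2(y, x)) % 360.0

    def in_field_of_view(cam, target, half_fov_deg=33.0):
        # `cam` and `target` are dicts with 'lat' and 'lon' (degrees);
        # `cam` also carries 'heading', the compass heading from the
        # magnetometer, in degrees clockwise from north.
        # `half_fov_deg` is a hypothetical half horizontal field of view.
        b = bearing_deg(cam["lat"], cam["lon"], target["lat"], target["lon"])
        # Smallest angular difference between bearing and heading, in [0, 180].
        diff = abs((b - cam["heading"] + 180.0) % 360.0 - 180.0)
        return diff <= half_fov_deg

    # Example: camera A faces due east (heading 90); user B stands east of A.
    cam_a = {"lat": 61.4981, "lon": 23.7610, "heading": 90.0}
    user_b = {"lat": 61.4981, "lon": 23.7630}
    print(in_field_of_view(cam_a, user_b))  # True

In practice such a purely geometric test would have to tolerate GPS and compass noise, which is one reason the paper fuses several sensor modalities rather than relying on any single one.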