%0 Journal Article
%T Real-Time Audio-Visual Analysis for Multiperson Videoconferencing
%A Petr Motlicek
%A Stefan Duffner
%A Danil Korchagin
%A Hervé Bourlard
%A Carl Scheffler
%A Jean-Marc Odobez
%A Giovanni Del Galdo
%A Markus Kallinger
%A Oliver Thiergart
%J Advances in Multimedia
%D 2013
%I Hindawi Publishing Corporation
%R 10.1155/2013/175745
%X We describe the design of a system consisting of several state-of-the-art real-time audio and video processing components enabling multimodal stream manipulation (e.g., automatic online editing for multiparty videoconferencing applications) in open, unconstrained environments. The underlying algorithms are designed to allow multiple people to enter, interact, and leave the observable scene with no constraints. They comprise continuous localisation of audio objects and its application to spatial audio object coding, detection and tracking of faces, estimation of head poses and visual focus of attention, detection and localisation of verbal and paralinguistic events, and the association and fusion of these different events. Combined, they represent the multimodal streams in terms of audio objects and semantic video objects and provide semantic information for stream manipulation systems (such as a virtual director). Various experiments have been performed to evaluate the system's performance. The obtained results demonstrate the effectiveness of the proposed design, the various algorithms, and the benefit of fusing different modalities in this scenario.
%U http://www.hindawi.com/journals/am/2013/175745/

1. Introduction
The Together Anywhere, Together Anytime (TA2) project aims at understanding how technology can help nurture family-to-family relationships and overcome distance and time barriers. This is something that current technology does not address well. Modern media and communications are designed for individuals, as phones, computers, and electronic devices tend to be user-centric and provide individual experiences. The technological goal of TA2 is to build a system enabling natural remote interaction by exploiting a set of individual state-of-the-art "low-level-processing" audio-visual algorithms combined at a higher level. This paper focuses on the description and evaluation of these algorithms and their combination, to be eventually used in conjunction with higher-level stream manipulation and interpretation systems, for example, an orchestrated videoconferencing system [1] that automatically selects relevant portions of the data (i.e., using a so-called virtual director). The aim of the proposed system is to separate semantic objects (such as voices and faces) in the low-level signals, determine their number and location, and, finally, determine, for instance, who speaks and when. The underlying algorithms comprise continuous localisation of audio objects and its application to spatial audio object coding [2], detection and tracking of faces, estimation of head poses and visual focus of attention ...
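
As a rough illustration of the kind of audio-visual association described above (matching a localised voice to a tracked face to decide who speaks and when), the following Python sketch pairs an audio event with the nearest face track by azimuth. The data structures, function name, and angular gating threshold are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: associate a localised audio event with a tracked face
# by angular proximity, to label "who speaks and when". Not the paper's
# algorithm; the structures and the 15-degree threshold are assumptions.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AudioEvent:
    t_start: float      # seconds
    t_end: float        # seconds
    azimuth_deg: float  # estimated direction of arrival

@dataclass
class FaceTrack:
    track_id: int
    azimuth_deg: float  # face position mapped into the audio coordinate frame

def associate_speaker(event: AudioEvent,
                      faces: List[FaceTrack],
                      max_angle_deg: float = 15.0) -> Optional[int]:
    """Return the track_id of the face closest in azimuth to the audio event,
    or None if no face lies within the (assumed) gating threshold."""
    best_id, best_dist = None, max_angle_deg
    for face in faces:
        # angular distance with wrap-around at +/-180 degrees
        dist = abs((event.azimuth_deg - face.azimuth_deg + 180.0) % 360.0 - 180.0)
        if dist <= best_dist:
            best_id, best_dist = face.track_id, dist
    return best_id

if __name__ == "__main__":
    faces = [FaceTrack(0, -30.0), FaceTrack(1, 25.0)]
    event = AudioEvent(t_start=3.2, t_end=4.1, azimuth_deg=22.0)
    print(associate_speaker(event, faces))  # -> 1
```

In a real system the association would also weigh temporal overlap with detected lip or head motion and the confidence of each modality, which is where the fusion of events mentioned in the abstract comes in.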