We describe the design of a system consisting of several state-of-the-art real-time audio and video processing components that enable multimodal stream manipulation (e.g., automatic online editing for multiparty videoconferencing applications) in open, unconstrained environments. The underlying algorithms are designed to allow multiple people to enter, interact, and leave the observable scene with no constraints. They comprise continuous localisation of audio objects and its application to spatial audio object coding; detection and tracking of faces; estimation of head poses and visual focus of attention; detection and localisation of verbal and paralinguistic events; and the association and fusion of these different events. Combined, they represent multimodal streams with audio objects and semantic video objects and provide semantic information for stream manipulation systems (such as a virtual director). Various experiments have been performed to evaluate the performance of the system. The results demonstrate the effectiveness of the proposed design and of the individual algorithms, as well as the benefit of fusing different modalities in this scenario.

1. Introduction

The Together Anywhere, Together Anytime (TA2) project aims at understanding how technology can help to nurture family-to-family relationships and overcome distance and time barriers. This is something current technology does not address well. Modern media and communications are designed for individuals: phones, computers, and electronic devices tend to be user centric and provide individual experiences. The technological goal of TA2 is to build a system enabling natural remote interaction by exploiting sets of individual state-of-the-art “low-level-processing” audio-visual algorithms combined at a higher level.
This paper focuses on the description and evaluation of these algorithms and their combination, to be eventually used in conjunction with higher-level stream manipulation and interpretation systems, for example, an orchestrated videoconferencing system that automatically selects relevant portions of the data (i.e., using a so-called virtual director). The aim of the proposed system is to separate semantic objects in the low-level signals (such as voices and faces) in order to determine their number and location and, finally, to determine, for instance, who speaks and when. The underlying algorithms comprise continuous localisation of audio objects and its application to spatial audio object coding; detection and tracking of faces; estimation of head poses and visual focus
M. Falelakis, R. Kaiser, W. Weiss, and M. F. Ursu, “Reasoning for video-mediated group communication,” in Proceedings of the 12th IEEE International Conference on Multimedia and Expo (ICME '11), Barcelona, Spain, July 2011.
J. Engdegård, B. Resch, C. Falch et al., “Spatial audio object coding (SAOC)—the upcoming MPEG standard on parametric object based audio coding,” in Proceedings of the 124th AES Convention, Amsterdam, The Netherlands, 2008.
Z. Khan, T. Balch, and F. Dellaert, “MCMC-based particle filtering for tracking a variable number of interacting targets,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1805–1819, 2005.
J. Pardo, X. Anguera, and C. Wooters, “Speaker diarization for multi-microphone meetings using only between-channel differences,” in Proceedings of the Machine Learning for Multimodal Interaction (MLMI '06), Bethesda, Md, USA, 2006.
D. Korchagin, “Audio spatio-temporal fingerprints for cloudless real-time hands-free diarization on mobile devices,” in Proceedings of the 3rd Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA '11), pp. 25–30, Edinburgh, UK, June 2011.
M. Slaney and M. Covell, “Facesync: a linear operator for measuring synchronization of video facial images and audio tracks,” in Proceedings of the Neural Information Processing Systems, pp. 814–820, 2000.
D. Korchagin, P. Motlicek, S. Duffner, and H. Bourlard, “Just-in-time multimodal association and fusion from home entertainment,” in Proceedings of the 12th IEEE International Conference on Multimedia and Expo (ICME '11), Barcelona, Spain, July 2011.
H. Nock, G. Iyengar, and C. Neti, “Speaker localisation using audio-visual synchrony: an empirical study,” in Proceedings of the 2nd International Conference on Image and Video Retrieval (CIVR '03), Urbana-Champaign, Ill, USA, 2003.
K. Otsuka, S. Araki, K. Ishizuka, M. Fujimoto, M. Heinrich, and J. Yamato, “A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization,” in Proceedings of the 10th International Conference on Multimodal Interfaces (ICMI '08), pp. 257–264, Chania, Greece, October 2008.
S. Rickard and Ö. Yilmaz, “On the approximate W-disjoint orthogonality of speech,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), pp. I/529–I/532, May 2002.
O. Thiergart, G. Del Galdo, M. Prus, and F. Kuech, “Three-dimensional sound field analysis with directional audio coding based on signal adaptive parameter estimators,” in Proceedings of the AES 40th International Conference on Spatial Audio: Sense the Sound of Space, Tokyo, Japan, October 2010.
O. Thiergart, R. Schultz-Amling, G. Del Galdo, D. Mahne, and F. Kuech, “Localization of sound sources in reverberant environments based on directional audio coding parameters,” in Proceedings of the 127th AES Convention, New York, NY, USA, 2009.
J. Herre, C. Falch, D. Mahne, G. Del Galdo, M. Kallinger, and O. Thiergart, “Interactive teleconferencing combining spatial audio object coding and DirAC technology,” in Proceedings of the 128th AES Convention, London, UK, 2010.
P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 511–518, December 2001.
C. Scheffler and J.-M. Odobez, “Joint adaptive colour modelling and skin, hair and clothing segmentation using coherent probabilistic index maps,” in Proceedings of the British Machine Vision Conference, 2011.
E. Ricci and J.-M. Odobez, “Learning large margin likelihoods for realtime head pose tracking,” in Proceedings of the IEEE International Conference on Image Processing (ICIP '09), pp. 2593–2596, November 2009.
D. Sodoyer, B. Rivet, L. Girin, J.-L. Schwartz, and C. Jutten, “An analysis of visual speech information applied to voice activity detection,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), pp. I601–I604, May 2006.
S. Siatras, N. Nikolaidis, M. Krinidis, and I. Pitas, “Visual lip activity detection and speaker detection using mouth region intensities,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 1, pp. 133–137, 2009.
H. Hung and S. O. Ba, “Speech/non-speech detection in meetings from automatically extracted low resolution visual features,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), Dallas, Tex, USA, 2010.
P. N. Garner, “Silence models in weighted finite-state transducers,” in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1817–1820, Brisbane, Australia, September 2008.
J. Dines, J. Vepa, and T. Hain, “The segmentation of multi-channel meeting recordings for automatic speech recognition,” in Proceedings of the INTERSPEECH and 9th International Conference on Spoken Language Processing (INTERSPEECH ICSLP '06), pp. 1213–1216, September 2006.
P. N. Garner, J. Dines, T. Hain et al., “Real-time ASR from meetings,” in Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH '09), pp. 2119–2122, Brighton, UK, September 2009.
F. Kuech, M. Kallinger, M. Schmidt, C. Faller, and A. Favrot, “Acoustic echo suppression based on separation of stationary and non-stationary echo components,” in Proceedings of the Acoustic Echo and Noise Control, Seattle, Wash, USA, 2008.
G. Lathoud and I. A. McCowan, “A sector-based approach for localization of multiple speakers with microphone arrays,” in Proceedings of the Workshop on Statistical and Perceptual Audio Processing (SAPA '04), Jeju, Republic of Korea, 2004.
D. Vijayasenan, F. Valente, and H. Bourlard, “An information theoretic approach to speaker diarization of meeting data,” IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 7, pp. 1382–1393, 2009.
A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET curve in assessment of detection task performance,” in Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '97), vol. 4, pp. 1895–1898, Rhodes, Greece, 1997.