

Audio Visual Signal Processing

Research Group

Main organisation: Research Council
Lifecycle: 22 Dec 2005 → Today
Organisation profile:

Audio-visual scene analysis is becoming increasingly important in many areas. A typical application is teleconferencing, where it enables enhanced multimodal communication or electronic steering of the camera towards the active speaker without a human operator. In surveillance systems, audio-visual perception is of great interest for intruder detection, behaviour and event analysis, and more. A further example concerns driver observation: acoustic and visual data can be used to assess the driver's vigilance and help avoid accidents due to fatigue or inattention.

Within this context, the AVSP research at ETRO is pioneering future directions in audio-visual systems. The AVSP research cluster explores and capitalizes on the correlation between speech and video data, combining efficient numerical methods from computational engineering with problems of information processing. It has adopted an integrated approach to speech and vision processing, supported by in-house expertise in Speech Processing (ETRO-DSSP) and Computer Vision (ETRO-IRIS) and by a long-term collaboration with the Dept. of Computer Information & Engineering (CIE) - School of Computer Science, North Western Polytechnic University (NWPU), Xi'an, China.

The research work includes:
* Audio-Visual Interactions in Multimodal Communications
* Audio-Visual Emotion Analysis & Synthesis
* Source localization and tracking in audio-visual scenes
* Audio-Visual Synchrony
* Audio-Visual Speaker Identification

The applications being addressed by this work include:
* Multimedia applications - understanding, indexing and managing multimedia content, as well as detecting scene transitions and events.
* Ubiquitous computing applications - multimodal audio-visual processing for advanced context awareness in smart spaces (smart rooms, group meetings, intelligent environments)
* Combined visual-acoustic processing for humanoid robots
* Audio-visual surveillance
* Audio-visual driver vigilance monitoring

Description

* Audio-Visual Interactions in Multimodal Communications

Human perception of speech is bimodal: acoustic speech perception can be affected by visual cues from lip movement. Because of this bimodality, audio-visual interaction is an important design factor for multimodal communication systems such as video telephony and video conferencing. The key issue in bimodal speech analysis and synthesis is establishing the mapping between acoustic and visual parameters, and we are developing approaches for this. Our current work addresses three inter-related problems: (i) the synthesis of articulatory parameters for an MPEG-4 facial animation model, (ii) robust audio-visual speech recognition, and (iii) photorealistic audio-visual speech synthesis. Fusing these areas will impact speech- and text-driven facial animation of avatars as well as audio-visual speech recognition. Specifically, the proposed technique is intended not only to reproduce lip movements in accordance with the speech content, but also to reproduce the complex set of facial movements and expressions related to the emotional content of the speech. Such a system can, for example, provide visual information during a telephone conversation that is beneficial to people with impaired hearing; it can also benefit normal-hearing people when the audio signal is degraded.
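Such an acoustic-to-visual mapping can be illustrated with a minimal sketch on synthetic data (this is not the group's actual model; the features, dimensions, and linear form here are assumptions for illustration): a least-squares regression from per-frame acoustic feature vectors to visual articulatory parameters.

```python
import numpy as np

# Illustrative sketch on synthetic data: learn a linear mapping from
# acoustic feature frames (e.g. MFCC-like vectors) to visual
# articulatory parameters (e.g. lip opening, lip width).
rng = np.random.default_rng(0)

n_frames, n_audio, n_visual = 500, 13, 2
A = rng.normal(size=(n_frames, n_audio))        # acoustic features per frame
W_true = rng.normal(size=(n_audio, n_visual))   # hidden ground-truth mapping
V = A @ W_true + 0.01 * rng.normal(size=(n_frames, n_visual))  # visual params

# Fit the audio-to-visual mapping with ordinary least squares (plus bias).
A1 = np.hstack([A, np.ones((n_frames, 1))])
W, *_ = np.linalg.lstsq(A1, V, rcond=None)

# Predict visual parameters for a new acoustic frame.
a_new = rng.normal(size=n_audio)
v_pred = np.hstack([a_new, 1.0]) @ W
print(v_pred.shape)  # (2,)
```

In practice the acoustic-to-visual relationship is non-linear and context-dependent, so real systems replace the linear regressor with statistical or neural models; the sketch only shows the structure of the mapping problem.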
* Audio-Visual Emotion Analysis & Synthesis

Recent research on multimodal interfaces has focused on natural, adaptive and intelligent interfaces that enable machines to communicate with humans in ways much closer to how humans communicate among themselves. To augment natural interactivity between humans and the physical or virtual environment, research is now directed towards autonomous interfaces capable of learning and adapting (responding) to user emotions, intentions and behaviour. Methods for intention and emotion recognition are still at a very early stage of research and development. We are developing audio-visual methods that combine information obtained from voice (mainly based on prosodic features) and from video (mainly based on facial expressions) for emotion analysis and synthesis.

* Source Localization and Tracking in Audio-Visual Scenes

Human listeners use binaural cues (e.g., interaural time differences) to localise sound sources in space. To exploit such information, we propose the joint use of microphone arrays and video to localize and track humans. A major innovation in this area is the use of novel audio-visual localization methods, in which audio and video processing are combined to achieve reliable source localization. In this way it becomes possible to continuously identify the focus of, e.g., a meeting discussion and to detect changes in that focus. A variety of modes that could be used for tracking (e.g., face localization, sound localization, motion evaluation) will be combined in a stochastic framework to guarantee a robust tracking algorithm.

* Audio-Visual Synchrony

Human beings have a special ability to understand what is happening in an audio-visual scene and are particularly efficient at assessing audio-visual synchrony.
Visual observation allows objects to be tracked from one location to another and allows their appearance and activities to be characterized over time. Audio observation complements visual observation in many ways. We are investigating methods to detect discrete audio and visual events, determine anomalous audio and visual events, cluster audio and video events into meaningful classes, and determine the salient temporal chains of these events that correspond to particular activities in the environment. Another aspect of this work is the combination and selection of a variety of multimodal features for these tasks.

* Audio-Visual Speaker Identification

Humans identify speakers based on a variety of attributes, including acoustic cues, visual appearance cues and behavioural characteristics (such as characteristic gestures and lip movements). Speaker identification is an important technology for a variety of applications including security, meetings and, more recently, indexing for search and retrieval of digitized multimedia content (for instance in the MPEG-7 standard). The accuracy of audio-based speaker identification under acoustically degraded conditions (such as background noise) still needs improvement. We have begun to investigate the combination of audio-based processing with visual processing for speaker identification.
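The time-difference cue underlying microphone-array localization can be sketched with GCC-PHAT, a standard time-difference-of-arrival (TDOA) estimator (a minimal illustration on synthetic signals; the group's actual audio-visual localization pipeline is not specified here and would fuse this cue with visual ones):

```python
import numpy as np

def gcc_phat_delay(x, y, fs):
    """Estimate the delay of x relative to y (in seconds) via GCC-PHAT."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)       # generalized cross-correlation
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay = np.argmax(np.abs(cc)) - max_shift
    return delay / fs

# Synthetic two-microphone scenario: mic 2 receives the source
# 40 samples later than mic 1.
fs = 16000
rng = np.random.default_rng(1)
src = rng.normal(size=fs)           # 1 s of noise standing in for speech
true_delay = 40
mic1 = src
mic2 = np.concatenate((np.zeros(true_delay), src[:-true_delay]))

print(gcc_phat_delay(mic2, mic1, fs) * fs)  # ≈ 40.0 samples
```

Given the microphone spacing, the estimated delay maps to a direction of arrival; in an audio-visual tracker this acoustic estimate would be one of the measurements fused with face and motion cues in the stochastic framework described above.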

Keywords: Speech intelligibility, Sign language, Emotion detection, Talking heads, Audiovisual speech recognition, Behaviour analysis, Audiovisual speech synthesis
Disciplines: Signal processing, Multimedia processing