Transcript of presentation slides
CVSP group (Computer Vision, Speech Communication & Signal Processing), NTUA: http://cvsp.cs.ntua.gr
Audio-visual speech perception: the McGurk effect (McGurk & MacDonald)
Articulatory representations of speech (King et al.; Deng); Articulatory Gestures (Browman & Goldstein)
Visible Speech (Bell, 1867)
G. Papandreou, A. Katsamanis, V. Pitsikalis, and P. Maragos, "Adaptive Multimodal Fusion by Uncertainty Compensation with Application to Audio-Visual Speech Recognition," IEEE Trans. ASLP, vol. 17, no. 3, pp. 423-435, Mar. 2009.
System overview of the real-time AV-ASR prototype:
- Image acquisition: Firewire color camera, 640x480 @ 25 fps
- Face detector: AdaBoost-based, @ 5 fps; (re)initialization of the tracker
- Face tracking & feature extraction: real-time AAM fitting algorithms, GPU-accelerated processing (OpenGL implementation)
- HMM-based back-end producing the transcription
Multisensory cue integration (Knill & Richards; Ernst et al.); Maragos et al., Cross-Modal Integration, Springer 2008
Speech enhancement: Wiener and Kalman filtering
[Figure: noisy speech at SNR = 20 dB vs. SNR = 5 dB]
Statistical classification with Gaussian Mixture Models (GMMs)
[Figure: GMMs over features (y1, y2), two classes]
Hidden Markov Models (HMMs) and Viterbi decoding: per-frame state alignment
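The Viterbi decoding referred to above finds the single best HMM state path for a sequence of frames. A minimal log-domain sketch (toy model; `viterbi` and its arguments are illustrative names, not the recognizer's actual code):

```python
# Minimal Viterbi decoder sketch for an HMM in the log domain
# (illustrative only; not the system's actual models).
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """log_pi: (S,) initial log-probs, log_A: (S,S) transition log-probs,
    log_B: (T,S) per-frame emission log-likelihoods.
    Returns the most likely state sequence."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]          # best score ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (prev, cur)
        psi[t] = np.argmax(scores, axis=0)       # best predecessor per state
        delta = scores[psi[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]               # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```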
Mel-Frequency Cepstral Coefficients (MFCCs): pre-emphasis → STFT → |·| → mel-scale filterbank → log(·) → DCT. Feature-domain enhancement (e.g. SPLICE, ALGONQUIN) vs. model-domain compensation of MFCC models (VTS).
Deng, Droppo, and Acero, IEEE Trans. Speech and Audio Processing, 2005.
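The MFCC front-end listed above can be sketched as follows; the parameter values (16 kHz sampling, 23 mel bands, 13 cepstra, 0.97 pre-emphasis) are illustrative assumptions, not the system's documented configuration:

```python
# Sketch of an MFCC front-end: pre-emphasis -> STFT magnitude ->
# mel filterbank -> log -> DCT (hypothetical parameter values).
import numpy as np

def mfcc(frame, sr=16000, n_mels=23, n_ceps=13):
    # Pre-emphasis boosts high frequencies
    emph = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    # Windowed magnitude spectrum (one STFT column)
    spec = np.abs(np.fft.rfft(emph * np.hamming(len(emph))))
    # Triangular mel-scale filterbank
    def hz2mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m):
        return 700.0 * (10 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((len(spec) - 1) * 2 * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, len(spec)))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)  # rising edge
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)  # falling edge
    # Log mel energies, then DCT-II to decorrelate
    logmel = np.log(fbank @ spec + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1)) / (2 * n_mels))
    return dct @ logmel
```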
Audio-visual fusion models: Asynchronous-HMM, Coupled-HMM, Dynamic Bayesian Networks, Product-HMM, Multistream-HMM.
Experiments on the CUAVE database: 36 speakers (30 training, 6 test), 5 sequences of 10 digits each; training set: 1500 digit utterances (30x5x10); test set: 300 (6x5x10); babble noise from NOISEX; digit HMMs (8 states, 1 mixture/state) trained with HTK.
Recognition results:
- AV-W-UC vs. A-UC: 28.7%
- AV-UC vs. AV, and AV-W-UC vs. AV-W: 20%
- Product-HMM vs. Multistream-HMM: 1.2%
Acknowledgements: supported by the EU projects MUSCLE (NoE) and HIWIRE (STREP).
A. Katsamanis, G. Papandreou, and P. Maragos, "Face Active Appearance Modeling and Speech Acoustic Information to Recover Articulation," IEEE Trans. ASLP, 2009.
MOCHA database (CSTR, Univ. Edinburgh; 1 male / 1 female speaker), 460 TIMIT sentences.
Mapping y to x with a prior: Yehia, Rubin & Vatikiotis-Bateson, Speech Comm., 1998.
Feature selection via Canonical Correlation Analysis (CCA).
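The CCA mentioned above can be sketched via an SVD of the whitened cross-covariance between the two feature streams; `cca` and its small regularization constant are illustrative choices, not the authors' implementation:

```python
# Hedged sketch of CCA between two feature sets X and Y
# (SVD of the whitened cross-covariance; regularized for stability).
import numpy as np

def cca(X, Y, reg=1e-6):
    """X: (n, dx), Y: (n, dy). Returns canonical correlations and
    projection matrices Wx, Wy."""
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whitening transforms via inverse matrix square roots
    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return s, Kx @ U, Ky @ Vt.T
```

For perfectly linearly related streams the leading canonical correlation approaches 1.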
Hiroya & Honda, IEEE Trans. Speech and Audio Processing, 2004: Viterbi decoding over a Markov chain. HMM / MS-HMM-based inversion.
Visemes (visual counterparts of phonemes).
Viseme Classes for Inversion
Katsamanis et al., EUSIPCO 2008.
CVSP group: X-ray articulatory data.
Acknowledgements: supported by the EU project ASPI (FET).
Good morning everybody, I am G. Papandreou. I will present our work on Adaptive Multimodal Fusion, which has been inspired by applications in AV-ASR and ASPI. This is joint work with A. Katsamanis, V. Pitsikalis, and P. Maragos at ICCS-NTUA.
Wikipedia text: Visible Speech is the name of the writing system used by Alexander Melville Bell, who was known internationally as a teacher of speech and proper elocution and an author of books on the subject. The system is composed of symbols that show the position and movement of the throat, tongue, and lips as they produce the sounds of language, and it is a type of phonetic notation. The system was used to aid the deaf in learning to speak. Bell's son, Alexander Graham Bell, learned the symbols, assisted his father in giving public demonstrations of the system, and mastered it to the point that he later improved upon his father's work. Eventually, Alexander Graham Bell became a powerful advocate of Visible Speech and oralism in the United States. The money he earned from his patent of the telephone helped him to pursue this mission.

G. Papandreou, A. Katsamanis, V. Pitsikalis, and P. Maragos, "Adaptive Multimodal Fusion by Uncertainty Compensation with Application to Audio-Visual Speech Recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 3, pp. 423-435, Mar. 2009.
Overview of a Real-Time AV-ASR Prototype
Image acquisition is done with a cheap off-the-shelf camera. A Firewire interface was selected to support the high transfer rates of high-resolution video. USB 2.0 would also be adequate, but USB 1.0 is insufficient.
The face detector initializes the face tracker at the first frame and whenever tracking fails. It runs near real-time and is based on the popular AdaBoost algorithm; the implementation in Intel's OpenCV is used.
The face tracking module is based on the Active Appearance Modeling (AAM) technique. The AAM fitting algorithms run in real time, thanks to the GPU-based module. Visual feature extraction (the parameters of the fitted AAM) is performed.
AAM fitting requires repeated texture mapping. This is done efficiently on the Graphics Processing Unit (GPU); a simple modern graphics card suffices (an ATI Radeon 9700 was used in the prototype). The texture-mapping module is written in OpenGL and is thus quite generic.
Visual features are fed into the HMM-based HTK recognition engine. Digit models have already been trained on the CUAVE database. The recognizer emits digit recognition results.
[Top right] Photograph of the prototype in action.
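The control flow just described, detect → (re)initialize → track → recognize, might be organized roughly as follows; all class and method names are hypothetical placeholders, not the prototype's actual modules:

```python
# Illustrative control-flow skeleton of the AV-ASR processing loop
# (hypothetical interfaces; the real system uses OpenCV's AdaBoost
# detector, a GPU AAM fitter, and an HTK back-end).
def run_avasr(camera, detector, tracker, recognizer):
    face = None
    for frame in camera:                    # Firewire camera, 25 fps
        if face is None:
            face = detector.detect(frame)   # AdaBoost face detector (~5 fps)
            if face is None:
                continue                    # no face yet: keep scanning
            tracker.initialize(face)
        ok, features = tracker.fit(frame)   # GPU-accelerated AAM fitting
        if not ok:
            face = None                     # tracking failed: re-detect
            continue
        recognizer.push(features)           # HMM back-end consumes AAM params
    return recognizer.result()              # recognized digit string
```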
In these applications we have multiple cues available, and our systems can perform better if we properly fuse all the available information sources.
Multiple cues provide complementary information → more informed decisions. Different modalities are affected dissimilarly by noise → more robust performance.
Biometric applications particularly lend themselves to multimodal approaches.
I NEED TO DEVELOP A (PICTORIAL) ANALOGY OF MULTIPLE CUES AS DIFFERENT VIEWPOINTS OF REALITY. However, effective cue integration is non-trivial.
The key issue is that we can actually measure our features only within limited precision. Thus, in the proposed scheme, we model the feature uncertainty explicitly.
In this way our scheme adapts both in time and across model classes. Next, one of the most important contributions, as a by-product, is the generalization and explanation of the conventional stream-weighting technique. Further, our scheme is applicable in both audio-only and audio-visual cases. For the AV case, where there is an asynchrony issue, we can accommodate it by integrating our method with Product-HMM models. Finally, overall, the proposed scheme is probabilistically rigorous and simple to implement.

Our approach to the problem builds on the following fundamental fact: we explicitly model feature measurement uncertainty in a probabilistic framework. We argue that the following setup is much more realistic: here we explicitly acknowledge the limited precision of our measurements and characterize it quantitatively, by accompanying our features with confidence intervals / errorbars. (Pictorial notation of graphical models.)

We can more easily grasp what is going on if we make a Gaussian noise assumption. Then one can compute the integral analytically, and the class posterior is given by the following formula; this implies that we can effectively compensate for feature uncertainty by increasing the covariance matrix of the model by the covariance matrix of the measurement noise. This has a regularizing effect, which is particularly noticeable at the more peaky components.
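The compensation rule just described, inflating each Gaussian component's covariance by the measurement-noise covariance before evaluating the likelihood, can be sketched as follows (a minimal sketch with illustrative function names, not the paper's actual code):

```python
# Sketch of uncertainty compensation for a class-conditional GMM:
# under Gaussian measurement noise with covariance Sigma_e, each
# component covariance becomes cov_m + Sigma_e.
import numpy as np

def log_gauss(x, mu, cov):
    """Log-density of a multivariate Gaussian."""
    d = x - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    return -0.5 * (logdet + d @ np.linalg.solve(cov, d))

def gmm_loglik(x, Sigma_e, weights, means, covs):
    """GMM log-likelihood with noise-inflated component covariances
    (log-sum-exp over components for numerical stability)."""
    logps = [np.log(w) + log_gauss(x, mu, cov + Sigma_e)
             for w, mu, cov in zip(weights, means, covs)]
    m = max(logps)
    return m + np.log(sum(np.exp(lp - m) for lp in logps))
```

The inflation flattens peaky components, which is exactly the regularizing effect noted above.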
We are far more interested, though, in the effect of measurement-noise compensation in the multimodal case. We show next an example of classification using two information streams, x- (video) and y- (audio). We have two Gaussian mixtures centered at () and (), and we assume that they have the same covariance matrix. Then, in the case that we can measure both x- and y-features very accurately, the classification decision boundary is the black line, which intersects the axes at a 45° angle. Now, let's assume that we have increasing acoustic noise, so the precision of the y-feature measurements decreases. This is demonstrated by a corresponding effective increase of the model covariance matrices in the direction of the y-axis. Then the classification decision boundary tends to align with the y-feature axis; this means that we tend to discount the importance of the y-feature and concentrate more on the x-axis. In the limit of very high variance in the y-axis, the y-stream is effectively ignored completely, as desired. It is interesting to explore connections of our approach with conventional stream-weighting schemes.
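This discounting behaviour can be checked numerically. For unit-variance Gaussian classes with per-axis means at +1 and -1, the per-axis contribution to the class log-likelihood ratio is 2z/(model variance + noise variance), so the y-term vanishes as the y-noise grows. A toy sketch (means and variances chosen only for illustration):

```python
# Toy illustration of stream discounting under uncertainty compensation:
# two classes with unit-covariance Gaussians at (1,1) and (-1,-1).
# The y- (audio) stream's contribution shrinks as its noise grows.
def log_ratio(x, y, noise_var_y):
    """Log-likelihood ratio of class 1 vs class 2; each axis
    contributes 2*z / (model_var + noise_var)."""
    return 2 * x / 1.0 + 2 * y / (1.0 + noise_var_y)
```

With a clean y-measurement a strongly negative y pushes the decision to class 2; with very noisy y the same point is classified from its x-coordinate alone.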
This is an appealing result, since our framework unveils the probabilistic underpinnings of stream-weight-based formulations, and it provides a rigorous mechanism to select, for each new measurement, an uncertainty estimate (m_e, sigma_e); all involved implicit stream weights are fully adaptive with respect to both class label c and mixture component m. (Will change the formulas for the standard case.) Although our discussion so far has mainly concentrated on GMMs, the ideas can be easily transferred to HMMs. (CHANGE & MERGE: classificationResults_audiovisual_babble5_MFCC_D_A_Z.res_WithVars)
The sequence of computations for the uncertainty ellipses is as follows: fitting of the AAM gives the estimated