



Feature Extraction from Speech and other kinds of Audio

Rahil Mahdian, 13.07.2015

Feature Extraction (Jakobson)
1. Total energy
2. Spectral Center of Gravity (SCG)
3. Duration
4. Low, medium and high frequency energy
5. Formant transitions
6. Silence detection
7. Voicing detection
8. Rate of change of energy in various frequency bands
9. Rate of change of SCG
10. Most prominent peak frequency
11. Rate of change of the most prominent peak frequency
12. Zero-crossing rate

Time domain features
ZCR (zero-crossing rate): [formula not preserved]

Features in the Time Domain: Short-time Energy
Definition: $E^{(n)} = \sum_{m=0}^{M-1} f_{n+m}^2$
Example: [figure not preserved]. From: Schukat-Talamazzini

LPC features
[figure not preserved]

ASR Speech Features
• Learn about the most established feature extraction from speech
• Mel Frequency Cepstral Coefficients: MFCC

Pre-emphasis
The source signal for voiced sounds has a slope of -6 dB/octave.
[Figure: energy (dB) over frequency, 0 to 4 kHz]
We want to model only the resonant energies, not the source, but LPC will model both source and resonances. If we pre-emphasize the signal for voiced sounds, we flatten it in the spectral domain, and the source of speech more closely approximates impulses. LPC can then model only the resonances (the important information) rather than resonances + source.

Pre-emphasis
• Correct for filtering of the lips
• Iterative scheme: $f'_n = f_n - \alpha f_{n-1}$
• Typical value: $\alpha = 0.95$

Example: applying a rectangular window to a speech signal
• Frame shift: typically 10 ms
• Frame width: typically 25 ms
$F^{(m)}(e^{i\omega}) = \sum_{n=-\infty}^{\infty} f_n \, w_{m-n} \, e^{-i\omega n}$

Fourier Transform in Practice
• Use “Fast Fourier Transform” (FFT)
• Requires the number of samples N to be a power of 2 (e.g.
N = 256)
• Code available
• Complexity: $O(N \log N)$

Established Window Functions
• Use to get sharper peaks
• Rectangular window: $w_n^R = 1$
• Generalized Hamming window: $w_n^H = (1 - \alpha) - \alpha \cos\left(\frac{2\pi n}{N-1}\right)$ ($\alpha = 0.46$: standard Hamming window)
• Gauss window: $w_n^G = \exp\left(-0.5 \left(\frac{n - N/2}{3N/2}\right)^2\right)$
• Parabola window: $w_n^P = 4 \, \frac{n}{N} \left(1 - \frac{n}{N}\right)$
• $n = 0, \dots, N-1$; window functions vanish outside this interval

Rewrite of Fourier Transform
• Definition: $F^{(m)}(e^{i\omega}) = \sum_{n=-\infty}^{\infty} f_n \, w_{m-n} \, e^{-i\omega n}$
• Window functions vanish outside the interval $n = 0, \dots, N-1$
• Define $\omega_\nu = 2\pi\nu \frac{1}{N}$:
$F_\nu^{(m)} = \sum_{n=0}^{N-1} f_{m+n} \, w_n \, e^{-i 2\pi \nu n / N}$
Note: for further processing, we take the absolute value of the Fourier transform.

Example for "ö"
[Figure: short-time spectrum and smoothed spectrum, each over frequency (Hz)]

Spectrogram
• Calculate a spectrum for any point in time
• Code the local intensity: color/grey scale
[Figure: spectrogram over time]

Spectrogram
http://www.wilhelmkurzsoftware.de/dynaplot/applicationnotes/spectrogram.htm
"To return to the main menu, press the star key".

Use Praat to generate a Spectrogram
• Praat: software for doing phonetics by computer
• Written by: Paul Boersma and David Weenink
• Quite powerful: spectrograms, formants, pitch, …
• Download: http://www.fon.hum.uva.nl/praat/

Smoothing the Spectrogram: Filterbank
• Idea: imitate the ear
• Do an average over neighboring frequencies
• Scale the frequencies according to the mel or the Bark scale
→ Reduction from 256 Fourier coefficients to 24 outputs of a filterbank

Example of a Filterbank
[figure not preserved]

Filterbank
• Spacing of center frequency:
  – According to mel scale: $\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$
• Low frequency cut off:
  – E.g. 300 Hz (for telephone speech)
• High frequency cut off:
  – E.g. 3400 Hz (for telephone speech)
• Different settings for e.g. a headset connected to a PC

Frequency Scales
The human ear has different responses at different frequencies.
Two scales are common:
• Mel scale: $\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$
• Bark scale (from Traunmüller 1990): $\mathrm{Bark}(f) = \frac{26.81 f}{1960 + f} - 0.53$
[Figure: two plots over frequency]

Mel versus Hz
[figure not preserved]

Example Mel Filterbank
[figure not preserved]

Vocal Tract Length Normalization
• Idea:
  – Average position of formants depends on length of vocal tract
  → Vary the position of the filterbank frequencies
• A kind of speaker adaptation

Vocal Tract Length Normalization: Frequency Warping
[figure not preserved]

Learning the Warp Factor α
• Issue: how to scale for a specific speaker
• Slow version:
  – Use 11 different warping factors
  – Do speech recognition with all of them
  – Pick the best one
• Oldest approach
• Not very efficient
• Improvement: 10% fewer recognition errors

Homomorphic Transforms: Cepstrum
• Refers to the idea of transforming a signal such that the components are linearly combined
• Allows nonlinear processing in a linear environment
• Convolution becomes summation under homomorphic processing

From Spectrum to Cepstrum
• Name: swapping of letters ("spec" becomes "ceps")
• Idea: separate out the convolutional contribution
• Useful as a preparation to remove channel distortions (e.g. telephone)
• Cepstral mean subtraction (CMS)

Definition “Cepstrum”
Signal → Fourier Transform → Spectrum → log → Discrete Cosine Transform → Cepstrum

Math for Cepstrum
• $e_n$: original signal (e.g. excitation from glottis)
• $f_n$: measured signal
• $h_n$: impulse response of channel (e.g.
vocal tract)
• The measured signal is the convolution of the two: $f_m = \sum_{n=-\infty}^{\infty} h_{m-n} \, e_n$
• Apply Fourier transform: $F\{f_m\} = F\left\{\sum_{n=-\infty}^{\infty} h_{m-n} \, e_n\right\}$
• Use convolution theorem: $F\{f_n\} = F\{h_n\} \cdot F\{e_n\}$
• Apply logarithm: $\log(F\{f_n\}) = \log(F\{h_n\}) + \log(F\{e_n\})$
• Impulse response and excitation are now separated: the product has become a sum

Discrete Cosine Transform (DCT)
• DCT expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies
• DCT is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers

Cepstrum: do discrete cosine transform after log
• Discrete cosine transform:
$c_0^{(m)} = \sqrt{2/N} \sum_{\nu=0}^{N/2-1} \log(F_\nu^{(m)})$
$c_q^{(m)} = \sqrt{4/N} \sum_{\nu=0}^{N/2-1} \log(F_\nu^{(m)}) \cos\left(\frac{\pi q (2\nu + 1)}{N}\right)$

Dynamic Features
• Cepstrum captures local aspects of speech
• Window size 25 ms
• Capture slow changes in spectrum
• Other name: delta features
• Calculate first and second derivatives
• Naïve approach to first derivative
  – Continuous function: $\frac{df(t)}{dt} \approx \frac{f(t + \Delta t) - f(t - \Delta t)}{2 \Delta t}$
  – Time discrete sampling: $\frac{df(t_m)}{dt} \approx \frac{f(t_{m+1}) - f(t_{m-1})}{2 \cdot 1}$

Difference/Regression
[Figure: i-th component of the feature vector over samples m-3 ... m+3, showing the line through the extremes and the regression curve]

Regression Formula
$\frac{df(t_m)}{dt} \approx \frac{\sum_{i=1}^{M} i \, \left(f(t_{m+i}) - f(t_{m-i})\right)}{2 \sum_{i=1}^{M} i^2}$
How could you derive this formula?
• Check M = 1: the formula reduces to the naïve difference $\frac{f(t_{m+1}) - f(t_{m-1})}{2}$

Dynamic Features
• Invented by Furui 1981
• Standard in any modern ASR system
• Alternative:
  – Linear mapping of neighboring feature vectors
• Issue:
  – Dimension of feature vectors

Linear Discriminant Analysis
• Method to decrease size of feature vector
• Maximize separability of class regions
• Linear transform of feature vectors

Complete Pipeline for Mel-Frequency Cepstral Coefficients (MFCC)
Processing steps, with typical values in parentheses:
Signal (sampling: 16 kHz; 16-bit quantization)
→ Pre-emphasis
→ Windowing (window size: 25 ms)
→ Fast Fourier Transform (512 Fourier coefficients)
→ Absolute value
→ Mel-scaled filterbank:
24 filterbank values
→ log
→ Discrete Cosine Transform (keep only the 20 lowest cepstra)
→ Dynamic features (1st and 2nd derivative)
→ Feature vectors (60-dimensional vector)
→ Linear Discriminant Analysis

MFCC (Mel frequency cepstral coefficients)
• Widely used in speech recognition
1. Take the Fourier transform of the signal
2. Map the amplitudes of the spectrum onto the mel scale (filterbank) and take the log
3. Take the discrete cosine transform of the mel log-amplitudes
4. The MFCCs are the amplitudes of the resulting spectrum

Complete Set of Features: MFCC
• Extract a feature vector from each frame
• 12 MFCCs (Mel frequency cepstral coefficients) + 1 normalized energy = 13 features
• Delta MFCC = 13
• Delta-Delta MFCC = 13
• Total: 39 features (39-dimensional feature vector)
• Inverted MFCCs: [figure not preserved]

Alternative Feature Extraction Methods
• LP-Cepstrum (LP = linear prediction)
  – Derived from speech coding
  – No longer much in use
• PLP (= perceptual linear prediction)
  – Popular for certain applications
  – Claim: more noise robust than MFCCs
  – Main change: uses $(\cdot)^{1/3}$ instead of log in MFCC

Summary
• Classical “plain vanilla” feature extraction: Mel-Frequency Cepstral Coefficients
• Main deficiency: not very noise robust
• Used in
  – Speech Recognition
  – Speaker Recognition
  – Music genre classification
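
The complete pipeline summarized in these slides can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation. All function names are hypothetical. The triangular filter shape, the zero-padding of each 25 ms frame to 512 samples, the repetition of edge frames in the delta computation, the regression width M = 2, and the small floor added before the logarithm are choices made here that the slides do not specify. The final Linear Discriminant Analysis step is left out. The recursive FFT is only there to keep the sketch self-contained; a real system would normally call an optimized FFT library.

```python
import cmath
import math

def pre_emphasis(signal, alpha=0.95):
    """f'_n = f_n - alpha * f_{n-1}; the first sample is passed through."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(N, alpha=0.46):
    """Generalized Hamming window for n = 0 ... N-1."""
    return [(1 - alpha) - alpha * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * math.pi * k / N) * odd[k]
        out[k], out[k + N // 2] = even[k] + t, even[k] - t
    return out

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate, f_low, f_high):
    """Triangular filters (assumed shape), centers equally spaced in mel."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    edges = [mel_to_hz(lo + (hi - lo) * i / (n_filters + 1)) * n_fft / sample_rate
             for i in range(n_filters + 2)]  # fractional FFT-bin positions
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = edges[j - 1], edges[j], edges[j + 1]
        bank.append([(k - left) / (center - left) if left < k <= center
                     else (right - k) / (right - center) if center < k < right
                     else 0.0
                     for k in range(n_fft // 2)])
    return bank

def dct(x, n_out):
    """Orthonormal DCT-II of x, keeping the n_out lowest coefficients."""
    K = len(x)
    return [math.sqrt((1.0 if q == 0 else 2.0) / K)
            * sum(x[v] * math.cos(math.pi * q * (2 * v + 1) / (2 * K))
                  for v in range(K))
            for q in range(n_out)]

def deltas(features, M=2):
    """Regression formula over +/-M frames; edge frames are repeated."""
    if not features:
        return []
    T = len(features)
    denom = 2.0 * sum(i * i for i in range(1, M + 1))
    def frame(t):
        return features[min(max(t, 0), T - 1)]
    return [[sum(i * (frame(t + i)[d] - frame(t - i)[d]) for i in range(1, M + 1))
             / denom
             for d in range(len(features[0]))]
            for t in range(T)]

def mfcc(signal, sample_rate=16000, frame_ms=25, shift_ms=10, n_fft=512,
         n_filters=24, n_ceps=20, f_low=0.0, f_high=None):
    """Return one 3 * n_ceps vector (static, delta, delta-delta) per frame."""
    f_high = sample_rate / 2 if f_high is None else f_high
    width = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    window = hamming(width)
    bank = mel_filterbank(n_filters, n_fft, sample_rate, f_low, f_high)
    x = pre_emphasis(signal)
    static = []
    for start in range(0, len(x) - width + 1, shift):
        frame = [x[start + n] * window[n] for n in range(width)]
        frame += [0.0] * (n_fft - width)  # zero-pad to the FFT length
        mag = [abs(c) for c in fft(frame)[:n_fft // 2]]
        energies = [sum(w * m for w, m in zip(row, mag)) for row in bank]
        static.append(dct([math.log(e + 1e-10) for e in energies], n_ceps))
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With the defaults, which follow the typical values on the pipeline slide, calling `mfcc(samples)` on 0.1 s of audio sampled at 16 kHz (1600 samples) gives 8 frames of 60 values each: 20 static cepstra plus their first and second derivatives. For telephone speech, the cut-offs from the filterbank slide would be passed as `f_low=300.0, f_high=3400.0`, ideally together with a sampling rate and filter count suited to that narrower band.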