Speech & Audio Processing & Recognition

EE E6820: Speech & Audio Processing & Recognition
Lecture 1: Introduction & DSP

Dan Ellis <dpwe@ee.columbia.edu>
Mike Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/dpwe/e6820
January 22, 2009

Dan Ellis (Ellis & Mandel), Intro & DSP, January 22, 2009

Outline

1 Sound and information
2 Course Structure
3 DSP review: Time-scale modification

Sound and information

Sound is air pressure variation:
mechanical vibration → pressure waves in air → motion of sensor → time-varying voltage v(t).
Transducers convert air pressure ↔ voltage.

What use is sound?

Footsteps examples:
[Figure: waveforms of two footsteps recordings, 0-5 s]

Hearing confers an evolutionary advantage:
- useful information, complements vision
- ... at a distance, in the dark, around corners
- listeners are highly adapted to 'natural sounds' (including speech)

The scope of audio processing

[Figure: overview of audio processing applications]

The acoustic communication chain

[Figure: message → signal → channel → receiver → decoder, with audio synthesis, recognition, and processing acting at different stages of the chain]

Sound is an information bearer: received sound reflects the source(s) plus the effect of the environment (the channel).

Levels of abstraction

Much processing concerns shifting between levels of abstraction:
abstract 'information' ↔ representation (e.g. t-f energy) ↔ concrete sound p(t)
(synthesis moves toward the concrete signal; analysis moves back toward the abstract message)

Different representations serve different tasks: separating aspects, making things explicit, ...
Course structure

Goals:
- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies

Structure:
- weekly assignments (25%)
- midterm event (25%)
- final project (50%)

Text: Speech and Audio Signal Processing, Ben Gold & Nelson Morgan. Wiley, 2000. ISBN 0471351547.

Web-based

Course website: http://www.ee.columbia.edu/dpwe/e6820/
for lecture notes, problem sets, examples, ...
+ student web pages for homework, etc.

Course outline

Fundamentals:
- L1: DSP
- L2: Acoustics
- L3: Auditory perception
- L4: Pattern recognition

Audio processing:
- L5: Signal models
- L6: Music analysis/synthesis
- L7: Audio compression
- L8: Spatial sound & rendering

Applications:
- L9: Speech recognition
- L10: Music retrieval
- L11: Signal separation
- L12: Multimedia indexing

Weekly assignments

Research papers:
- journal & conference publications
- summarize & discuss in class
- written summaries on web page + Courseworks discussion

Practical experiments:
- Matlab-based (+ Signal Processing Toolbox)
- direct experience of sound processing
- skills for the project

Book sections

Final project

The most significant part of the course (50% of grade).
Oral proposals mid-semester; presentations in the final class + website.

Scope:
- practical (Matlab recommended)
- identify a problem; try some solutions
- evaluation

Topic:
- few restrictions within the world of audio
- investigate other resources
- develop in discussion with me

Citation vs. plagiarism.

Examples of past projects

- Automatic prosody classification
- Model-based note transcription

DSP review: digital signals

Discrete-time sampling limits bandwidth:
x_d[n] = Q( x_c(nT) )
with sampling interval T and sampling frequency Ω_T = 2π/T.

Discrete-level quantization limits dynamic range:
Q(y) = ε ⌊ y/ε + 1/2 ⌋
i.e. rounding to the nearest multiple of the step size ε.

The speech signal: time domain

Speech is a sequence of different sound types:
- Vowel: periodic ("has")
- Fricative: aperiodic ("watch")
- Glide: smooth transition ("watch")
- Stop burst: transient ("dime")

[Figure: waveform of "has a watch thin as a dime" with zoomed views of each sound type]

Time-scale modification (TSM)

Can we modify a sound to make it 'slower', i.e. speech pronounced more slowly, e.g.
to help comprehension or analysis? Or more quickly, for 'speed listening'?

Why not just slow it down?
x_s(t) = x_o(t/r),  r = slowdown factor (r > 1 ⇒ slower)
This is equivalent to playback at a different sampling rate.

[Figure: original waveform vs. the same passage slowed by r = 2]

Time-domain TSM

Problem: we want to preserve local time structure but alter global time structure.

Repeat segments:
- but: artifacts from abrupt edges

Cross-fade & overlap:
y^m[mL + n] = y^{m-1}[mL + n] + w[n] · x[⌊mL/r⌋ + n]

[Figure: overlapping windowed segments cross-faded into the stretched output]

Synchronous overlap-add (SOLA)

Idea: allow some leeway in placing each window, to optimize the alignment of the waveforms. Hence,

y^m[mL + n] = y^{m-1}[mL + n] + w[n] · x[⌊mL/r⌋ + n + K_m]

where the shift K_m is chosen by normalized cross-correlation to maximize alignment:

K_m = argmax_{0 ≤ K ≤ K_u} [ Σ_{n=0}^{N_ov} y^{m-1}[mL+n] · x[⌊mL/r⌋+n+K] ] / √[ Σ_n (y^{m-1}[mL+n])² · Σ_n (x[⌊mL/r⌋+n+K])² ]

The Fourier domain

Fourier Series (periodic continuous x):
x(t) = Σ_k c_k e^{j k Ω₀ t}
c_k = (1/T) ∫_{-T/2}^{T/2} x(t) e^{-j k Ω₀ t} dt

Fourier Transform (aperiodic continuous x):
x(t) = (1/2π) ∫ X(jΩ) e^{jΩt} dΩ
X(jΩ) = ∫ x(t) e^{-jΩt} dt

Discrete-time Fourier

DT Fourier Transform (aperiodic sampled x):
x[n] = (1/2π) ∫_{2π} X(e^{jω}) e^{jωn} dω
X(e^{jω}) = Σ_n x[n] e^{-jωn}

Discrete Fourier Transform (N-point x):
x[n] = (1/N) Σ_{k=0}^{N-1} X[k] e^{j 2πkn/N}
X[k] = Σ_{n=0}^{N-1} x[n] e^{-j 2πkn/N},  k = 0, ..., N-1
(X[k] = X(e^{jω}) sampled at ω = 2πk/N)
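The SOLA update above, overlap-add with a cross-correlation search for the best alignment lag K_m, can be sketched in Python (the course's practicals use Matlab; the function name and the window, hop, and search-range parameters below are illustrative choices, not values from the lecture):

```python
import numpy as np

def sola(x, r, N=800, L=200, Ku=150):
    """Time-scale modification by synchronous overlap-add (sketch).

    r > 1 slows the sound down. N = window length, L = synthesis hop,
    Ku = maximum alignment lag searched by cross-correlation.
    """
    w = np.hanning(N)
    n_out = int(len(x) * r)
    y = np.zeros(n_out + N)
    wsum = np.zeros_like(y)          # running window sum, for normalization
    Nov = N - L                      # overlap with already-written output
    m = 0
    while int(m * L / r) + N + Ku <= len(x):
        pos = int(m * L / r)         # analysis position, ~ floor(mL/r)
        K = 0
        if m > 0:
            prev = y[m * L : m * L + Nov]     # y^{m-1} overlap region
            best = -np.inf
            for q in range(Ku + 1):           # K_m = argmax of normalized xcorr
                seg = x[pos + q : pos + q + Nov]
                score = np.dot(prev, seg) / (
                    np.linalg.norm(prev) * np.linalg.norm(seg) + 1e-12)
                if score > best:
                    best, K = score, q
        y[m * L : m * L + N] += w * x[pos + K : pos + K + N]
        wsum[m * L : m * L + N] += w
        m += 1
    return y[:n_out] / np.maximum(wsum[:n_out], 1e-8)

# 2x-slower version of a 440 Hz tone: the output has twice as many samples,
# and aligned overlaps keep the waveform periodic rather than choppy
fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
y = sola(x, r=2.0)
```

Dividing by the accumulated window sum normalizes regions covered by a varying number of overlapped windows; a production implementation would also handle the signal tail and choose N and L relative to the pitch period.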
Sampling and aliasing

Discrete-time signals equal the continuous-time signal at the discrete sampling instants:
x_d[n] = x_c(nT)

Sampling cannot represent rapid fluctuations:
sin( (Ω_M + 2π/T) nT ) = sin( Ω_M nT )  ∀ n ∈ ℤ

The Nyquist limit (Ω_T/2) follows from the periodic spectrum: energy above Ω_T/2 appears as an "alias" of the "baseband" signal.

[Figure: spectrum G(jΩ) and its periodic images after sampling, showing the alias of the baseband signal]

Speech sounds in the Fourier domain

[Figure: time-domain and log-spectrum views of each sound type: vowel (periodic, "has"), fricative (aperiodic, "watch"), glide (transition, "watch"), stop (transient, "dime")]

dB = 20 log₁₀(amplitude) = 10 log₁₀(power)

The voiced spectrum has pitch + formants.

Short-time Fourier Transform

We want to localize energy in both time and frequency:
- break the sound into short-time pieces
- calculate the DFT of each one

Mathematically,
X[k, m] = Σ_{n=0}^{N-1} x[n] · w[n - mL] · exp( -j 2πk(n - mL)/N )

The Spectrogram

Plot the STFT magnitude |X[k, m]| as a grayscale image.

[Figure: waveform and its spectrogram, 0-4000 Hz]

Time-frequency tradeoff

A longer window w[n] gains frequency resolution at the cost of time resolution.
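The STFT sum above maps directly to code: window each length-N piece at hop L and take its DFT. A minimal Python sketch (numpy in place of the course's Matlab; the window and hop sizes are arbitrary illustrative choices):

```python
import numpy as np

def stft(x, N=256, L=128):
    """Short-time Fourier transform:
    X[k, m] = sum_n x[n] * w[n - mL] * exp(-j*2*pi*k*(n - mL)/N),
    i.e. window each length-N piece (hop L) and take its DFT."""
    w = np.hanning(N)
    M = (len(x) - N) // L + 1                 # number of short-time pieces
    X = np.empty((N // 2 + 1, M), dtype=complex)
    for m in range(M):
        X[:, m] = np.fft.rfft(x[m * L : m * L + N] * w)  # DFT of windowed piece
    return X

# The spectrogram is 20*log10(|X[k, m]|). For a 1 kHz tone sampled at 8 kHz,
# energy concentrates in bin k = 1000 / (8000/256) = 32 in every frame.
fs, N = 8000, 256
x = np.sin(2 * np.pi * 1000 * np.arange(2048) / fs)
X = stft(x, N=N)
spectrogram_db = 20 * np.log10(np.abs(X) + 1e-10)
```

Changing N trades time against frequency resolution exactly as the slide describes: a 48-point window smears the tone across wide bins but follows fast amplitude changes, while a 256-point window does the opposite.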
[Figure: wideband (48-point window) vs. narrowband (256-point window) spectrograms of the same utterance]

Speech sounds on the Spectrogram

The spectrogram is the most popular speech visualization.

[Figure: spectrogram of "has a watch thin as a dime", marking the vowel ("has"), glide ("watch"), fricative ("watch"), and stop ("dime")]

Wideband (short window) is better than narrowband (long window) for seeing formants.

TSM with the Spectrogram

Just stretch out the spectrogram?

[Figure: a spectrogram and its time-stretched copy]

But how to resynthesize? The spectrogram is only |Y[k, m]|.

The Phase Vocoder

Time-scale modification in the STFT domain.

Take the magnitude from the 'stretched' spectrogram:
|Y[k, m]| = |X[k, m/r]|
- e.g. by linear interpolation

But preserve the phase increment between slices:
Δθ_Y[k, m] = Δθ_X[k, m/r]
- e.g. by a discrete differentiator

This does the right thing for a single sinusoid:
- it keeps the overlapped parts of the sinusoid aligned

[Figure: phase advancing at rate θ̇ = Δθ/ΔT, so a 2x stretch advances by Δθ' = θ̇ · 2ΔT]

General issues in TSM

Time window:
- stretching a narrowband spectrogram

Malleability of different sounds:
- vowels stretch well; stops lose their nature

Not a well-formed problem:
- we want to alter time without frequency ... but time and frequency are not separate
- a 'satisfying' result is a subjective judgment
⇒ the solution depends on auditory perception ...
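The two phase-vocoder rules above (interpolate magnitudes at positions m/r, accumulate the measured inter-slice phase increments) can be sketched as follows. This is a simplified illustration: it omits the per-bin phase unwrapping a production phase vocoder would add, and all names and parameters are assumptions rather than values from the lecture:

```python
import numpy as np

def phase_vocoder(X, r):
    """Stretch an STFT X[k, m] in time by factor r (r > 1 => slower), sketch.

    |Y[k, i]| = |X[k, i/r]| by linear interpolation between slices, while the
    synthesis phase advances by the measured per-hop phase increment, keeping
    overlapped portions of a sinusoid aligned. (No per-bin phase unwrapping.)
    """
    K, M = X.shape
    steps = np.arange(0, M - 1, 1.0 / r)      # analysis positions i/r
    Y = np.empty((K, len(steps)), dtype=complex)
    phase = np.angle(X[:, 0])                 # running synthesis phase
    for i, s in enumerate(steps):
        m, frac = int(s), s - int(s)
        # magnitude from the 'stretched' spectrogram, by linear interpolation
        mag = (1 - frac) * np.abs(X[:, m]) + frac * np.abs(X[:, m + 1])
        Y[:, i] = mag * np.exp(1j * phase)
        # preserve the phase increment between analysis slices
        phase = phase + np.angle(X[:, m + 1]) - np.angle(X[:, m])
    return Y

# Build a toy STFT of a 500 Hz tone and stretch it 2x: the stretched STFT has
# twice as many frames, with the first frame's magnitudes unchanged
fs, N, L = 8000, 256, 128
x = np.sin(2 * np.pi * 500 * np.arange(4096) / fs)
w = np.hanning(N)
X = np.array([np.fft.rfft(w * x[m * L : m * L + N])
              for m in range((len(x) - N) // L + 1)]).T
Y = phase_vocoder(X, 2.0)
```

Resynthesis would overlap-add the inverse DFTs of the Y columns; the accumulated phase is what keeps those overlapped pieces of each sinusoid in step.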
Summary

Information in sound:
- lots of it, at multiple levels of abstraction

Course overview:
- survey of audio processing topics
- practicals, readings, project

DSP review:
- digital signals, time domain
- Fourier domain, STFT

Time-scale modification:
- properties of the speech signal
- time-domain methods
- phase vocoder

References

J. L. Flanagan and R. M. Golden. Phase vocoder. Bell System Technical Journal, pages 1493-1509, 1966.

M. Dolson. The phase vocoder: A tutorial. Computer Music Journal, 10(4):14-27, 1986.

M. Puckette. Phase-locked vocoder. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 222-225, 1995.

A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. In Proc. 13th European Signal Processing Conference (EUSIPCO), Antalya, Turkey, 2005.