Applied Speech and Audio Processing with MATLAB Examples

EE E6820: Speech & Audio Processing & Recognition
Lecture 1: Introduction & DSP

Dan Ellis <dpwe@ee.columbia.edu>
Mike Mandel <mim@ee.columbia.edu>
Columbia University, Dept. of Electrical Engineering
http://www.ee.columbia.edu/dpwe/e6820
January 22, 2009

Outline:
1. Sound and information
2. Course structure
3. DSP review: timescale modification

Sound and information

Sound is air pressure variation: mechanical vibration produces pressure waves in air, which drive the motion of a sensor; a transducer converts that motion into a time-varying voltage v(t). Transducers convert air pressure to voltage (and back).

What use is sound?

[Figure: two example footsteps waveforms, 0-5 s]

Hearing confers an evolutionary advantage: it provides useful information that complements vision, working at a distance, in the dark, and around corners. Listeners are highly adapted to 'natural sounds' (including speech).

The scope of audio processing

[Figure: overview of audio processing application areas]

The acoustic communication chain

message -> signal -> channel -> receiver -> decoder
(synthesis, audio processing, and recognition operate at different points along this chain)

Sound is an information bearer: received sound reflects the source(s) plus the effect of the environment (the channel).

Levels of abstraction

Much processing concerns shifting between levels of abstraction: analysis moves upward from the concrete sound pressure p(t) toward an abstract 'information' representation (e.g. time-frequency energy), while synthesis moves back down. Different representations serve different tasks: separating aspects, making things explicit, ...
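The step from a concrete waveform up to a time-frequency energy representation can be sketched directly. The course's practicals are Matlab-based, but here is a minimal NumPy sketch; the function name `tf_energy` and the frame parameters are illustrative choices, not from the lecture:

```python
import numpy as np

def tf_energy(x, n_fft=256, hop=128):
    """One step up the abstraction ladder: from a raw waveform x (the
    concrete p(t)) to a time-frequency energy representation, computed
    as squared magnitudes of windowed DFT frames."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([w * x[i * hop:i * hop + n_fft]
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)   # one DFT per frame
    return np.abs(spectra) ** 2             # shape: (frames, n_fft//2 + 1)

# A 1 kHz tone sampled at 8 kHz concentrates its energy in one bin
sr = 8000
t = np.arange(sr) / sr
E = tf_energy(np.sin(2 * np.pi * 1000 * t))
```

For an 8000-sample input this yields 61 frames of 129 frequency bins, with the tone's energy peaking in bin 1000 / 8000 * 256 = 32: the same sound, made explicit in a different representation.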
Goals

- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies

Course structure

- weekly assignments (25%)
- midterm event (25%)
- final project (50%)

Text: Ben Gold & Nelson Morgan, Speech and Audio Signal Processing, Wiley, 2000. ISBN 0-471-35154-7.

Web-based

Course website: http://www.ee.columbia.edu/dpwe/e6820/ for lecture notes, problem sets, examples, ... plus student web pages for homework, etc.

Course outline

Fundamentals:     L1: DSP; L2: Acoustics; L3: Auditory perception; L4: Pattern recognition
Audio processing: L5: Signal models; L6: Music analysis/synthesis; L7: Audio compression; L8: Spatial sound & rendering
Applications:     L9: Speech recognition; L10: Music retrieval; L11: Signal separation; L12: Multimedia indexing

Weekly assignments

Research papers: journal & conference publications; summarize & discuss in class; written summaries on the web page, plus Courseworks discussion.
Practical experiments: Matlab-based (+ Signal Processing Toolbox); direct experience of sound processing; builds skills for the project.
Book sections.

Final project

The most significant part of the course (50% of the grade). Oral proposals mid-semester; presentations in the final class, plus a website.
Scope: practical (Matlab recommended); identify a problem and try some solutions; include an evaluation.
Topic: few restrictions within the world of audio; investigate other resources; develop it in discussion with me.
Citation & plagiarism rules apply.

Examples of past projects: automatic prosody classification; model-based note transcription.

DSP review: digital signals

Discrete-time sampling limits bandwidth:
    x_d[n] = Q( x_c(nT) )
where T is the sampling interval and the sampling frequency is Ω_T = 2π/T. Discrete-level quantization limits dynamic range: the quantizer Q(y) is a staircase function with step size ε.

The speech signal: time domain

Speech is a sequence of different sound types:
- Vowel: periodic ("has")
- Fricative: aperiodic ("watch")
- Glide: smooth transition ("watch")
- Stop burst: transient ("dime")

[Figure: waveform of "has a watch thin as a dime", with zoomed panels showing each sound type]

Timescale modification (TSM)

Can we modify a sound to make it 'slower', i.e. speech pronounced more slowly (e.g. to help comprehension or analysis), or more quickly for 'speed listening'?

Why not just slow it down?
    x_s(t) = x_o(t / r),   r = slowdown factor (r > 1 means slower)
This is equivalent to playback at a different sampling rate, which also scales every frequency by 1/r and so changes the pitch.

[Figure: original waveform vs. the same region resampled at r = 2, 2.35-2.6 s]

Time-domain TSM

Problem: we want to preserve local time structure while altering global time structure. Simply repeating segments introduces artifacts from the abrupt edges, so instead we cross-fade & overlap windowed segments:
    y^m[mL + n] = y^{m-1}[mL + n] + w[n] · x[⌊mL/r⌋ + n]
where L is the synthesis hop, w[n] is a tapered window, and y^m is the output after adding the m-th segment.

[Figure: segments 1-6 cross-faded and overlapped into the output]

Synchronous overlap-add (SOLA)

Idea: allow some leeway in placing each window, to optimize the alignment of the overlapping waveforms. An offset K_m maximizes the alignment of the new segment with the existing output:
    y^m[mL + n] = y^{m-1}[mL + n] + w[n] · x[⌊mL/r⌋ + n + K_m]
where K_m is chosen by normalized cross-correlation over the N_ov overlapping samples:
    K_m = argmax_{0 ≤ K ≤ K_max}  ( Σ_{n=0}^{N_ov} y^{m-1}[mL + n] · x[⌊mL/r⌋ + n + K] ) / √( Σ_n (y^{m-1}[mL + n])² · Σ_n (x[⌊mL/r⌋ + n + K])² )
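The cross-fade and alignment steps above can be sketched in code. The course's practicals use Matlab, but as an illustration here is a minimal NumPy version of SOLA; the function name `sola_stretch` and the grain, hop, and search-range values are my own choices, not the lecture's:

```python
import numpy as np

def sola_stretch(x, r, N=800, L=400, k_max=200):
    """Sketch of synchronous overlap-add (SOLA) time-stretching.
    Windowed grains taken from x at hop L/r are cross-faded into the
    output at hop L; each grain is shifted by an offset K (0..k_max)
    that maximizes its normalized cross-correlation with the audio
    already in the output buffer, so overlapping waveforms align."""
    w = np.hanning(N)                        # cross-fade window
    y = np.zeros(int(len(x) * r) + N + k_max)
    m = 0
    while True:
        a = int(np.floor(m * L / r))         # analysis position in x
        if a + N + k_max > len(x):
            break
        s = m * L                            # synthesis position in y
        k = 0
        if m > 0:                            # align against existing output
            ref = y[s:s + L]
            best = -np.inf
            for K in range(k_max):
                seg = x[a + K:a + K + L]
                c = np.dot(ref, seg) / (
                    np.sqrt(np.sum(ref**2) * np.sum(seg**2)) + 1e-12)
                if c > best:
                    best, k = c, K
            # k now plays the role of K_m in the SOLA equation
        y[s:s + N] += w * x[a + k:a + k + N]
        m += 1
    return y[:int(len(x) * r)]
```

With r = 2, an 8000-sample input yields a 16000-sample output: local waveform periods are preserved while the global duration doubles, avoiding the pitch change of simple resampling.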
The Fourier domain

Fourier Series (periodic continuous x, period T, fundamental Ω_0 = 2π/T):
    x(t) = Σ_k c_k e^{j k Ω_0 t}
    c_k = (1/T) ∫_{-T/2}^{T/2} x(t) e^{-j k Ω_0 t} dt

Fourier Transform (aperiodic continuous x):
    x(t) = (1/2π) ∫ X(jΩ) e^{jΩt} dΩ
    X(jΩ) = ∫ x(t) e^{-jΩt} dt

[Figure: a waveform x(t) and its magnitude spectrum |X(jΩ)| in dB, 0-8000 Hz]

Discrete-time Fourier

Discrete-time Fourier Transform (aperiodic sampled x):
    x[n] = (1/2π) ∫_{-π}^{π} X(e^{jω}) e^{jωn} dω
    X(e^{jω}) = Σ_n x[n] e^{-jωn}
X(e^{jω}) is periodic in ω with period 2π.

Discrete Fourier Transform (N-point x):
    x[n] = (1/N) Σ_{k=0}^{N-1} X[k] e^{j 2πkn/N}
    X[k] = Σ_{n=0}^{N-1} x[n] e^{-j 2πkn/N}
The X[k] are samples of X(e^{jω}) at ω = 2πk/N.

Sampling and aliasing

Discrete-time signals equal the continuous-time signal at the discrete sampling instants:
    x_d[n] = x_c(nT)
Sampling cannot represent rapid fluctuations:
    sin( (Ω_M + Ω_T) nT ) = sin( Ω_M nT )   ∀ n ∈ ℤ
so a component at Ω_M + Ω_T is an "alias" of the "baseband" component at Ω_M. The Nyquist limit (Ω_T / 2) follows from the periodicity of the sampled spectrum.

[Figure: spectra G_a(jΩ) and G_p(jΩ): baseband at ±Ω_M, aliases at ±(Ω_T − Ω_M), repeating with period Ω_T]
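The aliasing identity above is easy to verify numerically; since (Ω_M + Ω_T)nT = Ω_M nT + 2πn, the two sinusoids agree at every sample. A quick check (the 8 kHz rate and 440 Hz tone are arbitrary illustrative choices):

```python
import numpy as np

# Sampling setup: interval T, sampling frequency Omega_T = 2*pi/T
T = 1.0 / 8000.0                   # 8 kHz sampling, for illustration
Omega_T = 2 * np.pi / T
Omega_M = 2 * np.pi * 440.0        # a 440 Hz "baseband" component

n = np.arange(100)                 # sample indices
baseband = np.sin(Omega_M * n * T)
alias = np.sin((Omega_M + Omega_T) * n * T)   # frequency raised by Omega_T

# The two continuous-time sinusoids are indistinguishable after sampling:
print(np.allclose(baseband, alias))   # True
```

This is exactly why reconstruction is only unambiguous for components below the Nyquist limit Ω_T / 2.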