
Practical Cryptography PDF

Uploaded by

Tahir
Copyright
© All Rights Reserved
Mel Frequency Cepstral Coefficient (MFCC) tutorial

The first step in any automatic speech recognition system is to extract features, i.e. identify the components of the audio signal that are good for identifying the linguistic content and discard all the other stuff which carries information like background noise, emotion etc.

The main point to understand about speech is that the sounds generated by a human are filtered by the shape of the vocal tract including tongue, teeth etc. This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short time power spectrum, and the job of MFCCs is to accurately represent this envelope. This page will provide a short tutorial on MFCCs.

Mel Frequency Cepstral Coefficients (MFCCs) are a feature widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in the 1980's, and have been state-of-the-art ever since. Prior to the introduction of MFCCs, Linear Prediction Coefficients (LPCs) and Linear Prediction Cepstral Coefficients (LPCCs) (covered in a separate tutorial on cepstrum and LPCCs) were the main feature type for automatic speech recognition (ASR). This page will go over the main aspects of MFCCs, why they make a good feature for ASR, and how to implement them.

Steps at a Glance

We will give a high level intro to the implementation steps, then go in depth into why we do the things we do. Towards the end we will go into a more detailed description of how to calculate MFCCs.

1. Frame the signal into short frames.
2. For each frame calculate the periodogram estimate of the power spectrum.
3. Apply the mel filterbank to the power spectra, sum the energy in each filter.
4. Take the logarithm of all filterbank energies.
5. Take the DCT of the log filterbank energies.
6. Keep DCT coefficients 2-13, discard the rest.

There are a few more things commonly done: sometimes the frame energy is appended to each feature vector. Delta and Delta-Delta features are usually also appended. Liftering is also commonly applied to the final features.

Why do we do these things?

We will now go a little more slowly through the steps and explain why each of the steps is necessary.

An audio signal is constantly changing, so to simplify things we assume that on short time scales the audio signal doesn't change much (when we say it doesn't change, we mean statistically, i.e. statistically stationary; obviously the samples are constantly changing on even short time scales). This is why we frame the signal into 20-40 ms frames. If the frame is much shorter we don't have enough samples to get a reliable spectral estimate; if it is longer the signal changes too much throughout the frame.

The next step is to calculate the power spectrum of each frame. This is motivated by the human cochlea (an organ in the ear) which vibrates at different spots depending on the frequency of the incoming sounds. Depending on the location in the cochlea that vibrates (which wobbles small hairs), different nerves fire, informing the brain that certain frequencies are present. Our periodogram estimate performs a similar job for us, identifying which frequencies are present in the frame.

The periodogram spectral estimate still contains a lot of information not required for Automatic Speech Recognition (ASR). In particular the cochlea can not discern the difference between two closely spaced frequencies. This effect becomes more pronounced as the frequencies increase. For this reason we take clumps of periodogram bins and sum them up to get an idea of how much energy exists in various frequency regions.
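As a rough illustration of the framing and periodogram steps described above, here is a minimal pure-Python sketch. The function names, the Hamming window, the zero-padding of the last frame, the 1/N scaling and the 10 ms frame step are assumptions chosen for illustration; they are not taken from the author's own implementation. The direct DFT is slow and only there to keep the example free of dependencies.

```python
import cmath
import math


def frame_signal(signal, frame_len, frame_step):
    """Split samples into overlapping frames, zero-padding the final frame."""
    if len(signal) <= frame_len:
        num_frames = 1
    else:
        num_frames = 1 + math.ceil((len(signal) - frame_len) / frame_step)
    padded_len = (num_frames - 1) * frame_step + frame_len
    padded = list(signal) + [0.0] * (padded_len - len(signal))
    return [padded[i * frame_step:i * frame_step + frame_len]
            for i in range(num_frames)]


def hamming(n):
    """Hamming window of length n (an assumed window choice)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
            for i in range(n)]


def periodogram(frame, nfft):
    """Power spectrum estimate |DFT(frame)|^2 / N for bins 0 .. nfft // 2."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    power = []
    for k in range(nfft // 2 + 1):
        # Direct DFT, implicitly zero-padded to nfft points.
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / nfft)
                  for i, x in enumerate(windowed))
        power.append(abs(acc) ** 2 / n)
    return power


# Usage: 0.1 s of a 1 kHz tone at 8 kHz, 25 ms frames (200 samples), 10 ms step.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(800)]
frames = frame_signal(tone, frame_len=200, frame_step=80)
spectrum = periodogram(frames[0], nfft=256)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(len(frames), peak_bin * sample_rate / 256)
```

The 25 ms frame length sits inside the 20-40 ms range discussed above; the strongest bin of the first frame should fall at the tone's frequency.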
This summing is performed by our Mel filterbank: the first filter is very narrow and gives an indication of how much energy exists near 0 Hertz. As the frequencies get higher our filters get wider as we become less concerned about variations. We are only interested in roughly how much energy occurs at each spot. The Mel scale tells us exactly how to space our filterbanks and how wide to make them. See below for how to calculate the spacing.

(Source page: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/)

Once we have the filterbank energies, we take the logarithm of them. This is also motivated by human hearing: we don't hear loudness on a linear scale. Generally to double the perceived volume of a sound we need to put 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes our features match more closely what humans actually hear. Why the logarithm and not a cube root? The logarithm allows us to use cepstral mean subtraction, which is a channel normalisation technique.

The final step is to compute the DCT of the log filterbank energies. There are 2 main reasons this is performed. Because our filterbanks are all overlapping, the filterbank energies are quite correlated with each other.
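To make the Mel spacing mentioned above concrete, here is a small sketch of how filter edge frequencies could be placed. It assumes the widely used conversion m = 2595 log10(1 + f/700), which may differ in form from the formula on the full page, and the helper names (`hz_to_mel`, `mel_to_hz`, `mel_points`) are hypothetical.

```python
import math


def hz_to_mel(hz):
    """Convert Hertz to Mels using a common formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_points(low_hz, high_hz, num_filters):
    """Return num_filters + 2 frequencies (Hz) equally spaced on the Mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (num_filters + 1)
    return [mel_to_hz(low + i * step) for i in range(num_filters + 2)]


# Usage: edge frequencies for 10 filters between 300 Hz and 8000 Hz.
edges = mel_points(300.0, 8000.0, 10)
widths = [b - a for a, b in zip(edges, edges[1:])]
for edge in edges:
    print(round(edge, 1))
```

Because the points are equally spaced in Mels, the gaps between them in Hertz grow with frequency, which is why the filters get wider higher up.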
The DCT decorrelates the energies, which means diagonal covariance matrices can be used to model the features in e.g. a HMM classifier. But notice that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filterbank energies and it turns out that these fast changes actually degrade ASR performance, so we get a small improvement by dropping them.

... f(m+1), where M is the number of filters we want, and f() is the list of M+2 Mel-spaced frequencies.

The final plot of all 10 filters overlaid on each other is:

[Figure: the 10-filter Mel filterbank, plotted against frequency (Hz). A Mel-filterbank containing 10 filters. This filterbank starts at 0 Hz and ends at 8000 Hz. This is a guide only, the worked example above starts at 300 Hz.]

Deltas and Delta-Deltas

Also known as differential and acceleration coefficients. The MFCC feature vector describes only the power spectral envelope of a single frame, but it seems like speech would also have information in the dynamics, i.e. what are the trajectories of the MFCC coefficients over time. It turns out that calculating the MFCC trajectories and appending them to the original feature vector increases ASR performance by quite a bit (if we have 12 MFCC coefficients, we would also get 12 delta coefficients, which would combine to give a feature vector of length 24).

To calculate the delta coefficients, the following formula is used:

$$d_t = \frac{\sum_{n=1}^{N} n \, (c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}$$

where $d_t$ is a delta coefficient, from frame $t$, computed in terms of the static coefficients $c_{t+N}$ to $c_{t-N}$. A typical value for $N$ is 2. Delta-Delta (Acceleration) coefficients are calculated in the same way, but they are calculated from the deltas, not the static coefficients.

Implementations

I have implemented MFCCs in Python, available here. Use the 'Download ZIP' button on the right hand side of the page to get the code. Documentation can be found at readthedocs. If you have any troubles or queries about the code, you can leave a comment at the bottom of this page.
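The delta formula from the Deltas and Delta-Deltas section can be sketched in plain Python as follows. The formula does not say what to do near the first and last frames, so this sketch assumes the edge frames are repeated; that choice, and the function name `delta`, are illustrative assumptions rather than the author's implementation.

```python
def delta(features, n_context=2):
    """Delta coefficients for a list of frames (each frame a list of floats).

    Computes d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2).
    Frames beyond either end are replaced by the nearest edge frame (assumption).
    """
    num_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, n_context + 1))

    def frame_at(t):
        # Clamp the index so out-of-range frames repeat the edge frame.
        return features[min(max(t, 0), num_frames - 1)]

    deltas = []
    for t in range(num_frames):
        row = []
        for j in range(len(features[t])):
            num = sum(n * (frame_at(t + n)[j] - frame_at(t - n)[j])
                      for n in range(1, n_context + 1))
            row.append(num / denom)
        deltas.append(row)
    return deltas


# Usage: one coefficient rising by 1 per frame gives a delta of 1 away from the edges.
ramp = [[float(t)] for t in range(6)]
d = delta(ramp)            # deltas of the static coefficients
dd = delta(d)              # delta-deltas are computed from the deltas
stacked = [c + a + b for c, a, b in zip(ramp, d, dd)]
print(d[2], d[3], len(stacked[0]))
```

Stacking static, delta and delta-delta values per frame triples the feature length, in the same way that 12 MFCCs plus 12 deltas give a vector of length 24.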
There is a good MATLAB implementation of MFCCs over here.

References

Davis, S. Mermelstein, P. (1980) Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. In IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28 No. 4, pp. 357-366

X. Huang, A. Acero, and H. Hon. Spoken Language Processing: A guide to theory, algorithm, and system development

225 Comments

You need to write a book. You are a master at explaining these concepts. I like the fact that you never assume prerequisite knowledge is "obvious," and explain every detail. Thank you so much.

James Lyons (reply): thanks for the compliment, I am glad you found it useful.

I'm so sorry for this request. Can someone help me by explaining step 4 of computing the Mel filterbank with actual data, in just 2-3 steps? My English is not very good. Thank you.

Hi, it is a very nice tutorial. I am a complete beginner in the speech processing field. I searched for some voice feature extraction software and found one. I have a sound sample; by applying window length 0.015 and time step 0.005, I have extracted 12 MFCC features for 171 frames directly from the sound sample by using a software tool called PRAAT. Now I have all 12 MFCC coefficients for each frame. My question is that now I want to process them further, making a 39 dimensional matrix by adding the energy feature and delta-delta features, and apply DTW. I don't know how to deal with the coefficients and how to make delta-delta coefficients; I am having trouble using the above formula.
Can you please guide me step by step? I am a complete beginner and in a lot of trouble.

I never thought that I could understand this topic... thanks a lot... our professors need to learn from you how to teach.

Hi, I create a 10-filter bank. Now, how do I compress these down to 256 elements?

About the DCT: I have read that it makes the transformation back to the time domain. Why is that? You are saying that it is used for decorrelation. What is its purpose at all? Could you explain more or post a link? Thanks a lot.

(reply): I found a great answer about the DCT part in a thesis, page 62. The 26 bins are called MFSC (spectral coefficients). The 13 values after the DCT are the MFCC. The thesis I linked explains why the DCT is used to improve the results of Gaussian models, and why MFSC are better for neural network based models.

Abhijeet Singh: Hi, I loved this tutorial explaining MFCC. In your Python implementation of MFCC, does the mfcc feature represent the energy of the frame? I have read in literature and books that the first coefficient of MFCC represents the energy of the frame. I extracted MFCCs from your implemented code but it does not seem to give the energy or log energy of the frame. So, I wanted to know how your implementation (Python) takes this into consideration. Or do I have to append the energy of the frame as a feature in my code to the features extracted from your implementation? Thank you.

James Lyons (reply): see line 35 of base.py (https://github.com/jameslyons...). It is where the first MFCC coefficient is replaced with the log of the frame energy.

Abhijeet Singh (reply): Thank you.. didn't see it initially. Thanks.
James Lyons (reply): no problem :) happy to help.

Hi, I found the 26 MFCC + delta + delta-delta for each frame in an audio signal (in MATLAB). I have 8 signals and I found the same for all the signals. Now in my application, if the speaker says one of those 8 words it should be able to identify that the speaker said one of those 8 words. My problem is how can I do that? I'm stuck. What am I supposed to do with these features? Someone please help. Thank you!

(follow-up): I tried converting each signal matrix [rows x 26] into one row matrix [1 x columns], so for 8 signals I was left with [8 x columns]. Then I used knnclassify(myvoiceinput, training, group), where training is that [8 x columns] matrix and group is an [8 x 1] matrix indicating the labels for each signal, but I'm getting the wrong answer. Please check the images and tell me if I got the correct plots for mfcc + delta, and also check if my code is correct. Am I doing anything wrong? Should I take more recordings of the same word, compute the MFCC of them and then use the same method? Please help.

hafida: Hello, I need MATLAB code for MFCC extraction for speech wav. Can you give it to me, please? I need your help. Thanks in advance.

James Lyons (replying to a question about the number of filters): yeah, so I guess I should have made that clearer, sorry. The picture shows 10 filters because 26 was too crowded, that's the only reason. Normally you want to use 26 filters. Ignore the 10.

Joana: This is the best detailed explanation I found of how to compute the MFCC. Thank you! (I'm sorry I didn't find it days ago, it would have spared me a lot of time...)

Can you please explain about delta and delta-delta coefficients?

Anurendra Kumar: Very nicely written.
I never had any idea about mfcc before but I could grasp most of the stuff above. Thanks.

Perfect, thank you!

In step 2, what is the value of N assuming the number of samples in a frame is 400? I got confused because later it was saying "We would generally perform a 512 point FFT and keep only the first 257 coefficients". Should N be 512 in the equations of step 2, though the number of samples is 400? Thanks in advance.

Thanks a lot.

Hi, a small doubt about the Python implementation of your code. I wanted to know about the lifter parameter you are passing in the Python implementation of mfcc extraction. By default it is 22 when the default number of mfcc features is 13. So, does the lifter parameter depend on the number of mfcc features or the number of filterbanks used? Secondly, I wanted to extract 20 mfcc features from speech, so do I need to change the lifter parameter (keeping filterbank (nfilt = 26))? Thanks in advance.

Hi, thanks for the extremely detailed, yet so simple tutorial. I landed on this page searching for a way to find different patterns in a signal which is like audio, but not an audio signal. The data is a time series of the location of an object in space, and has the object's X-Y coordinates as a time series. I would like to find the pattern of motion and when, if at all, any pattern is repeated. As per my understanding I should be able to do this by adjusting the filterbank parameters. But I am not able to figure out how to do that. Any help is greatly appreciated. Thanks once again.

"To calculate filterbank energies we multiply each filterbank with the power spectrum, then add up the coefficients". Can you please explain this part? And thank you so much for such a good presentation.

Hi James, thanks for the detailed and simplified explanation. I have a question on MFCC applications. 1. How good is MFCC for feature extraction with continuous speech in a handsfree scenario (unlike the ASR case where short sentences need to be extracted), for example recording a lecture in a classroom? In such cases where speaker levels are varying, do we get the same accuracy as at higher levels, assuming a learning algorithm is used? Is there any approach for applying MFCC to identify multiple speakers at varying levels in a handsfree scenario? 2. Any insights on the complexity of the algorithms in a real-time implementation?

Please, in step (5) what is 'k' related to? I am really confused.

James Lyons (reply): k is just an index variable, e.g.: for(k=0;k<256;k++) Hm(k) = etc.

(reply): aha, I got it, thanks for the reply.

Could anyone please tell why the Hamming window is used and not other windows for mfcc extraction?

James Lyons (reply): You can use other windows, there is no rule saying you have to use Hamming; results will be pretty much the same no matter what window is used. I encourage you to try ASR with different window functions to see the results.

Fantastic tutorial. I have a question: does it make sense to convert mfcc back to an audio signal, and how?

James Lyons (reply): It is definitely possible, but doesn't sound great. First do the IDCT to recover the log filterbank energies, take the exp to get filterbank energies, weight the filterbanks by the energies and add them up to get a smoothed power spectrum. From there you can do an IFFT to get audio back. The problem is you don't have phase information, and doing zero phase or random phase sounds pretty bad. Also the power spectrum is smoothed so a fair bit of quality is lost.
There is a MATLAB implementation of this process here called invmelfcc.m: http://labrosa.ee.columbia.edu...

James Lyons (replying to a question about using MFCCs for speaker recognition): The source information is not very useful for speaker identification. It does carry pitch information, but pitch is not the best feature. Everybody has slightly different shapes for their nasal cavity, lengths of various tubes and cavities etc., which affects resonances. These differences mean the formants occur in different spots; formants 1-3 are mostly from the shape of the lips and position of the tongue, but formants 4-6 (usually of much lower energy) can not be changed by the speaker and are conveyed by the spectral envelope (i.e. MFCCs). Different speakers usually say phonemes slightly differently as well, so even formants 1-3 will carry speaker information.

In the Fast Fourier Transform step, if the samples per frame in frame blocking is 256 samples, then in the FFT we will get 256 complex numbers. Because the FFT is symmetrical, in Mel Frequency Wrapping, do we just use 128 FFT complex numbers? If yes, then in Mel Frequency Wrapping, N = 128?

James Lyons (reply): You'll want to take the absolute value of the complex numbers, but yes, the filterbank would be made length 128 in your example. Also it is warping, not wrapping :)

In what domain is the result of the DCT?

Sami Liedes: If you want to take the log of the filterbank energies, shouldn't you rather do something like log(1 + energy)? I'm toying with your Python code, and for near-silent portions of signal logfbank() returns roughly -38, which wrecks the DCT.

James Lyons (reply): That's a problem with log; other features like PLP use a cube-root function for compressing the energies instead of log, which has much nicer behaviour for small numbers.
Though if you use cube root you can't do cepstral mean subtraction any more. In the end it is not critical exactly how it's done, as long as the recognition results are good. If you get better results with log(1+x), you should use it.

Christian: If the sample rate of the audio is 8000 Hz, do we need the DCT for speaker identification? Why? Thanks.

James Lyons (reply): Sample rate doesn't matter, mfccs can be used with any sample rate. The DCT is used to dcrelate [sic] filterbank energies.

Christian (reply): Do you mean decorrelate? So if I use any sample rate, I can use any number of filters for the filterbank, right?

James Lyons (reply): Theoretically yes. Around 20 filters for 8 kHz and 40 for 16 kHz is common, and should give good results. The exact number is not crucial and should probably be optimised for a particular problem.

What is the meaning of the resulting mel filterbank? What does the mel filterbank do to the signal? And why use the DCT after that?

James Lyons (reply): This is all explained above, see the section "Why do we do these things?"

Hi James, is the mel filterbank's amplitude always 1, or what determines the amplitude of the mel filterbank?

James Lyons (reply): On this page the amplitudes are one; in "Spoken Language Processing" by Acero and Huang they scale the amplitudes so the area under each filter is 1. It does not really matter for ASR, as long as you use the same amplitudes for training and testing. Most classifiers will work equally well with any scaling.

Copyright & Usage

Copyright James Lyons © 2009-2012. No reproduction without permission.

Questions/Feedback

Notice a problem? We'd like to fix it! Leave a comment on the page and we'll take a look.
