Speech Enhancement Temporal Convolutional Neural Network

Temporal Convolutional Neural Network implementation and comparison with other traditional techniques for single-channel speech enhancement

Uploaded by

Hardey Pandya
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
88 views37 pages

Speech Enhancement Temporal Convolutional Neural Network

Temporal Convolutional Neural Network Implementation and Comparision with other traditional techniques for single channel speech enhancement

Uploaded by

Hardey Pandya
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 37
Contents

Abstract
1 Introduction
  1.1 Problem statement
  1.2 Motivation
  1.3 Literature review
2 Experimental Methods and Results
  2.1 Algorithms used for denoising
    2.1.1 Compressed Sensing
    2.1.2 LSTM
    2.1.3 TCNN
    2.1.4 Comparison (Testing)
3 Multichannel Speech Enhancement and Dereverberation
  3.1 Multi-channel speech enhancement
  3.2 Framework of the speech enhancement device
    3.2.1 Components of Framework
    3.2.2 Hardware availability for different algorithms
4 Discussion and conclusion
  4.1 Future Directions
  4.2 Conclusion
Weekly Progress Report
Bibliography

Chapter 1
Introduction

Speech enhancement has been a challenging problem in signal processing for decades. Its main aim is to obtain clean speech by removing noise from a corrupted signal. Noise can be of different types and at different SNRs (dB). Speech enhancement has a wide variety of applications, ranging from telecommunications to hearing aids, cochlear implants and more [1].

We are also interested in how the speech signal is perceived through human ears. Enhancement models of this type are called auditory models; they take human biology into account and derive speech processing parameters and measures from it. Introducing such biologically informed signal processing measures falls under Computational Auditory Scene Analysis (CASA) [2, 3, 4, 5, 6].

Speech enhancement includes denoising as well as dereverberation. The first type of clean-speech distortion is due to additive noise: clean speech is corrupted by noise whose magnitude is added samplewise to the speech samples, hence the name additive noise. The goal of speech denoising is to reconstruct the clean speech signal from the input noisy speech.

The second type of clean-speech distortion is reverberation, a convolutive distortion of the speech signal on its way to the microphone: the speech signal is convolved with the ambient or channel impulse response. The objective is to recover the original speech without a priori information about the channel or environment through which the speech was collected or recorded. The goal of dereverberation is to recover clean speech from the input reverberated speech. A mathematical formulation for reverberated and noisy multi-channel speech is given in the last chapter, but in this work we study and perform experiments only for single-channel speech corrupted by additive noise.

It is also important to note that once an appropriate model is chosen for implementation on embedded hardware for a hearing aid application, that embedded hardware should be properly designed as a control system. Speech enhancement for a hearing aid must run in real time, so the denoising process should take very little time, and the memory budget on an embedded hearing aid device is tight. We therefore need to choose an algorithm that runs in real time, stays within the memory constraints of the design, and still gives robust performance.

Applications of speech enhancement techniques include:
1. Digital hearing aids (the target of this project).
2. Cochlear implants (also targeted in this project using auditory models, after the basic model implementation).
3. As a speech-coding preprocessor in cellular phones, to reduce noise for voice calls.
4. Military communication systems, where enhancing the intelligibility of speech is more effective than improving its quality.
5. Air-to-ground communication: when a pilot talks with a ground station, the speech is corrupted by the very high level of cockpit noise in the aeroplane.
6. Teleconferencing systems, where noise sources present in one location are broadcast to all other locations. The problem has become frequent in the Covid-19 scenario, when most activities are carried out online via teleconferencing.

While our ultimate goal is to explore techniques based on auditory models and to propose enhancement that is robust both by auditory measures and qualitatively, we first want to study some supervised denoising models thoroughly. Traditionally, speech enhancement was done using unsupervised methods; however, supervised methods have been studied extensively in recent years with the advent of neural networks [19]. We are interested in studying why some models give better performance than their counterparts. There are many studies of neural networks suited to speech signals, such as RNN, LSTM, ResLSTM, TCNN, SEGAN and CNN [7, 8, 9, 10, 11, 12, 13, 14]. In addition, two-stage speech enhancement techniques [15, 16] have been proposed; they give very good results, but their computational latency is often higher than that of single-stage speech enhancement algorithms.

1.1 Problem statement

An ML model can be trained and developed for real-time speech enhancement using different sets of algorithms and used in digital hearing aid devices. The aim is to develop an efficient real-time algorithm that can be deployed on an embedded system or DSP for applications in digital hearing aids.

Our goal is to study algorithms for speech enhancement with single-channel and multi-channel noisy inputs. We study various techniques and outline their advantages and disadvantages. We also perform experiments for single-channel speech enhancement by training deep neural networks, LSTM and TCNN, as described in later sections.

Industry solutions for on-chip neural-network-based hearing aids are still very rare because of high power consumption. To date, to the best of our knowledge, only one digital hearing aid implements a deep neural network on a low-power edge device; it is made by the Denmark-based hearing assistive device company Oticon [17]. There have been previous attempts to deploy trained neural networks for speech enhancement on low-power, low-memory hardware; for example, [18, 19] are proposed for use in voice assistive devices. However, hearing aids must function on even less battery power than voice assistive devices.

The field concerned with implementing deep neural network models on embedded, edge devices is called TinyML, also known as EdgeAI. The primary goal of Edge and TinyML is to quantize the trained neural network parameters; this is called neural network compression. A compressed neural network can then be deployed on a chosen low-power embedded hardware device. Neural-network-based speech enhancement on chip is not yet in commercial use, so it remains a wide-open area of research.

1.2 Motivation

A lot of research has been done on denoising and enhancement of recorded speech, but there has been very little success so far in enhancing speech in real time. This motivates our project: an attempt to enhance the speech signal in real time so that it can be used in hearing aid devices.

1.3 Literature review

A wide variety of literature is available on speech enhancement. The most classic is the textbook covering all the basics of the field [1]. We are interested in both supervised and unsupervised algorithms. Unsupervised algorithms do not require training data for training the model.
Instead, they exploit some inherent properties of the noise contained within the input signal and try to remove it to produce a clean signal. While supervised algorithms can be trained using concrete data of clean and noisy speech and are expected to give better results than unsupervised approaches [20, 21], it is also true that if a neural network encounters a never-seen-before acoustic noise, or complex situations such as mixed speech separation, binaural speech enhancement or multi-channel speech enhancement, it might not yield better results. In [22], the authors have nearly achieved state-of-the-art performance with unsupervised methods.

Figure 1.1: Hearing Aid Research Directions (algorithmic part and hardware part; Step 1: compress the trained neural network; Step 2: design an Application Specific Integrated Chip, prototyped using an FPGA; Step 3: deploy the compressed neural network on the designed chip).

Many studies still appear that build on and modify traditional methods, because these can be implemented on a power-efficient Application Specific Integrated Circuit (ASIC) more easily than a neural network. For this reason, we also check the performance of the Wiener filter alongside the neural networks. There is another method which is computationally effective and gives better performance than the Wiener filter, namely compressed-sensing-based speech enhancement; it exploits the fact that noise cannot be represented sparsely in any domain [23].

Although speech recognition is a slightly different topic and its performance metrics are very different from those of enhancement, it is also worth reviewing the competitions held for speech enhancement and recognition [24]. Submissions in these competitions often use hybrid and even novel approaches, which are interesting to explore; it is also important to understand why such results are obtained, rather than treating the systems as mere black boxes.

Many works have already compared LSTM and TCNN for sequence modelling, providing theoretical reasons as well [25]. Inspired by that, we would like to explore several algorithms on one particular dataset in order to achieve a uniformity that is lacking despite the large amount of research in the area. We want to check the performance of already existing algorithms. In the literature there is some lack of uniformity, in the sense that different algorithms have been evaluated on different datasets, so it would be better to compare them on the same dataset. If we can implement both deep-learning-based models and traditional unsupervised approaches, it may be possible to compare all those algorithms in a unified manner.

Regarding spectral-masking-based speech enhancement, Wang et al. [26] have shown that estimating ideal ratio masks (IRMs) using feed-forward deep neural networks (DNNs) outperforms a direct spectral mapping from noisy to clean speech. According to [27], the mask-based approach can be further improved by employing a masked spectrum approximation (MSA) loss, optimizing the mask estimation task in the speech spectral domain, as opposed to optimization in the mask domain with a mask approximation (MA) loss. These studies focus on enhancing the magnitude spectrum using real-valued masks, but more recent works suggest that it can be advantageous for NN-based speech enhancement to combine clean magnitude and phase estimation [7].
Williamson et al. [28] estimate the real and imaginary parts of a complex ratio mask (cRM) and report improvements over IRM targets, because this method gives better performance in both noisy and reverberant scenarios.

Finally, it is important to appreciate the difference between the intelligibility and the quality of speech. Quality is subjective to listeners: it is difficult to define and no reliable measure exists for it, because different listeners perceive it differently. Intelligibility is not subjective; many reliable measures are available to quantify it. Roughly, it is the fraction of spoken words or phrases that can be identified correctly. Speech can have high quality yet low intelligibility and, conversely, speech can be highly intelligible yet of poor quality. Speech enhancement aims to improve both quality and intelligibility.

Chapter 2
Experimental Methods and Results

In this chapter the methods implemented in the project are described. For each method, the algorithms, flowcharts, datasets and results are included.

2.1 Algorithms used for denoising

2.1.1 Compressed Sensing

Sampling random projections of a signal below the Nyquist rate, and still being able to reconstruct the signal approximately very well from the captured random projections, is known as compressed sensing. It is a technique for efficiently acquiring and reconstructing a signal by finding solutions to underdetermined linear systems. A signal can be compressed if it can be represented as a sparse vector in some domain, such as the Discrete Cosine Transform (DCT), the inverse DCT or the Fast Fourier Transform (FFT).

In compressed sensing, the input signal x is represented as a sparse vector with very few non-zero amplitude values. Compressed sensing is possible only if x is sparse in some basis; in practice x is not sparse in the time domain, so it is represented in a frequency domain in which it is sparse. Exploiting this, noise can be separated very robustly. We demonstrate that compressed sensing can outperform algorithms such as the Wiener filter; it would also be interesting to compare it against stronger, modified versions of the Wiener filter and other statistics-based approaches. It is intuitive that compressed sensing can be a good candidate for denoising tasks, because noise cannot be represented sparsely in any domain.

The solution of the underdetermined equation y = A * x was obtained using three different methods: L1 optimization, L2 optimization, and Matching Pursuit (MP). There are two conditions under which reconstruction of the compressed signal is possible. The first is sparsity: the signal must be sparse in some domain. The second is incoherence and the restricted isometry property (RIP) between the measurement matrix and the orthonormal basis.

Algorithm 1 Compressed Sensing using the L1 and L2 norms
 1: Read the audio file as x
 2: n <- len(x)
 3: m <- ceil(n/10)
 4: k <- randperm(n)
 5: k <- sort(k(1 : m))
 6: b <- x(k)
 7: A <- zeros(m, n)
 8: for i <- 1 to m do
 9:     ek <- zeros(1, n)
10:     ek(k(i)) <- 1
11:     A(i, :) <- idct(ek)
12: end for
13: y1 <- pinv(A) * b                        // L2 solution
14: y2 <- l1eq_pd(y1, A, A', b, 5e-3, 32)    // L1 solution (primal-dual solver, available as a MATLAB function)

Figure 2.1: Compressed Sensing algorithm.
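To make the reconstruction step above concrete outside MATLAB, the following is a minimal Python sketch of the same idea: a signal that is sparse in the DCT domain is sampled at random positions, and it is reconstructed both with the pseudo-inverse (the L2 solution) and with a simple greedy matching-pursuit loop standing in for the L1 solver. The toy signal, measurement ratio and iteration count are illustrative assumptions, not the exact settings used in the experiments reported below.

import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(0)

# Build a toy signal that is exactly sparse in the DCT domain.
n = 1024
s_true = np.zeros(n)
s_true[[20, 55, 300]] = [1.0, -0.7, 0.4]
x = idct(s_true, norm="ortho")

# Keep m = n/10 randomly chosen samples of x as the measurements b.
m = n // 10
keep = np.sort(rng.permutation(n)[:m])
b = x[keep]

# Measurement matrix A: rows of the inverse-DCT basis restricted to the kept samples,
# so that b = A @ s, where s is the (sparse) DCT-domain representation of x.
A = idct(np.eye(n), norm="ortho", axis=0)[keep, :]

# L2 (minimum-energy) solution via the pseudo-inverse: generally not sparse.
s_l2 = np.linalg.pinv(A) @ b

# Greedy matching pursuit as a stand-in for the L1 solver used in the report.
def matching_pursuit(A, b, n_iter=50):
    s = np.zeros(A.shape[1])
    r = b.astype(float).copy()                     # residual
    col_norm = np.linalg.norm(A, axis=0) + 1e-12
    for _ in range(n_iter):
        j = np.argmax(np.abs(A.T @ r) / col_norm)  # most correlated atom
        coef = (A[:, j] @ r) / (A[:, j] @ A[:, j])
        s[j] += coef
        r = r - coef * A[:, j]
    return s

s_mp = matching_pursuit(A, b)

# Back to the time domain, then compare reconstruction errors.
x_l2 = idct(s_l2, norm="ortho")
x_mp = idct(s_mp, norm="ortho")
print("relative error, L2 solution:     ", np.linalg.norm(x - x_l2) / np.linalg.norm(x))
print("relative error, matching pursuit:", np.linalg.norm(x - x_mp) / np.linalg.norm(x))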
Results:

Figure 2.2: Input signal (f = signal, b = random samples, c = idct(f)).
Figure 2.3: Difference between the results obtained with L1 and L2 optimization (x = L1 solution of Ax = b, y = L2 solution of Ay = b).
Figure 2.4: Matching Pursuit implemented on two sound signals. The top panel shows the sparse representation of the Gong sound and the bottom panel shows the sparse representation of the Handel sound.

2.1.2 LSTM

In machine learning terms, speech enhancement is a regression problem whose goal is to remove noise from an input speech signal. The problem is complex because of the ample variety of disturbances that might corrupt speech signals.

One of the methods used in the project is spectral masking. In masking approaches, rather than estimating the enhanced signal directly, a soft mask is estimated, and the enhanced signal is then obtained by multiplying the noisy one by the soft mask. There are two kinds of masking: waveform masking and spectral masking. In spectral masking, the system maps noisy spectrograms into clean ones. This mapping is generally considered easier than waveform-to-waveform mapping; however, retrieving the signal in the time domain requires adding the phase information, and the common solution (reasonable, but not ideal) is to use the phase of the noisy signal. Waveform-masking approaches do not suffer from this limitation.

Given y(f,t), an additive mixture of the clean speech x(f,t) and the noise n(f,t) in the time-frequency (TF) domain, the clean speech estimate x_hat(f,t) can be obtained from y(f,t) by spectral masking as explained above:

    x_hat(f,t) = a(f,t) * y(f,t),

where a is the spectral mask that needs to be estimated. We use the Ideal Ratio Mask (IRM) as the target:

    a(f,t) = sqrt( |x(f,t)|^2 / ( |x(f,t)|^2 + |n(f,t)|^2 ) ).

Here, we estimate the mask using a recurrent deep neural network, specifically Long Short-Term Memory (LSTM) [11]. For feature extraction we apply the Short-Time Fourier Transform to the speech signals. The DTFT is appropriate for stationary or deterministic signals and is therefore useful for short segments of speech: a speech signal is approximately stationary over short intervals of about 20-30 milliseconds. For longer durations, the Short-Time Fourier Transform (STFT) is used instead. It applies a window function over an interval in time and gives a 2D array of time and corresponding frequency content as output, called a spectrogram. The STFT of a signal x(n) is

    X(n, w) = sum_k x(k) * w(n - k) * exp(-j*w*k),

where w is the analysis window, time-reversed and shifted by n samples. Note that the STFT is a function of two variables: the discrete-time index n and the continuous frequency variable w. The STFT can be sampled in the continuous frequency variable, quite similarly to how the DTFT is sampled to obtain the DFT.
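As a concrete illustration of the masking pipeline just described, the sketch below computes an STFT with a Hamming window, forms an oracle IRM from known clean speech and noise, applies it to the noisy spectrogram, and resynthesises the waveform using the noisy phase. It relies on scipy.signal; in the actual system the mask is predicted by the LSTM rather than computed from the clean signal, and the window length, hop size and placeholder signals here are assumptions for illustration only.

import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)

# Placeholder clean speech and noise (1 s each); in practice these come from TIMIT and UrbanSound8K.
clean = rng.standard_normal(fs) * np.hanning(fs)   # stand-in for a clean utterance
noise = 0.5 * rng.standard_normal(fs)
noisy = clean + noise

win, hop = 320, 160                                # 20 ms frames with 50% overlap at 16 kHz (assumed)
_, _, X = stft(clean, fs=fs, window="hamming", nperseg=win, noverlap=win - hop)
_, _, N = stft(noise, fs=fs, window="hamming", nperseg=win, noverlap=win - hop)
_, _, Y = stft(noisy, fs=fs, window="hamming", nperseg=win, noverlap=win - hop)

# Ideal Ratio Mask: one value in [0, 1] per time-frequency bin.
irm = np.sqrt(np.abs(X) ** 2 / (np.abs(X) ** 2 + np.abs(N) ** 2 + 1e-12))

# Apply the mask to the noisy magnitude and reuse the noisy phase (the common, non-ideal choice).
S_hat = irm * np.abs(Y) * np.exp(1j * np.angle(Y))
_, enhanced = istft(S_hat, fs=fs, window="hamming", nperseg=win, noverlap=win - hop)

# Compare SNR before and after oracle masking.
L = min(len(clean), len(enhanced))
snr = lambda ref, est: 10 * np.log10(np.sum(ref ** 2) / np.sum((est - ref) ** 2))
print("noisy SNR (dB):   ", snr(clean[:L], noisy[:L]))
print("enhanced SNR (dB):", snr(clean[:L], enhanced[:L]))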
Figure 2.5: Magnitude power spectrum of the STFT.
Figure 2.6: Commonly used window functions in the discrete domain. A window function is zero outside a specific length; the specific length over which the function is non-zero is called the window length.
Figure 2.8: Spectral-masking-based speech enhancement in the transform domain (noisy spectrogram, mask estimation, masked spectrogram combined with the noisy phase, resynthesis, output).

Results:

We show the plots of validation PESQ and STOI scores of the trained LSTM networks, and compare inference PESQ and STOI scores with the TCNN in the last subsection.

Figure 2.9: Validation PESQ and STOI score versus the number of epochs.

For training, we randomly mixed the 16 kHz resampled UrbanSound8K data with the clean speech of TIMIT's 8732 utterances at SNRs of -5 dB, -4 dB, -3 dB, -2 dB, -1 dB, 0 dB and 1 dB, generating 8732 noisy utterances for training. It should be noted that the authors of [13] used a dataset of 320,000 utterances with the same number of model parameters, and used all of that data for training in a single epoch. In contrast, we used only 8732 utterances, a much smaller dataset, and still obtain competent results. In a single epoch we train on the first 900 audio files; the whole dataset of 8732 files is then shuffled randomly and the first 900 files are used again in the next epoch, and so on, up to 1100 epochs. After every 10 epochs we validate the model, computing PESQ and STOI on the 50 files following the 900 files used for training in the previous epoch. The validation PESQ and STOI as a function of the number of epochs are shown in Figure 2.9.

During the experiments we first tried mixing noise at -5 dB and 10 dB with the clean speech, and then at the SNRs from -5 dB to 1 dB listed above; no significant differences were found at inference.

Everything is first converted into spectrograms using the STFT with a Hamming window, and the noisy spectrogram is formed as y(f,n) = x(f,n) + SNR * z(f,n). The model configuration is: input size 161 (STFT magnitude bins per frame), 2 LSTM layers with 1024 units each, and output size 161 (one mask value per frequency bin). The number of parameters in an LSTM layer is roughly 4*h*(d + h + 1), where d is the layer input size and h the number of hidden units; summing over the two layers (d = 161, then d = 1024, with h = 1024) and adding the output projection gives on the order of 13-14 million parameters (approximately 13.7 million) to be estimated in order to learn the spectral mask (transfer function).
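For reference, the following is a minimal PyTorch sketch of a mask estimator with roughly this configuration: two LSTM layers of 1024 units over 161-dimensional magnitude-spectrum frames, followed by a sigmoid output layer producing 161 mask values per frame. The exact layer layout of our implementation may differ slightly; the printed parameter count comes out close to the figure quoted above.

import torch
import torch.nn as nn

class LSTMMaskEstimator(nn.Module):
    """Predicts a [0, 1] spectral mask frame by frame from noisy STFT magnitudes."""
    def __init__(self, n_freq=161, hidden=1024, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_freq, hidden_size=hidden,
                            num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):             # noisy_mag: (batch, frames, 161)
        h, _ = self.lstm(noisy_mag)
        mask = torch.sigmoid(self.out(h))     # one mask value per time-frequency bin
        return mask * noisy_mag               # masked (enhanced) magnitude

model = LSTMMaskEstimator()
n_params = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_params / 1e6:.1f} M")   # roughly 13-14 M for this configuration

# One illustrative training step with an MSE loss between masked and clean magnitudes.
noisy_mag = torch.rand(8, 100, 161)    # batch of 8 utterances, 100 frames each
clean_mag = torch.rand(8, 100, 161)
loss = nn.functional.mse_loss(model(noisy_mag), clean_mag)
loss.backward()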
2.1.3 TCNN

We have seen the speech enhancement approach in the transform domain. However, since the speech must be converted into the transform domain and an inverse transform is then needed to recover the time-domain signal, there is considerable overall computational latency. It is therefore natural to look for an end-to-end solution to the speech enhancement problem, that is, a method operating directly in the time domain. One of the first time-domain approaches was TasNet [29], but it still required some pre-processing. Motivated by this and several other lines of work, the first fully end-to-end speech separation network, Conv-TasNet [30], was proposed; it was a breakthrough in speech source separation and speech enhancement.

At the same time, it should be mentioned that although time-domain speech enhancement methods require less computational power, they also have disadvantages. The Short-Time Fourier Transform is very helpful for the CASA approach to speech enhancement; as a result, for severe hearing loss, STFT-related cochleagram methods are more useful than time-domain speech enhancement. For example, in [32] the authors studied combined time-domain and time-frequency-domain approaches for use in a TCNN. Time-frequency-domain methods can also be more important in cochlear implant applications.

In the TCNN we use dilated causal convolutions.

Figure 2.10: Dilated convolutions.

As mentioned in Chapter 1, we use time-domain speech processing instead of the STFT. The input given to the TCNN model consists of overlapping frames; frames are overlapped in order to remove boundary effects between speech segments. To convert the overlapping frames back into the original speech signal, we use the overlap-and-add method.

The model has three parts (a sketch of the TCM block appears after the algorithm listing below):
1. Encoder - consisting of 7 2D convolutional layers.
2. TCM blocks - we have 3 layers of TCM blocks in our model. Each TCM block consists of a 1D convolution, a depthwise convolution and another 1D convolution, as shown in the figure; these three operations together form a dilated convolutional block. The number of dilated blocks is set to 5, with the dilation doubling at each block, so the model can look 2^5 = 32 samples into the past before the current sample.
3. Decoder - consisting of 7 2D transposed-convolutional (deconvolutional) layers.

Figure 2.11: TCNN model architecture and TCM block [13].
Figure 2.12: TCNN model.

For time-domain speech processing, the input noisy signal y is converted into overlapping frames. Commonly an overlap of more than 40% is used; here we use 50%. Let O be the tot_f x len matrix obtained after framing, and let o_k (a vector of length len) be the k-th frame, defined as

    o_k[f] = y[(k - 1) * ovrlp + f],   f = 0, 1, ..., len - 1,

where tot_f is the number of frames, len is the frame length and ovrlp is the overlap (hop) in samples. tot_f is given by ceil(m / ovrlp), where m is the original length of the signal. If m is not a multiple of ovrlp, the signal is zero-padded to the appropriate size. We keep the frame size at 320; since 50% overlap is used, the first 160 samples of each frame are the same as the last 160 samples of the previous frame.

The encoder uses 2D convolution filters and the decoder uses transposed 2D convolution (deconvolution) filters; the 2D convolution filter size is set to 3 x 5.

Algorithm 2 Speech Enhancement using a Temporal Convolutional Neural Network
1: Load the training and validation dataset.            // There are 8732 files in the training and validation dataset.
2: Convert each utterance into frames.                  // Data preprocessing.
3: Form batches of eight framed utterances each.        // Batch size: 8; number of epochs: 600.
4: Pad each batch with zero frames to a uniform length. // To match the longest utterance in the batch.
5: Train the model epoch-wise.
6: Validate after every 10 epochs; record PESQ and STOI and save them with the checkpoints.
7: Convert the model output into a readable audio file using the overlap-and-add method.
8: Test the model and record PESQ and STOI.
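To make the TCM block described above concrete, the following PyTorch sketch implements one such block: a pointwise 1D convolution, a dilated depthwise convolution with causal (left-only) padding, and a second pointwise 1D convolution, wrapped in a residual connection, followed by a small stack with exponentially growing dilation. The channel sizes and activation are illustrative assumptions, not the exact hyperparameters of [13].

import torch
import torch.nn as nn
import torch.nn.functional as F

class TCMBlock(nn.Module):
    """One temporal convolutional module: 1x1 conv -> dilated depthwise conv -> 1x1 conv, with a residual."""
    def __init__(self, channels=256, hidden=512, kernel=3, dilation=1):
        super().__init__()
        self.pointwise_in = nn.Conv1d(channels, hidden, kernel_size=1)
        self.depthwise = nn.Conv1d(hidden, hidden, kernel_size=kernel,
                                   dilation=dilation, groups=hidden)
        self.pointwise_out = nn.Conv1d(hidden, channels, kernel_size=1)
        self.pad = (kernel - 1) * dilation          # causal: pad only on the left
        self.act = nn.PReLU()

    def forward(self, x):                           # x: (batch, channels, time)
        y = self.act(self.pointwise_in(x))
        y = F.pad(y, (self.pad, 0))                 # left padding keeps the convolution causal
        y = self.act(self.depthwise(y))
        y = self.pointwise_out(y)
        return x + y                                # residual connection

# A stack of 5 blocks with dilations 1, 2, 4, 8, 16, so the receptive field
# grows roughly as 2^5 past samples, matching the description above.
tcm_stack = nn.Sequential(*[TCMBlock(dilation=2 ** d) for d in range(5)])

frames = torch.rand(4, 256, 100)        # (batch, feature channels, time steps)
out = tcm_stack(frames)
print(out.shape)                        # torch.Size([4, 256, 100]): time length preserved, causally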
Experimental methods used: The noisy training data for the TCNN are generated exactly as described for the LSTM in Section 2.1.2: the 16 kHz resampled UrbanSound8K noises are randomly mixed with TIMIT's 8732 clean utterances at SNRs from -5 dB to 1 dB, and the same schedule of 900 training files per epoch, with validation on the following 50 files every 10 epochs, is used. As before, this is a much smaller training set than the one used by the authors of [13], yet it still gives competent results. The validation PESQ and STOI versus the number of epochs are shown in the figures below.

In addition, we tried reducing the number of parameters to check whether the model was overfitting (scaling the parameter count down roughly in proportion to the amount of data used by [13] and by us). Surprisingly, we found no significant improvement after reducing the parameters. We also tried reducing the number of encoder and decoder layers, chosen according to the size of the 2D convolution outputs; in fact the model performance worsened, so the encoder and decoder layers are important in this model. However, by changing the encoder-decoder layer structure, the authors of [33] achieved better performance than the TCNN; this is one more indication that time-frequency-domain methods cannot be ruled out.

The choice of loss function also plays an important role in the resulting performance. A comprehensive review of loss functions for single-channel monaural speech enhancement can be found in [31]; it is also argued there that the noise types included in the training noise dataset, and the amount of data relative to the number of network parameters, play an important role in training. Several loss functions have been designed specifically to maximize PESQ and STOI scores; we would like to use such loss functions in the future. Here we employ the standard mean-square error (MSE) loss over variable-length features, where the features are the samples of an audio utterance.

Below we show the validation PESQ and STOI measures for our network. Validation PESQ and STOI are calculated as the average values obtained by denoising every noisy validation utterance with our TCNN model.

For testing, the NOIZEUS corpus is used, which comprises sentences spoken by 30 speakers recorded in a clean (noiseless) environment. Different noises (train, babble, airport, exhibition hall, street, railway station, car and restaurant) are mixed into these audio files at 0, 5, 10 and 15 dB SNR. Ten clean utterances were selected, and for each a corresponding noisy version with a random noise type and random SNR was picked. Each clean/noisy pair was used to record PESQ and STOI after testing; the results are given in Tables 2.1 and 2.2.

Figure 2.13: Validation PESQ per epoch.
Figure 2.14: Validation STOI per epoch.
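The PESQ and STOI values reported in the next subsection were computed with standard reference implementations. A minimal sketch of such an evaluation in Python is shown below; it assumes the open-source pesq and pystoi packages and the soundfile library for reading audio, and the two file paths are placeholders for a clean reference and the corresponding enhanced output.

import soundfile as sf          # assumed third-party package for reading wav files
from pesq import pesq           # ITU-T P.862 implementation (pip install pesq)
from pystoi import stoi         # STOI implementation (pip install pystoi)

# Placeholder paths: substitute a clean reference and the matching enhanced output.
clean, fs = sf.read("clean_utterance.wav")
enhanced, _ = sf.read("enhanced_utterance.wav")

# Trim to a common length in case overlap-and-add changed the sample count slightly.
n = min(len(clean), len(enhanced))
clean, enhanced = clean[:n], enhanced[:n]

# PESQ: 'wb' (wide-band) mode for 16 kHz signals, 'nb' (narrow-band) for 8 kHz signals.
mode = "wb" if fs == 16000 else "nb"
pesq_score = pesq(fs, clean, enhanced, mode)

# STOI: returns a value roughly in [0, 1]; extended=False gives the classical measure.
stoi_score = stoi(clean, enhanced, fs, extended=False)

print(f"PESQ: {pesq_score:.2f}   STOI: {stoi_score:.2f}")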
2.1.4 Comparison (Testing)

                       car    babble   train
  Noisy                1.57   1.76     1.39
  TCNN                 1.98   2.03     1.67
  LSTM                 1.61   1.95     1.02
  Compressed Sensing   0.90   1.04     0.82

Table 2.1: PESQ scores, 0 dB SNR

                       car    babble   train
  Noisy                0.59   0.65     0.74
  TCNN                 0.65   0.77     0.87
  LSTM                 0.60   0.66     0.50
  Compressed Sensing   0.38   0.49     0.82

Table 2.2: STOI scores, 0 dB SNR

Chapter 3
Multichannel Speech Enhancement and Dereverberation

3.1 Multi-channel speech enhancement

Generally, hearing aid devices operate on multichannel speech signals, and they generally handle both dereverberation and additive-noise removal for those multichannel inputs. In such cases the objective can be formulated as:

    x_i(n) = q_i(n) * s(n) + f_i(n)
           = (q_i^d(n) + q_i^r(n)) * s(n) + f_i(n)
           = q_i^d(n) * s(n) + [q_i^r(n) * s(n) + f_i(n)]
           = d_i(n) + n_i(n)

Here x_i is the signal received at microphone i, q_i(n) * s(n) is the reverberant speech, f_i is the additive noise, and q_i is the room impulse response (RIR) for microphone i. The RIR can be decomposed into two parts: the direct-path impulse response q_i^d plus the reverberation-path impulse response q_i^r. The sum of the additive noise and the reverberation-path signal is the total interference n_i. The goal of simultaneous dereverberation and denoising is to obtain an estimate of the direct-path speech d_i for each channel.

As technology advances, even small, low-cost, low-power DSP processors are able to handle both additive noise and reverberation as well as multichannel speech. So single-channel speech enhancement is currently not used in digital hearing aids, although it can serve lightweight applications such as smartphone-based speech enhancement. Modern digital hearing aids include multichannel speech enhancement with both dereverberation and denoising; they also allow adjustment to the target user's auditory capabilities and can be customized accordingly.

DSP (Digital Signal Processor) based hearing aids need to be designed as embedded systems onto which the trained neural networks are deployed. System-on-chip hearing aids based on deep neural networks such as TCNN, GAN, Transformer or LSTM models are difficult to implement because of their power requirements. Even industry leaders like Starkey and Phonak have not yet fully adopted the latest DNN-based approaches; they use other feasible solutions for low-power operation. This is why efficient compression of the latest neural-network-based approaches needs to be designed.

Fixed-point arithmetic is less resource-intensive than floating-point arithmetic. The weights of a neural network trained on a general-purpose computer or in the cloud are in floating-point precision; they should be quantized to fixed-point arithmetic in order to be deployed on a power-efficient, low-cost DSP or embedded device. This is a challenging problem.

Figure 3.1: Hearing aid device (mechanical switches, battery compartment).

A lot of research has been done on the implementation of multichannel speech enhancement [35, 36]. This project dealt with enhancement of a mono-channel speech signal; to extend it to multi-channel, a few modifications would be incorporated in each part of the architecture. The processing task would be similar to colour image processing in ML: the input speech has multiple values per time instant, depending on the channel dimension. Therefore the single encoder would be replaced by multiple encoders performing 1-dimensional dilated convolution for the individual microphone sources.
The output from each encoder would be stored in separate stacks in order to keep track of the inter-channel relationships. In the TCN block, the one-dimensional convolution layer would be replaced with a two-dimensional dilated convolution layer; in this way the channel-dependent information can be analysed and processed [39].

3.2 Framework of the speech enhancement device

After designing and comparing the machine learning models, the best ML model is deployed on an embedded device to carry out the trained task: in this case, the speech enhancement device used as a hearing aid. This topic is a shift in paradigm from the concepts of machine learning and neural networks to embedded systems, digital electronics and hardware layout design. The aim is to design a framework in which the machine learning model works as an independent entity: all the libraries, weight matrices, hyperparameters, peripheral devices, storage devices and interrupt devices are readily available for real-time processing. The complete hardware was not built in practice; instead a tentative solution was constructed for the hardware requirements as a future prospect of this project. Designing the machine learning model opens up the options for deployment; in this case it was decided to deploy the model on an Application Specific Integrated Circuit (ASIC). A top-down approach was used to obtain the required components, starting from the whole hardware as a black box that takes analogue noisy sound waves as input and produces enhanced sound waves as output in real time. From this starting point, the requirements were bifurcated into the nature of the input, the characteristics of the hardware and the nature of the output.

The hardware is a combination of various digital circuits and ICs responsible for different tasks. The input signal is acoustic, so a microphone is needed to convert the sound waves into electrical signals; likewise the output is a sound wave, so a speaker converts the electrical signals back into sound. Going deeper into the hardware, the core is the processing unit, which executes the instructions. Storage devices are required to hold the instructions and data. To make the signal computable, and afterwards audible to the user, analog-to-digital and digital-to-analog converters are used. The control unit governs the data flow inside the processing unit. Serial-to-parallel and parallel-to-serial converters transform sequential data into parallel data and vice versa. All these components require a power supply, so a power unit (battery) is needed; if different devices require different voltages, an SMPS (Switched-Mode Power Supply) is inserted between the AC power supply and the circuit board, its use being subject to the power requirements. The configuration of the processing unit depends on the computational complexity, algorithmic complexity, computational latency and effective memory access time (EMAT).

Breaking this down to individual devices, each device has its own characteristics and specifications. Here it is important to know what kind of device is required, whether it is compatible with the rest of the hardware, and which version is currently available. The following subsection describes each device with its particular specifications according to the requirements. The combination of components is subject to upgrade or modification for different requirements such as power, size, heat resistance, etc.
3.2.1 Components of Framework

1. Microphone: The microphone takes an acoustic signal as input and produces an electrical signal as output; it is the entry point at which the device receives the speech signal. It receives the raw input, which is analog and continuous in nature, and its output is fed to the anti-aliasing filter. A microphone is characterized by its type, frequency response, polar response, sensitivity and maximum sound pressure level. After some research, a microphone with the following configuration was selected [37, 38]:
   Type: electret condenser microphone (ECM), because of its small size and higher accuracy.
   Frequency response: the bandwidth over which the microphone responds correctly. The bandwidth chosen is the human audible range, 20 Hz to 20,000 Hz.
   Polar response: the spatial sensitivity of the microphone. On this basis there are three types of microphone: uni-directional, omni-directional and bi-directional. An omni-directional microphone was selected because it can sense acoustic signals irrespective of the location of the sound source.
   Sensitivity: the output signal voltage level for a given sound pressure, measured in mV per pascal or dBV per pascal. The sensitivity selected is in the range of about -30 to -25 dBV/Pa.
   Maximum sound pressure level: the maximum sound pressure up to which the microphone's output remains undistorted; 120 dB SPL is the selected value for the framework.
   Considering all the specifications, the MAX4466 IC is the most suitable circuit for the microphone. In the future, however, MEMS (Micro Electro Mechanical Systems) microphones are expected to replace ECMs completely.

2. Anti-aliasing filter: The anti-aliasing filter limits the bandwidth of the signal before it reaches the analog-to-digital converter. It ensures that frequency components above the maximum frequency do not generate aliases after sampling, by nullifying the gain at frequencies above the maximum frequency; the maximum frequency is decided by the sampling rate. This is one of the most important steps in digital signal processing. As all the input utterances were resampled to 16,000 Hz, the cutoff frequency of the anti-aliasing filter is set to 8000 Hz. The cutoff could, however, be lowered to about 4000 Hz, since the maximum frequency of the speech signal of interest is 300 to 3400 Hz.

3. A/D and D/A converters: Analog-to-digital converters convert the analog signal into a digital signal. The number of quantization levels is decided according to the precision needed and a minimal quantization error, so that the signal can be reconstructed after speech enhancement. The levels are represented in bits: for n bits there are 2^n levels. In the experiment the input data were quantized to 16 bits, meaning every sample value occupies 16 bits (2 bytes) of memory; the layout therefore has a 16-bit data bus. Digital-to-analog converters convert the digital signal back into an analog continuous signal. Quantization cannot be undone, so the quantization error cannot be recovered; the analog signal is obtained by interpolation of the sampled signal through a reconstruction filter. Both devices operate at a 5 V reference voltage.

4. Serial-to-parallel and parallel-to-serial converters: These converters are present inside the DAC and ADC, but they are also used to operate on multiple values simultaneously.
The input is sequential, and to support real-time processing the data are parallelised so that they can be computed with the corresponding weights of the trained machine learning model; the reverse process occurs at the output of the ML model, where the parallel data are converted back into a sequential signal. The number of samples converted either way is 16: in one instruction cycle, 16 values are stored in the cache directly connected to the processing unit.

5. Speaker: The speaker takes the enhanced signal and converts it into an acoustic signal audible to the user. It was selected based on its size and on the number of channels of the signal; given the nature of the dataset, a mono-channel speaker was selected.

6. Processing unit: With the increasing use of TCNs over RNNs for time-sequential input in real-time speech enhancement, the demand has grown for microprocessors and microcontrollers that can perform convolution operations at high speed. Various designs have been proposed that mount accelerators on a general-purpose processor to perform convolution operations; however, the proposed systems were not able to fulfil all of the criteria, because of issues such as high power consumption, incompatibility with TCNs, or memory constraints.

A different architecture is therefore used, containing a dedicated subsystem that performs only convolution operations. It communicates with the main microcontroller through an interface to receive marshalled input data for MAC (Multiply and Accumulate) operations with weights stored in dedicated memory. This architecture is known as the NEURAghe architecture [40]. Its core component is the convolution engine. The main processor acts as a general-purpose processor/controller which parallelises the data, along with type-casting, to improve parallel computing. The convolution engine contains a square matrix of MAC units known as Sum-of-Products (SoP) units: each element of the matrix is a SoP unit, containing several digital signal processing units in parallel that perform MAC operations between sample data and the corresponding weights loaded from the weight storage space. For an n x n matrix in the engine, there are n SoPs row-wise and n column-wise. Shift-adders add the bias to the partial output obtained at each row of SoP units. Once an output is obtained in one row, it is stored in on-chip memory and loaded in the next cycle as input for the next operation, which in turn enables simultaneous accumulation.

Figure 3.2: Framework of the speech enhancement system (input, microphone, anti-aliasing filter, processing chain, output).

This architecture is at an initial stage and represents the coming generation of architectures. For this project, the plan is to understand and work on this kind of architecture, including its assembly language, peripheral organisation, feasibility and performance. From the experiments conducted, it was found that this microcontroller design would be suitable for extending the project and making the most of it.
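Before trained weights can be loaded into the fixed-point MAC array of such an accelerator, they have to be quantized from floating point, as noted in Section 3.1. The sketch below shows symmetric linear quantization of a weight matrix to 16-bit integers with a single per-tensor scale factor; the bit width, the per-tensor scaling and the random stand-in weights are illustrative assumptions, not the scheme used by any particular hearing aid chip.

import numpy as np

def quantize_symmetric(weights, n_bits=16):
    """Map float weights to signed n-bit integers plus one per-tensor scale factor."""
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 32767 for 16 bits
    scale = np.max(np.abs(weights)) / qmax        # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int16)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = (0.05 * rng.standard_normal((1024, 161))).astype(np.float32)   # stand-in for a trained weight matrix

q, scale = quantize_symmetric(w)
w_hat = dequantize(q, scale)

print("storage: %.1f kB as float32 -> %.1f kB as int16" % (w.nbytes / 1024, q.nbytes / 1024))
print("max absolute quantization error:", np.max(np.abs(w - w_hat)))

On the accelerator, the MAC units then operate on the integer weights, and the scale factor is typically folded into a final shift or multiply after accumulation.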
3.2.2 Hardware availability for different algorithms

Hearing aid devices based on neural networks are not yet available for commercial use; speech enhancement is at a very early phase of being merged with AI. A hardware implementation of the LSTM algorithm has been proposed, but it is not intended for commercial use. Since compressed sensing and the Wiener filter support low power consumption, these algorithms can easily be deployed on hardware that is readily available. For the TCNN, no specific hardware fulfilling every criterion and supporting easy mobility has been designed yet; the hardware mentioned above has drawbacks related either to TCN compatibility or to real-time processing.

Chapter 4
Discussion and conclusion

4.1 Future Directions
- Modern hearing aids require multichannel speech enhancement, so we would like to implement and compare multichannel algorithms for speech denoising.
- We would also like to build a neural network model for joint denoising and dereverberation.
- The convolution operation can be carried out efficiently on a DSP chip, so a practical and power-efficient ASIC can be designed for the hearing aid.

4.2 Conclusion
- If the dataset is small and the number of parameters to be trained is kept high, the model will overfit and give poor performance.
- Traditional statistical signal processing algorithms remain very useful where training of deep neural networks is not feasible; commercial use of deep neural networks is also hard because of their high power consumption.
- Debugging deep neural networks is hard; the black box still needs to be opened up.
- We compared the TCNN's performance with the LSTM. The TCNN does not improve the STOI and PESQ scores over the LSTM in every case, but since it requires very little pre-processing and has far fewer parameters to train, it is more suitable for the real-time application in our case.

Weekly Progress Report

Week 1 (From 21/01/22 to 27/01/22)
- Decided the project utilities: required software, tools, IDE, programming language, libraries, and the extent of the project.
- Selected the books to be referred to for specific concepts required for the project.
- Selected research papers related to speech enhancement and embedded design.
- Discussed with the faculty mentor the extent of the project and clarified doubts; Conv-TasNet and TCNN were recommended for implementation.

Week 2 (From 28/01/22 to 03/02/22)
- Decided the network architecture to be used in the TCNN.
- Selected the training dataset, namely TIMIT, and learnt the hierarchy of the dataset from its manual in order to access the files from the program code.
- Referred to the research paper on TCNN and its model to understand the process and workflow of the TCNN and how to cast the dataset into a compatible input for the model.
- Downloaded the TIMIT dataset for training the model.

Week 3 (From 04/02/22 to 10/02/22)
- Pre-processed the data by accessing the speech files (.wav) from the TIMIT dataset, resampling them to 16 kHz, and making chunks of frames of 32 milliseconds (512 samples) with 16 milliseconds (256 samples) of overlap.
- Tried to diagnose an error related to the Python library gflags; there is a version-compatibility issue, so an alternative serving the same purpose was sought.
- Referred to different sources and websites for developing the encoders/decoders to be defined and various other utilities for presenting the results graphically.

Figure 4.1: The initial stage of data pre-processing (original signal waveform and resampled signal waveform).

Week 4 (From 11/02/22 to 18/02/22)
- Decided how to mix noise with the clean speech dataset.
- Modified our previous LSTM baseline code and made the changes necessary for compatibility with the TCNN, which operates in the time domain.
- The plan is to train the TCNN this week.
Week 5 (From 18/02/22 to 25/02/22)
- Continued development of the baseline code, which is based on the LSTM algorithm in the STFT domain; it is being converted to the time domain, since our model works in the time domain.
- Modification continued by fixing tensor-incompatibility errors.
- The code is in the testing phase before the training part begins.

Week 6 (From 25/02/22 to 04/03/22)
- Trained the TCNN model successfully; checkpoints were saved correctly.
- Errors related to checkpoints, the loss function and tensor incompatibility were resolved.
- The loss function was defined according to the requirements.
- Tried to debug the code, which was raising different types of errors in the validation stage: data-type errors, index errors, float errors, torch.cuda() errors, etc.

Week 7 (From 04/03/22 to 11/03/22)
- All the logical and runtime errors were debugged.
- Trained the model for 600 epochs with batch size 8 and saved a checkpoint after every 10 epochs.
- Implemented validation after every 10 epochs.
- Obtained PESQ and STOI at each checkpoint and plotted the results.
- Tested the model with noise that was not part of the training dataset and recorded PESQ and STOI.
- Referred to the Differential Hearing Aid Speech Processing (DHASP) method, which emphasizes generalization of the hearing-impairment aid model.

Week 8 (From 11/03/22 to 18/03/22)
- Prepared the report for the 1st CP evaluation.
- Checked the overlap-and-add function in the pre-processing and post-processing stages to find the cause of the low PESQ.
- Solved the problem of getting different PESQ and STOI values in the testing phase when running the code on the same input.
- Started preparing the slides for the evaluation.
- Decided to reduce the model complexity according to the size of the dataset.

Week 9 (From 19/03/22 to 25/03/22)
- Evaluation week: prepared slides and report work. The LSTM model was trained over the dataset and PESQ and STOI were recorded.

Week 10 (From 26/03/22 to 01/04/22)
- Modified the TCNN network by reducing the encoders and decoders along with the TCM module.
- Trained the model with the same procedure as mentioned above.
- Continued training of the LSTM model.

Week 11 (From 02/04/22 to 08/04/22)
- Trained the model for a larger number of epochs (1200).
- Rectified the loss function to get the correct output.
- Trained the model with the rectified loss function.
- Improvement in the results: the maximum PESQ reached 1.95 during testing.

Week 12 (From 09/04/22 to 16/04/22)
- Finalized the model after training and testing; the denoised speech was listened to and the results were satisfactory.
- Referred to different hearing aid devices with their circuit designs and block diagrams.
- Finalized the report content for this project.
- Extension of this model to multi-channel speech enhancement is the future roadmap.

Week 13 (From 17/04/22 to 23/04/22)
- Decided the different components required to design a framework for the project.
- For each component (microphone, A/D and D/A converters, serial-to-parallel converters, processing unit, speaker and power supply), found a suitable module (IC chip) for the framework.
- Listed the specifications of each component.
- Continued updating the project report.

Bibliography

[1] P. C. Loizou, "Speech Enhancement: Theory and Practice", second edition.
[2] M. R. Saddler, A. Francl, J. Feather, X. Qian, Y. Zhang, and J. H. McDermott. Speech denoising with auditory models, 2021.
[3] F. Bao and W. H. Abdulla. A new ratio mask representation for CASA-based speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27:7-19, 2019.
[4] R. D. Patterson, K. Robinson, J. L. Holdsworth, D. A. McKeown, C. Q. Zhang, and M. Allerhand. Complex sounds and auditory images. 1992.
[5] M. H. Soni, N. Shah, and H. A. Patil. Time-frequency masking-based speech enhancement using generative adversarial network. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[6] J. Chen and D. Wang. "Long short-term memory for speaker generalization in supervised speech separation." The Journal of the Acoustical Society of America, 141(6) (2017): 4705.
[7] S. R. Park and J. Lee. A fully convolutional neural network for speech enhancement. CoRR, abs/1609.07132, 2016.
[8] S. Pascual, A. Bonafonte, and J. Serra. SEGAN: Speech enhancement generative adversarial network, 2017.
[9] H. Phan, I. V. McLoughlin, L. Pham, O. Y. Chen, P. Koch, M. De Vos, and A. Mertins. Improving GANs for speech enhancement. IEEE Signal Processing Letters.
[10] D. Wang and J. Chen. Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1702-1726, 2018.
[11] F. Weninger et al. "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR." LVA/ICA (2015).
[12] M. Lin, Y. Wang, J. Wang, J. Wang and X. Xie, "Speech enhancement method based on LSTM neural network for speech recognition," 2018 14th IEEE International Conference on Signal Processing (ICSP), 2018, pp. 245-249.
[13] A. Pandey and D. Wang, "TCNN: Temporal Convolutional Neural Network for real-time speech enhancement in the time domain," International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019.
[14] A. Pandey and D. Wang, "Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain," International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
[15] A. Li, W. Liu, X. Luo, C. Zheng, and X. Li. ICASSP 2021 deep noise suppression challenge: Decoupling magnitude and phase optimization with a two-stage deep network. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[16] M. Strake, B. Defraene, K. Fluyt, W. Tirry, and T. Fingscheidt (2019). Separated noise suppression and speech restoration: LSTM-based speech enhancement in two stages.
[17] https://www.oticon.com/solutions/more-hearing-aids
[18] H.-S. Choi, S. Park, J. H. Lee, H. Heo, D. Jeon and K. Lee, "Real-time denoising and dereverberation with Tiny Recurrent U-Net," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 5789-5793.
[19] I. Fedorov, M. Stamenovic, C. Jensen, L.-C. Yang, A. Mandell, Y. Gan, M. Mattina, and P. N. Whatmough. "TinyLSTMs: Efficient neural speech enhancement for hearing aids."
[20] K. Shimada, Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara. Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(5):960-971, 2019.
[21] X. Bie, S. Leglaive, X. Alameda-Pineda, and L. Girin. Unsupervised speech enhancement using dynamical variational auto-encoders, 2021.
[22] S. Wisdom, A. Jansen, R. J. Weiss, H. Erdogan and J. R. Hershey, "Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation," 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021, pp. 51-90.
[23] P. Sharma et al. "Unsupervised speech enhancement using compressed sensing." 2015 Twenty First National Conference on Communications (NCC), 2015.
[24] https://chimechallenge.github.io/chime6/
[25] S. Bai et al. "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling." ArXiv abs/1803.01271 (2018).
[26] Y. Wang, A. Narayanan, and D. Wang. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1849-1858, 2014.
[27] F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller. Discriminatively trained recurrent neural networks for single-channel speech separation. In 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 577-581, 2014.
[28] D. S. Williamson and D. L. Wang (2017). "Time-frequency masking in the complex domain for speech dereverberation and denoising," IEEE Trans. Audio Speech Lang. Process. 25, 1492-1501.
[29] Y. Luo and N. Mesgarani. "TasNet: Time-domain audio separation network for real-time, single-channel speech separation." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 696-700.
[30] Y. Luo and N. Mesgarani. "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (2019): 1256-1266.
[31] J. Freiwald et al. "Loss functions for deep monaural speech enhancement." 2020 International Joint Conference on Neural Networks (IJCNN), 2020: 1-7.
[32] C. Tang, C. Luo, Z. Zhao, W. Xie, and W. Zeng. "Joint time-frequency and time domain learning for speech enhancement." Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, main track, pages 3816-38.
[33] V. Kishore, N. Tiwari, and P. Paramasivam. "Improved speech enhancement using TCN with multiple encoder-decoder layers."
[34] A. Pandey, B. Xu, A. Kumar, J. Donley, P. Calamia, and D. L. Wang. "Multichannel speech enhancement without beamforming", arXiv:2110.13130v2 [cs.SD].
[35] A. Pandey, B. Xu, A. Kumar, J. Donley, P. Calamia, and D. L. Wang. "TADRN: Triple-attentive dual-recurrent network for ad-hoc array multichannel speech enhancement." arXiv:2110.11844.
[36] A. Pandey, B. Xu, A. Kumar, J. Donley, P. Calamia, and D. L. Wang. "TPARN: Triple-path attentive recurrent network for time-domain multichannel speech enhancement", arXiv:2110.10757.
[37] W. O. Brimijoin, W. M. Whitmer, D. McShefferty, and M. A. Akeroyd. The effect of hearing aid microphone mode on performance in an auditory orienting task. Ear and Hearing, September/October 2014, Volume 35, Issue 5, p. e204-e212. doi: 10.1097/AUD.0000000000000053.
[38] https://mynewmicrophone.com/top-microphone-specifications-you-need-to-understand
[39] D. Lee, S. Kim, and J.-W. Choi. "Inter-channel Conv-TasNet for multichannel speech enhancement."
[40] M. Carreras, G. Deriu, L. Raffo, L. Benini, and P. Meloni. "Optimizing Temporal Convolutional Network inference on FPGA-based accelerators", arXiv:2005.03775v1 [eess.SP], 7 May 2020.
