OENG1167-EB-ET-project-proposal-voice Recognition
Project Proposal
Noise Reduction for Automotive Voice Recognition
Academic Supervisor:
PRIVATE AND CONFIDENTIAL
Table of Contents
Section 1 - Executive Summary
Section 2 - Statement of Problem
Section 3 - Literature Review
Section 3.1 - Microphone Array Beamforming
Section 3.2 - Microphone Array Hardware
Section 3.3 - Adaptive Beamforming Algorithms
Section 3.4 - Beamforming with NR Algorithms
Section 3.5 - Noise Reduction and VAD Algorithms
Section 4 - Design Questions
Section 5 - Methodology
Section 5.1 - Design Methodology
Section 5.2 - Resource Planning
Section 5.3 - Alternative Designs
Section 5.4 - Project Timeline
Section 6 - Risk Management and Ethical Considerations
Section 6.1 - Risk Assessment
Section 6.1.1 - SWOT Analysis
Section 6.1.2 - Risk Solution Chart
Section 6.2 - Ethical Considerations
Section 7 - References
Section 1 - Executive Summary
This proposal will outline the problem, along with the requirements of our industry
sponsors, Fiberdyne Systems, and their clients, SoftBank and Renesas. We also present our
initial research into some DSP solutions relevant to the project, and then provide a plan for
the design and development of the noise reduction system. The requirements will be
refined during a key project meeting with the clients in April.
By the completion of this project, we aim to have a working prototype of a standalone
embedded DSP hardware system that improves the signal-to-noise ratio (SNR) of the voice
signal being passed through the system.
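The SNR improvement the prototype aims for can be quantified against a clean reference recording. The sketch below is illustrative only: the tone stand-in, sample rate, and noise levels are assumed, not taken from the project.

```python
import numpy as np

def snr_db(clean, noisy):
    """SNR of `noisy` measured against a known clean reference, in dB."""
    noise = noisy - clean
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 1000 * t)             # stand-in for a voice signal
before = voice + 0.5 * rng.standard_normal(fs)   # raw microphone input
after = voice + 0.25 * rng.standard_normal(fs)   # after hypothetical NR

# Halving the noise amplitude should improve the SNR by about 6 dB.
print(round(snr_db(voice, before), 1), round(snr_db(voice, after), 1))
```

The same reference-based measurement could later be applied to real cabin recordings taken before and after the noise reduction stage.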
Section 2 - Statement of Problem

Given that voice recognition (VR) systems are being used in a wider range of environments,
the voice source is often further away from the microphone. This significantly degrades
speech intelligibility, as the signal is exposed to reverberation and background noise before
reaching the microphone input. For this reason, DSP algorithms could be implemented to clean up the
signal and make the voice recognition work in environments that were previously
considered too noisy.
SoftBank has been developing a VR system based around detecting emotion in the user's
voice within an automotive vehicle. Their initial testing achieved satisfactory accuracy under
ideal conditions (i.e. silence); however, they found that the VR system struggled to detect
emotion accurately when the vehicle was being driven at speed. This is because a number of
additional noise sources (such as wind, road, and engine noise) interfere with the voice
signal.
As a result, SoftBank has contracted our industry sponsors, Fiberdyne Systems, to develop a
noise reduction microphone system that will enhance and clean up the microphone input
feeding the existing VR system. SoftBank suggested a beamforming mic array and machine
learning as potential methods to help isolate the user's voice and minimise the ambient
noise picked up by the microphone input. Audio filtering was also explored; however,
because the system is based around emotion recognition, it requires significantly more
bandwidth than normal voice recognition (Lech et al. 2010, p. 2346). Due to this increased
bandwidth requirement, SoftBank has specified that we maintain the full spectral source
data for their algorithm to analyse. This means that additional DSP algorithms may be
required to cancel background noise, as a high-pass filter cannot be used to remove
low-frequency background noise. Additionally, our sponsor requires that we use MEMS
microphones; due to their unique design, some research must also be done in this area.
Our industry sponsors, Fiberdyne, have specified that we initially use the Analog Devices
SHARC SC589 family of processors for this project as a test platform before porting the
software to their client's System on Chip (SoC) down the line. This is because it is a capable
DSP platform on which we already have experience writing software, and we have
development boards containing processors from this family.
Section 3 - Literature Review
In Section 2, the requirement to use MEMS microphones was mentioned. As all MEMS mics
have an omnidirectional pickup pattern (InvenSense 2013, p. 1), some means of introducing
directionality is needed so that sound can be filtered by direction. One way of adding a
weighted polar pattern to MEMS mics is a microphone array, which uses anywhere from two
to six microphones and processes their outputs to create a weighted polar pattern. Two
fundamental array types, which use different signal processing techniques, are explored
here. Further work has also focused on combining these array types to further adjust the
polar pattern (InvenSense 2013, p. 10).
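To illustrate how combining omnidirectional capsules yields a weighted polar pattern, the sketch below computes the far-field magnitude response of two summed mics over angle. The spacing is an assumed value, and the test frequency is deliberately chosen at c/2d so that the on-axis null is visible.

```python
import numpy as np

c = 343.0            # speed of sound, m/s
d = 0.02             # mic spacing, m (assumed)
f = c / (2 * d)      # frequency whose half-wavelength equals the spacing

theta = np.linspace(0, 2 * np.pi, 360, endpoint=False)
# Far-field path difference between the two mics for a source at angle
# theta, measured from the array axis.
delay = d * np.cos(theta) / c
# Normalised magnitude of the summed pair: 1 where the mics are in phase
# (broadside), 0 where the inter-mic delay is half a period (on axis).
response = np.abs(1 + np.exp(-2j * np.pi * f * delay)) / 2

print(round(response.max(), 3), round(response.min(), 3))   # → 1.0 0.0
```

Sweeping f instead of theta would show the frequency dependence of the pattern that the broadside discussion below refers to.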
The simplest microphone array type is the broadside array, which places the microphones in
a line perpendicular to the preferred direction of sound propagation. It sums the
microphone inputs together so that, at certain frequencies, sound originating from the sides
of the array cancels out (InvenSense 2014, p. 3). Under testing, severe aliasing occurred at
specific frequencies (Brandstein 2010, p. 50). Mathematically, this aliasing occurs close to
the frequency at which the side nulls appear, and adding more microphones reduces the
frequency of the aliasing further. Another disadvantage is that a two-microphone array has
to be orthogonal to the direction of sound travel, as it creates a figure-8 pattern at the
target frequency.
A more advanced type of microphone array is the endfire array, which places the
microphones in line with the desired direction of sound propagation. Because the
microphone spacing is known, there is a known time delay between the two microphone
inputs. By compensating for this time delay and then summing the inputs, the mic array can
effectively cancel sound originating from behind the array, giving the array a cardioid
pickup pattern (InvenSense 2013, p. 5).
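The delay compensation described above can be sketched numerically. This is a toy model, not the project's implementation: the spacing and sample rate are assumed (chosen so the inter-mic delay is exactly one sample), and it uses the common differential delay-and-subtract realisation of the cardioid.

```python
import numpy as np

c, d, fs = 343.0, 0.0343, 10000   # chosen so d/c is exactly one sample period
tau = int(round(d / c * fs))      # inter-mic delay in samples (= 1 here)

rng = np.random.default_rng(1)
s = rng.standard_normal(2000)     # desired sound arriving from the front
r = rng.standard_normal(2000)     # interferer arriving from behind

# Front sound hits the front mic first; rear sound hits the rear mic first.
mic_front = s + np.roll(r, tau)
mic_rear = np.roll(s, tau) + r

# Delay the rear mic by tau and subtract: the rear-arriving interferer
# cancels (np.roll makes the delay circular), while the front signal
# passes through as s[n] - s[n - 2*tau].
out = mic_front - np.roll(mic_rear, tau)

residual = out - (s - np.roll(s, 2 * tau))
print(np.max(np.abs(residual)))   # ≈ 0 (a few float ulps at most)
```

The front signal survives (with a high-pass colouration from the 2*tau self-difference), which is one reason the full-bandwidth requirement in Section 2 matters when choosing an array geometry.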
Section 3.3 - Adaptive Beamforming Algorithms
The direction of arrival (DOA) of a sound source can be estimated from the mic-array inputs
(V Krishnaveni et al. 2013). This DOA can be used to adaptively steer the beamformer in the
direction of interest and reduce the effect of noise sources in other directions (Hendriks &
Gerkmann 2011; Zhao et al. 2015). This would be particularly useful in an automotive
environment, as we would be able to give the beamformer a narrower pickup pattern and
then self-adjust the pickup direction based on who is talking in the vehicle (i.e. the driver or
the passenger). Research by Zhao et al. (2015) suggests an approach that adjusts individual
microphone gains in real time based on the DOA estimate to ‘steer’ the pickup pattern
toward the location of the voice signal.
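A minimal sketch of one way to obtain such a DOA estimate: find the lag that maximises the cross-correlation between two mics, then convert it to an angle. The spacing, sample rate, and simulated delay are assumed for illustration; practical systems tend to use more robust estimators (e.g. GCC-PHAT) for the reverberant conditions discussed below.

```python
import numpy as np

c, d, fs = 343.0, 0.1, 48000    # speed of sound, mic spacing (m), sample rate
true_lag = 7                    # simulated propagation delay in samples

rng = np.random.default_rng(2)
s = rng.standard_normal(4096)
mic1 = s + 0.1 * rng.standard_normal(4096)
mic2 = np.roll(s, true_lag) + 0.1 * rng.standard_normal(4096)

# The lag maximising the cross-correlation is the time-difference of
# arrival (TDOA) between the two microphones, in samples.
corr = np.correlate(mic2, mic1, mode="full")
lag = int(np.argmax(corr)) - (len(mic1) - 1)

# Convert the TDOA to a DOA angle relative to the array axis.
tau = lag / fs
angle = np.degrees(np.arccos(np.clip(tau * c / d, -1.0, 1.0)))
print(lag, round(angle, 1))     # → 7 60.0
```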
Beamforming can be performed in either the time or frequency domain. Time-domain
beamforming involves introducing known delays to the input signals (since the microphones
are a fixed distance apart) and summing them together (V Krishnaveni et al. 2013, p. 5). The
downside of this approach is its sensitivity to phase mismatches, which can be overcome by
performing the beamforming processing in the frequency domain (Fuster 2004, p. 10).
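One reason the frequency domain is attractive: a time delay becomes a linear phase ramp across FFT bins, so fractional (sub-sample) delays and per-frequency corrections are straightforward to express. A minimal sketch, with all parameters assumed:

```python
import numpy as np

def fractional_delay(x, delay_samples):
    """Circularly delay `x` by a possibly fractional number of samples by
    applying a linear phase ramp to each FFT bin."""
    freqs = np.fft.fftfreq(len(x))                     # cycles per sample
    phase = np.exp(-2j * np.pi * freqs * delay_samples)
    return np.real(np.fft.ifft(np.fft.fft(x) * phase))

rng = np.random.default_rng(3)
x = rng.standard_normal(1024)

# An integer delay matches np.roll exactly; unlike an integer sample
# shift in the time domain, delay_samples could just as well be 2.5.
y = fractional_delay(x, 3)
print(np.allclose(y, np.roll(x, 3)))   # → True
```

Per-bin phase corrections of this kind are also how frequency-domain beamformers compensate for the microphone phase mismatches mentioned above.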
It is important that the direction of arrival algorithm used is robust. Care must be taken
that the algorithms can deal with various errors; if they do not, the speech may
inadvertently be cancelled out. Errors arise from a wide variety of factors - for example, the
impulse response of the environment changing (windows being rolled up or down) or the
microphone array not being calibrated properly (SA Vorobyov et al. 2018, p. 313).
Furthermore, research by Affes (1997) suggested that identification and matched filtering of
source-to-array impulse responses are necessary for a microphone array to further improve
the intelligibility of speech by countering the effects of reverberation. This is further
highlighted by Aarabi (2004), who proposed a model in which each signal received by a
beamforming array is the original signal convolved with the impulse response of the
environment, summed with a noise signal. It will therefore be important to consider both
noise reduction and dereverberation algorithms to achieve optimal performance in the
beamforming array.
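The received-signal model described by Aarabi (2004) can be sketched numerically. The impulse response below is a toy stand-in (a direct path plus two discrete reflections); a real cabin response would be measured, not assumed.

```python
import numpy as np

rng = np.random.default_rng(4)
fs = 16000
s = rng.standard_normal(fs)       # stand-in for one second of dry speech

# Toy environment impulse response: direct path plus two reflections.
h = np.zeros(400)
h[0] = 1.0                        # direct path
h[120] = 0.4                      # early reflection (~7.5 ms at 16 kHz)
h[300] = 0.2                      # later reflection

n = 0.05 * rng.standard_normal(fs + len(h) - 1)

# Aarabi's model: the received signal is the source convolved with the
# environment's impulse response, plus additive noise.
x = np.convolve(s, h) + n
print(x.shape)                    # → (16399,)
```

Dereverberation aims to undo the effect of h, while noise reduction targets n; the model makes clear why both are needed together.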
Section 4 - Design Questions
● What microphone array setup works best for the noise reduction algorithms?
● Which Direction of Arrival (DOA) algorithm works best for wideband voice
recognition?
● Which noise reduction (NR) algorithm works best in an automotive environment?
● Can a combination of beamforming, direction-of-arrival, noise reduction, or
dereverberation be implemented together to improve speech intelligibility?
Section 5 - Methodology
Section 5.1 - Design Methodology

We have decided to implement a prototyping life-cycle for both the DSP and embedded
parts of the project. The prototyping model is an iterative system design model that involves
creating a series of prototypes that are shown to the project stakeholders to identify and
confirm that the correct features are being developed. Prototype life-cycle development
starts with a simple system that conveys a single feature or concept, which then evolves to
either refine a feature or develop a new section of the project until an acceptable solution is
reached (Radcliffe, 2015).
One of the key meetings for this project is yet to occur, on 5-6 April 2018, when
representatives from Renesas will fly to Melbourne to meet with us and our industry
sponsors, Fiberdyne Systems. This meeting will clearly outline the project specifications
and requirements, as well as provide an opportunity to discuss early concepts and potential
solutions with our client. Following this meeting, Fiberdyne will be responsible for handling
all communication with the clients, SoftBank and Renesas. We plan to meet on a weekly
basis with Fiberdyne and our academic supervisor, Dr PJ Radcliffe, to facilitate an ongoing
discussion on the development of the project.
Section 5.2 - Resource Planning
Test and development equipment for this project includes laptops, audio interfaces,
speakers, signal generators, and oscilloscopes; these have been provided by Fiberdyne. In
the case that this equipment is unavailable (for example, if it is being used by other
Fiberdyne engineers on a different project), the project is portable enough to be worked on
at RMIT with RMIT equipment. As this is an audio project dealing with frequencies in the
audible range, standard test equipment may be used.
Section 5.3 - Alternative Designs

Several alternative embedded hardware designs could be used. Texas Instruments DSPs are
popular and provide a similar environment and processing capabilities. However, capable
Analog Devices development boards are already in use, so porting the project to another
platform would take considerable time.
An alternative design would be to write the software directly on the SoC that it is planned
to be ported to, but this platform does not have JTAG line-by-line debugging, so
development would take much longer. If JTAG debugging on the SoC were made available, it
may be worthwhile to develop on it directly, as the software would then not need to be
ported from the Analog Devices processors to the SoC.
Section 6 - Risk Management and Ethical Considerations

Section 6.1 - Risk Assessment

Section 6.1.1 - SWOT Analysis
Strengths
● Previous experience in developing similar projects.
● Familiarity with the development environment.
● Previous work has been done on this platform (architectural knowledge).
● Multi-part solution with many independent modules to show progress.
● DSP algorithm based on proven techniques.
● Software development already underway.

Weaknesses
● Limited time to develop.
● Software development can take unpredictable lengths of time.
● Will need to work on a wideband input (cannot high-pass filter the input).
● Wideband noise reduction for emotive voice recognition is a relatively unexplored area.
● Creating a test setup may be difficult due to the number of microphones, array placement, and automotive environment.
Opportunities
● Current lack of solutions for wideband automotive noise reduction.
● This sort of system could be applied to other microphone inputs - for example, Bluetooth noise reduction.

Threats
● A possibility of the client losing interest or funding for the project.
● Development boards potentially not being available at certain times if another industry-sponsored group uses them or they are required by the sponsor.
● Another solution to noise reduction beating this one to market.
Section 6.1.2 - Risk Solution Chart

Risk: Playing music causes voice recognition to malfunction.
Mitigation: Create a series of tests to ensure that the noise reduction reduces the music volume sufficiently.

Risk: Emotive voice recognition fails because the noise control leaves the frequency response uneven.
Mitigation: Create tests that ensure the noise reduction solution does not alter the speech input too much.

Risk: Voice recognition failing results in a vehicle issue or crash.
Mitigation: Make it clear that this system is only to be used in non-critical systems, as voice recognition is not perfect.
Section 6.2 - Ethical Considerations
It is also important to ensure that the noise reduction system developed is not used to
dictate safety-critical decisions for the car, as voice recognition systems are naturally prone
to error.
The end user’s data must also be accounted for. Considering that this system feeds a voice
detection algorithm, user data (the user’s speech) must be protected. As this is an offline
algorithm, this is not an issue in its current state. If, after the prototype stage, the algorithm
is used by another party as part of a ‘cloud’ algorithm (in other words, the DSP is done on a
server to which the user’s device sends the audio data), it will be their responsibility to
ensure their system is secure against attackers.
Section 7 - References
Affes, S. and Grenier, Y. (1997). A signal subspace tracking algorithm for microphone array
processing of speech. IEEE Transactions on Speech and Audio Processing, 5(5), pp.425-437.
Aarabi, P. and Shi, G. (2004). Phase-Based Dual-Microphone Robust Speech Enhancement.
IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 34(4),
pp.1763-1773.
Evangelopoulos, G. and Maragos, P. (2006). Multiband Modulation Energy Tracking for Noisy
Speech Detection. IEEE Transactions on Audio, Speech and Language Processing, 14(6),
pp.2024-2038.
Faubel, F., Georges, M., Kumatani, K., Bruhn, A. and Klakow, D. (2011). Improving hands-free
speech recognition in a car through audio-visual voice activity detection. 2011 Joint
Workshop on Hands-free Speech Communication and Microphone Arrays.
Fuster, J. J. (2004). A Hardware Architecture for Real-Time Beamforming. Master's Thesis.
FL, USA: University of Florida.
Hirsch, H. and Ehrlicher, C. (1995). Noise estimation techniques for robust speech
recognition. 1995 International Conference on Acoustics, Speech, and Signal Processing.
Lech, M., He, L. and Allen, N. (2010). On the Importance of Glottal Flow Spectral Energy for
the Recognition of Emotions in Speech. INTERSPEECH 2010, Makuhari, Chiba, Japan, 26-30
September 2010, pp.2346-2349.
Ramírez, J., Segura, J., Benítez, C., de la Torre, Á. and Rubio, A. (2004). Efficient voice
activity detection algorithms using long-term speech information. Speech Communication,
42(3-4), pp.271-287.
Zhao, S., Xiao, X. et al. (2015). Robust Speech Recognition Using Beamforming with
Adaptive Microphone Gains and Multichannel Noise Reduction. 2015 IEEE Workshop on
Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, pp.460-467.
Taghizadeh, M., Garner, P., Bourlard, H., Abutalebi, H. and Asaei, A. (2011). An integrated
framework for multi-channel multi-source localization and voice activity detection. 2011
Joint Workshop on Hands-free Speech Communication and Microphone Arrays.