
RMIT University

OENG1167 Engineering Capstone Project Part A - Task 1

Project Proposal
Noise Reduction for Automotive Voice Recognition

Academic Supervisor: Dr PJ Radcliffe

Table of Contents
Noise Reduction for Automotive Voice Recognition
Table of Contents
Section 1 - Executive Summary
Section 2 - Statement of Problem
Section 3 - Literature Review
Section 3.1 - Microphone Array Beamforming
Section 3.2 - Microphone Array Hardware
Section 3.3 - Adaptive Beamforming Algorithms
Section 3.4 - Beamforming with NR Algorithms
Section 3.5 - Noise Reduction and VAD Algorithms
Section 4 - Design Questions
Section 5 - Methodology
Section 5.1 - Design Methodology
Section 5.2 - Resource Planning
Section 5.3 - Alternative Designs
Section 5.4 - Project Timeline
Section 6 - Risk Management and Ethical Considerations
Section 6.1 - Risk Assessment
Section 6.1.1 - SWOT Analysis
Strengths
Weaknesses
Opportunities
Threats
Section 6.1.2 - Risk Solution Chart
Section 6.2 - Ethical Considerations
Section 7 - References


Section 1 - Executive Summary


In this project proposal, we present an outline for the development of a real-time
embedded noise reduction system for automotive voice recognition using digital signal
processing (DSP) algorithms. The primary aim of this project is to improve the clarity of
speech in the noisy environment of an automotive vehicle.

This proposal outlines the problem, along with the requirements of our industry sponsor, Fiberdyne Systems, and their clients, SoftBank and Renesas. We also present our initial research into DSP solutions relevant to the project and then provide a plan for the design and development of the noise reduction system. The requirements will be refined during a key project meeting with the clients in April.

By the completion of this project, we aim to have a working prototype of a standalone embedded DSP hardware system that improves the signal-to-noise ratio (SNR) of the voice signal passed through the system.

Section 2 - Statement of Problem


The use of voice recognition (VR) algorithms is a growing trend in consumer electronics. A
number of technology companies have released virtual assistants such as Apple’s Siri,
Google’s Assistant, and Samsung’s Bixby. As a result, these companies have also released
hardware products to extend the usability of VR systems beyond mobile phones. For
instance, within the automotive industry VR systems are being used to provide hands-free
control of features. These include music, GPS, phone calls and messaging, helping minimize
distractions to the driver.

Given these VR systems are being used in a wider range of environments, the voice source is
often further away from the microphone. This significantly degrades speech intelligibility as
the signal is now being exposed to reverberation and background noise before reaching the
microphone input. For this reason, DSP algorithms could be implemented to clean up the signal and make voice recognition work in environments that were previously considered too noisy.

The company SoftBank has been developing a VR system based around detecting emotion in the user's voice within an automotive vehicle. Their initial testing showed satisfactory accuracy under ideal conditions (i.e. silence); however, they found that the VR system struggled to detect emotion accurately when the vehicle was being driven at speed. This is because a number of additional noise sources (such as wind, road, and engine noise) interfere with the voice signal.

As a result, SoftBank has contracted our industry sponsor, Fiberdyne Systems, to develop a noise reduction microphone system that will enhance and clean up the microphone input into the existing VR system. SoftBank suggested a beamforming mic array and machine learning as potential methods to help isolate the user's voice and minimise the ambient noise picked up by the microphone input. Audio filtering was also explored; however, as the system is based around emotion recognition, it requires significantly more bandwidth than normal voice recognition (Lech, He & Allen 2010, p. 2346). Due to this increased bandwidth requirement, SoftBank has specified that we need to maintain the full spectral source data for their algorithm to analyse. This means that additional DSP algorithms may be required to cancel out background noise, as a high-pass filter cannot be used to remove low-frequency background noise. Additionally, our sponsor requires that we use MEMS microphones; due to their unique design, some research must also be done in this area.

Our industry sponsor, Fiberdyne, has specified that we initially use the Analog Devices SHARC SC589 family of processors as a test platform for this project, before porting the system to their client's System on Chip (SoC) later on. This is because the SC589 is a capable DSP platform on which we already have experience writing software, and we have development boards containing processors from this family.


Overview of project scope:


1. Work with Fiberdyne Systems engineers to understand the requirements of
Softbank’s new voice emotion recognition system.
2. Research and evaluate existing DSP techniques and algorithms to determine whether they will be sufficient in an automotive environment.
3. Develop and simulate the chosen DSP algorithm(s).
4. Implement the chosen DSP algorithm(s) on a standalone hardware processor.
5. Present the finished system in a hardware demo to a group of Fiberdyne Systems’
engineers.

Section 3 - Literature Review


The technical aim of this project is to improve the signal-to-noise ratio (SNR) of a microphone input, where the voice is the signal and all other audio is noise. As such, we need to research and evaluate existing state-of-the-art noise reduction (NR) techniques and DSP algorithms to determine whether they will be sufficient in an automotive environment, or whether we need to modify and/or develop an NR system ourselves. This research will primarily focus on the following areas:
● Microphone array beamforming
● Noise reduction DSP algorithms
● Voice activity detection

Section 3.1 - Microphone Array Beamforming


Fundamentally, a microphone array is a hardware configuration of multiple microphones spaced at a fixed distance from each other. This results in each microphone receiving a similar signal, but offset slightly in phase. This multi-channel microphone input can then be processed in a variety of ways, as will be discussed in this literature review.
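As a simple illustration of this phase offset, the sketch below computes the inter-microphone time delay and the resulting per-frequency phase difference for a two-microphone pair. Python/NumPy is used here purely for illustration; the spacing, source angle, and frequencies are assumed example values, not project specifications.

    import numpy as np

    # Assumed example values, not project specifications
    d = 0.02                     # microphone spacing, metres
    theta = np.deg2rad(30)       # source angle relative to broadside
    c = 343.0                    # speed of sound in air, m/s

    # Extra path length to the second microphone and the resulting time delay
    tau = d * np.sin(theta) / c  # seconds

    # Phase offset this delay produces at a few speech-band frequencies
    freqs = np.array([300.0, 1000.0, 3400.0])   # Hz
    phase_offset = 2 * np.pi * freqs * tau      # radians

    print(f"delay = {tau * 1e6:.1f} us")
    for f, p in zip(freqs, phase_offset):
        print(f"{f:6.0f} Hz -> {np.rad2deg(p):6.1f} deg")

The phase offset grows linearly with frequency, which is why the array processing discussed in the following sections is frequency-dependent.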

Section 3.2 - Microphone Array Hardware


Microphones come in many forms and can be set up in different ways. Microphones have polar patterns, which determine how strongly sound from certain directions is attenuated and which depend on how the microphone is constructed. An omnidirectional microphone provides no directional rejection, whereas cardioid and supercardioid microphones have a specific directional pattern (Nave, 2018).

In Section 2, the requirement to use MEMS microphones was mentioned. As all MEMS mics have an omnidirectional pickup pattern (InvenSense 2013, p. 1), there needs to be some way of adding directionality so that sound can be attenuated based on its direction of arrival. One way of adding a weighted polar pattern to MEMS mics is to use a microphone array, which combines anywhere from two to six microphones and processes their outputs to create a weighted polar pattern. Two fundamental array types, which utilise different signal processing techniques, are explored here. Further work has also focused on using a combination of these array types to further adjust the polar pattern (InvenSense 2013, p. 10).

The simplest microphone array type is the broadside array, which places the microphones in a line perpendicular to the preferred direction of sound propagation. It then sums the microphone inputs together so that, at certain frequencies, sound originating from the sides of the array is cancelled (InvenSense 2014, p. 3). Under testing, severe spatial aliasing occurred at specific frequencies (Brandstein 2010, p. 50). Mathematically, this aliasing occurs close to the frequency at which the side nulls appear, and adding more microphones reduces the frequency of the aliasing further. Another disadvantage is that a two-microphone broadside array has to be orthogonal to the direction of sound travel, as it creates a figure-8 pattern at the target frequency.

A more advanced type of microphone array is the endfire array, which places the microphones in line with the desired direction of sound propagation. Because the microphone spacing is known, there is a known time delay between the two microphone inputs. By compensating for this time delay and then combining the inputs, the mic array can effectively cancel sound originating from behind the array, giving it a cardioid pickup pattern (InvenSense 2013, p. 5).
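To make this concrete, the sketch below evaluates the magnitude response of a two-microphone endfire pair over angle, using the delay-and-subtract ("differential") formulation commonly described for two-element arrays; the spacing and evaluation frequency are assumed values chosen only for illustration.

    import numpy as np

    # Assumed values: 21 mm spacing, response evaluated at 1 kHz
    d = 0.021                 # microphone spacing, metres
    c = 343.0                 # speed of sound, m/s
    f = 1000.0                # evaluation frequency, Hz
    w = 2 * np.pi * f

    theta = np.linspace(0, 2 * np.pi, 361)   # 0 rad = desired (front) direction

    # Propagation delay between the microphones for a plane wave from angle theta
    tau_prop = d * np.cos(theta) / c
    # Electrical delay applied to the rear microphone before combining
    tau_elec = d / c

    # Delay-and-subtract combination: a broadband null toward the rear of the array
    response = np.abs(1.0 - np.exp(-1j * w * (tau_prop + tau_elec)))
    response /= response.max()               # normalise for inspection

    print(f"front gain: {response[0]:.2f}, rear gain: {response[180]:.2f}")

Plotting response against theta on polar axes reproduces the cardioid pickup pattern described above; the key point is that the applied electrical delay matches the acoustic propagation delay, so rear-arriving sound is cancelled.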

Section 3.3 - Adaptive Beamforming Algorithms


Another method of microphone array beamforming utilises an algorithm to estimate the direction of arrival (DOA) of the voice signal based on the phase difference between the mic-array inputs (Krishnaveni, Kesavamurthy & Aparna 2013). This DOA estimate can be used to adaptively steer the beamformer in the direction of interest and reduce the effect of noise sources in other directions (Hendriks & Gerkmann 2012; Zhao et al. 2015). This would be particularly useful in an automotive environment, as we would be able to give the beamformer a narrower pickup pattern and then self-adjust the pickup direction based on who is talking in the vehicle (i.e. the driver or a passenger). Research by Zhao et al. (2015) suggests an approach that adjusts individual microphone gains in real time based on the DOA estimate to 'steer' the pickup pattern toward the location of the voice signal.
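One common way to obtain such a DOA estimate from a two-microphone pair is the generalised cross-correlation with phase transform (GCC-PHAT). The minimal sketch below, with an assumed sample rate, spacing, and a synthetic wideband test signal standing in for real speech, estimates the inter-microphone delay and converts it to an angle.

    import numpy as np

    def gcc_phat_delay(sig, ref, fs):
        # Estimate the delay of `sig` relative to `ref` using GCC-PHAT
        n = len(sig) + len(ref)
        S = np.fft.rfft(sig, n=n)
        R = np.fft.rfft(ref, n=n)
        cross = S * np.conj(R)
        cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
        cc = np.fft.irfft(cross, n=n)
        max_shift = n // 2
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (np.argmax(np.abs(cc)) - max_shift) / fs

    # Assumed test setup: 48 kHz sampling, 100 mm spacing, source 40 degrees off broadside
    fs, d, c = 48000, 0.1, 343.0
    true_delay = d * np.sin(np.deg2rad(40)) / c
    rng = np.random.default_rng(0)
    s = rng.standard_normal(fs // 10)             # wideband stand-in for speech
    x1 = s + 0.05 * rng.standard_normal(len(s))
    x2 = np.roll(s, int(round(true_delay * fs))) + 0.05 * rng.standard_normal(len(s))

    tau = gcc_phat_delay(x2, x1, fs)
    doa = np.rad2deg(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
    print(f"estimated DOA: {doa:.1f} degrees")

In the actual system, this estimate would be updated continuously so that the beamformer can follow whichever occupant is speaking, as described above.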

Beamforming can be performed in either the time or the frequency domain. Time-domain beamforming involves introducing known delays to the input signals (as in these cases the microphones are a fixed distance apart) and summing them together (Krishnaveni, Kesavamurthy & Aparna 2013, p. 5). The downside of this approach is that it is sensitive to phase mismatches, which can be overcome by performing the beamforming processing in the frequency domain (Fuster 2004, p. 10).

It is important that the direction-of-arrival algorithm used is robust. Care must be taken that the algorithms used are able to deal with a variety of errors; if they do not handle these errors well, the speech may inadvertently be cancelled out. Errors arise from a wide variety of factors, for example the impulse response of the environment changing (windows being rolled up or down) or the microphone array not being calibrated properly (Vorobyov, Gershman & Luo 2003, p. 313).

Furthermore, research by Affes and Grenier (1997) suggested that identification and matched filtering of the source-to-array impulse responses are necessary for a microphone array to further improve the intelligibility of speech by countering the effects of reverberation. This is further highlighted by Aarabi and Shi (2004), who proposed a model in which the signals received by a beamforming array are the original signal convolved with the impulse response of the environment, summed with a noise signal. It will therefore be important to consider both noise reduction and dereverberation algorithms to achieve optimal performance in the beamforming array.
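As a toy illustration of this signal model, the sketch below builds two microphone observations by convolving a stand-in source signal with crude two-tap impulse responses (a direct path plus one reflection) and adding noise; all values are placeholders rather than measured automotive responses.

    import numpy as np
    from scipy.signal import fftconvolve

    rng = np.random.default_rng(1)
    fs = 16000
    s = rng.standard_normal(fs)          # 1 second stand-in for the source signal

    # Crude impulse responses: a direct path plus a single delayed reflection
    h1 = np.zeros(400)
    h1[0], h1[250] = 1.0, 0.4            # mic 1: reflection ~15.6 ms after the direct path
    h2 = np.zeros(400)
    h2[3], h2[300] = 1.0, 0.3            # mic 2: slightly later direct path, different reflection

    # Each microphone observes the source convolved with its impulse response, plus noise
    x1 = fftconvolve(s, h1)[:len(s)] + 0.1 * rng.standard_normal(len(s))
    x2 = fftconvolve(s, h2)[:len(s)] + 0.1 * rng.standard_normal(len(s))

Noise reduction then targets the additive noise term, while dereverberation targets the convolutional term.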


Section 3.4 - Beamforming with NR Algorithms


Research by Faubel et al. (2011) identified the benefit of combining multi-microphone systems with NR algorithms, as the cross-correlation of common noise between the multiple channels can negatively impact directional localization in the microphone array. Further work by Taghizadeh et al. (2011) highlighted the usefulness of using a VAD in a multi-microphone system to remove the noise spectrum from each microphone channel before a beamforming algorithm is applied, improving localization. Furthermore, research by Tourbabin, Malka and Tzirkel-Hancock (2017) suggested that a fixed beamformer designed for a road with given properties can show reduced performance on a different road type; hence, adaptive noise estimation should be used to improve the performance of the beamforming system.

Section 3.5 - Noise Reduction and VAD Algorithms


Distant speech recognition (DSR) systems are of great interest in automotive environments, as hands-free operation is the best way to avoid distracting the driver (Faubel et al. 2011, p. 70). However, such systems operate in high-noise environments, with engine, gearbox, wind, and road noise all being picked up by the in-car microphones along with the desired voice signal, significantly degrading the recorded speech quality (Tourbabin, Malka & Tzirkel-Hancock 2017). Hence, DSR systems require a noise reduction (NR) algorithm operating in combination with a precise voice activity detector (VAD) to help isolate the speech signal (Ramírez et al. 2004, p. 271). Accurate VAD algorithms can improve the effectiveness of noise reduction algorithms by up to 45.3% by classifying periods of speech and silence in the signal (Evangelopoulos & Maragos 2006). Significant research has been committed to improving NR algorithms using a VAD, with techniques such as histogram averaging, continuous spectral subtraction (Hirsch & Ehrlicher 1995) and long-term spectral divergence (Ramírez et al. 2004).
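To illustrate how a VAD and an NR algorithm can be combined, the sketch below implements a very basic spectral subtraction driven by a frame-energy VAD in Python/SciPy. The frame length, VAD threshold, and spectral floor are illustrative placeholders rather than values taken from the cited literature, and the actual development will evaluate the more sophisticated estimators mentioned above.

    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtraction(x, fs, frame=512, hop=256, vad_margin_db=6.0):
        # STFT of the noisy input
        _, _, X = stft(x, fs=fs, nperseg=frame, noverlap=frame - hop)
        mag, phase = np.abs(X), np.angle(X)

        # Crude energy-based VAD: frames well above the noise floor count as speech
        frame_db = 10 * np.log10(np.mean(mag ** 2, axis=0) + 1e-12)
        noise_floor_db = np.percentile(frame_db, 10)
        speech = frame_db > noise_floor_db + vad_margin_db

        # Average the magnitude spectrum of non-speech frames as the noise estimate
        if np.any(~speech):
            noise_mag = np.mean(mag[:, ~speech], axis=1, keepdims=True)
        else:
            noise_mag = np.zeros((mag.shape[0], 1))

        # Subtract the noise estimate, keeping a small spectral floor to limit artefacts
        clean_mag = np.maximum(mag - noise_mag, 0.1 * noise_mag)
        _, y = istft(clean_mag * np.exp(1j * phase), fs=fs,
                     nperseg=frame, noverlap=frame - hop)
        return y[:len(x)]

For example, calling spectral_subtraction(noisy_signal, 16000) would return a de-noised time-domain signal of the same length; note that the full-bandwidth signal is preserved, which matters given the requirement in Section 2 not to high-pass filter the input.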


Section 4 - Design Questions


Through our initial research, we found extensive literature containing many potential solutions on which to base elements of our project. From this, we have identified a few key areas to direct our project design and development:

● What microphone array setup works best for the noise reduction algorithms?
● Which Direction of Arrival (DOA) algorithm works best for wideband voice
recognition?
● Which noise reduction (NR) algorithm works best in an automotive environment?
● Can a combination of beamforming, direction-of-arrival estimation, noise reduction, and dereverberation algorithms be implemented together to improve speech intelligibility?

Section 5 - Methodology

Section 5.1 - Design Methodology


This project is split into two distinct sections:
● DSP Algorithm Development
● Embedded Software and Beamforming Hardware Development
These sections can be done in parallel, as the DSP algorithm development will primarily involve MATLAB simulations in its initial stages. The embedded software development involves writing and porting software to the embedded system, to run the DSP algorithm and provide interconnectivity with the beamforming hardware. Given our respective skills, we will split the project such that Anthony will focus on the embedded software and beamforming hardware development, and Michael will focus on the DSP algorithm development. Clear specifications and requirements will need to be communicated across the team so that the two modules will be compatible.

We have decided to implement a prototyping life-cycle for both the DSP and embedded parts of the project. The prototyping model is an iterative system design model that involves creating a series of prototypes that are shown to the project stakeholders to identify and confirm that the correct features are being developed. Prototype life-cycle development starts with a simple system that conveys a single feature or concept, which then evolves to either refine a feature or develop a new section of the project, until an acceptable solution is reached (Radcliffe 2015).

Section 5.2 - Resource Planning


Given the number of stakeholders in this project, clear and frequent communication is
essential for this project to succeed. These stakeholders are outlined below:
● Softbank and Renesas - Clients
● Fiberdyne Systems - Industry Sponsor
● Dr PJ Radcliffe - Academic Supervisor
● Anthony Ashton and Michael Stekla - Student Engineers

One of the key meetings for this project is yet to occur, on 5-6 April 2018, when representatives from Renesas will fly to Melbourne to meet with us and our industry sponsor, Fiberdyne Systems. This meeting will clearly outline the project specifications and requirements, as well as provide an opportunity to discuss early concepts and potential solutions with our clients. Following this meeting, Fiberdyne will be responsible for handling all communication with the clients, SoftBank and Renesas. We plan to meet on a weekly basis with Fiberdyne and our academic supervisor, Dr PJ Radcliffe, to facilitate an ongoing discussion on the development of the project.

As we are privileged to work on an industry project, the required development hardware, test equipment and project budget are being supplied by Fiberdyne. They have a custom DSP test platform and hardware processors available for us to use in development. A microphone array header for a Raspberry Pi was ordered at the beginning of the semester to begin prototyping the system, and we will also require components to construct a MEMS microphone array on the Analog Devices hardware. If anything additional is needed, it will be ordered through Fiberdyne. There is also an RMIT budget that may be used if anything extra is required.


Test and development equipment for this project includes laptops, audio interfaces, speakers, signal generators, and oscilloscopes, which have also been provided by Fiberdyne. If this equipment is unavailable (for example, if it is being used by other Fiberdyne engineers on a different project), the project is portable enough to be worked on at RMIT with RMIT equipment. As this is an audio project dealing with frequencies in the audible range, standard test equipment may be used.

Section 5.3 - Alternative Designs


Instead of using an array of MEMS microphones, regular condenser mics could be used. This would likely be bulkier, as condenser mics cannot be soldered directly onto a PCB the way MEMS mics can, and would instead require an enclosure to be built.

Several alternative embedded hardware platforms could be used. Texas Instruments DSPs are popular and provide a similar environment and processing capabilities. However, capable Analog Devices development boards are already in use, so porting the existing software to a TI platform would take considerable time.

An alternative design would be to write the software directly on the SoC that the system is planned to be ported to, but this does not have JTAG line-by-line debugging, so development would take much longer. If JTAG debugging on the SoC were made available, it may be a good idea to develop on that platform instead, as the software would then not need to be ported from the Analog Devices processors to the SoC.


Section 5.4 - Project Timeline


Section 6 - Risk Management and Ethical Considerations

Section 6.1 - Risk Assessment


While this is primarily a software project, there are still tangible health and safety risks, as there are environmental risks associated with using the prototype in an automotive environment. There are also project risks, as the industry sponsor is working with a third party that may stop supporting the project.

Section 6.1.1 - SWOT Analysis

Strengths
● Previous experience in developing similar projects.
● Familiarity with the development environment.
● Previous work has been done on this platform (architectural knowledge).
● Multi-part solution with many independent modules to show progress.
● DSP algorithm based on proven techniques.
● Software development already underway.

Weaknesses
● Limited time to develop.
● Software development can take unpredictable lengths of time.
● Will need to work on a wideband input (cannot high-pass filter the input).
● Wideband noise reduction for emotive voice recognition is a relatively unexplored area.
● Creating a test setup may be difficult due to the number of microphones, array placement, and the automotive environment.


Opportunities
● Current lack of solutions for wideband automotive noise reduction.
● This sort of system could be applied to other microphone inputs, for example Bluetooth noise reduction.

Threats
● A possibility of the client losing interest or funding for the project.
● Development boards potentially not being available at certain times if another industry-sponsored group uses them or they are required by the sponsor.
● Other solutions to noise reduction beating this one to market.

Section 6.1.2 - Risk Solution Chart

Risk Scenario: Opening a window/door/boot throws off the beamforming.
Proposed Solution: Reduce reliance on beamforming for the noise control algorithm.

Risk Scenario: Playing music causes voice recognition to malfunction.
Proposed Solution: Create a series of tests to ensure that the noise reduction attenuates the music sufficiently.

Risk Scenario: Emotive voice recognition fails because the noise control leaves the frequency response uneven.
Proposed Solution: Create tests that ensure the noise reduction solution does not alter the speech input too much.

Risk Scenario: Voice recognition failing results in a vehicle issue or crash.
Proposed Solution: Make it clear that this system is only to be used in non-critical systems, as voice recognition is not perfect.

Section 6.2 - Ethical Considerations


An important part of ethics in engineering is making sure that the project's goals and scope are not misrepresented. More specifically, it is important to make clear that the goal of this project is not to create a voice recognition algorithm, but to improve a pre-existing voice recognition algorithm by using noise reduction and beamforming DSP techniques.

It is also important to ensure that the noise reduction system being developed is not used to dictate safety-critical decisions for the car, because a voice recognition system is naturally prone to error.

The end user's data must also be accounted for. Considering that this system feeds a voice detection algorithm, user data (the user's speech) must be protected. As this is an offline algorithm, this is not an issue in its current state. In the future, after the prototype stage, if the algorithm is used by another party as part of a 'cloud' service (in other words, the DSP is performed on a server to which the user's device sends the audio data), it will be their responsibility to ensure their system is secure against attackers.


Section 7 - References
Affes, S. and Grenier, Y. (1997). A signal subspace tracking algorithm for microphone array
processing of speech. IEEE Transactions on Speech and Audio Processing, 5(5), pp.425-437.
Aarabi, P. and Shi, G. (2004). Phase-Based Dual-Microphone Robust Speech Enhancement.
IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 34(4),
pp.1763-1773.

Brandstein, M. (2010). Microphone arrays. Berlin: Springer Berlin, pp.54-55.

Chu, P. (n.d.). Superdirective microphone array for a set-top videoconferencing system. Proceedings of 1997 Workshop on Applications of Signal Processing to Audio and Acoustics.

Evangelopoulos, G. and Maragos, P. (2006). Multiband Modulation Energy Tracking for Noisy
Speech Detection. IEEE Transactions on Audio, Speech and Language Processing, 14(6),
pp.2024-2038.

Faubel, F., Georges, M., Kumatani, K., Bruhn, A. and Klakow, D. (2011). Improving hands-free
speech recognition in a car through audio-visual voice activity detection. 2011 Joint
Workshop on Hands-free Speech Communication and Microphone Arrays.

Fuster, J.J. (2004). A Hardware Architecture for Real-Time Beamforming. Master's Thesis. University of Florida, FL, USA.

Hendriks, R. and Gerkmann, T. (2012). Noise Correlation Matrix Estimation for Multi-Microphone Speech Enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), pp.223-233.

Hirsch, H. and Ehrlicher, C. (1995). Noise estimation techniques for robust speech
recognition. 1995 International Conference on Acoustics, Speech, and Signal Processing.

InvenSense (2018). Application Note AN-1140. [online] Available at: https://www.invensense.com/wp-content/uploads/2015/02/Microphone-Array-Beamforming.pdf [Accessed 23 Mar. 2018].

Lech, M., He, L. and Allen, N. (2010). On the Importance of Glottal Flow Spectral Energy for the Recognition of Emotions in Speech. In: INTERSPEECH 2010, Makuhari, Chiba, Japan, 26-30 September 2010, pp.2346-2349.


Nave, R. (2018). Microphones. [online] Hyperphysics.phy-astr.gsu.edu. Available at: http://hyperphysics.phy-astr.gsu.edu/hbase/Audio/mic3.html#c2 [Accessed 22 Mar. 2018].

Radcliffe, P.J. (2015). Engineering Design 1. RMIT University, Melbourne.

Ramírez, J., Segura, J., Benítez, C., de la Torre, Á. and Rubio, A. (2004). Efficient voice activity detection algorithms using long-term speech information. Speech Communication, 42(3-4), pp.271-287.

Vorobyov, S.A., Gershman, A.B. and Luo, Z.-Q. (2003). Robust Adaptive Beamforming Using Worst-Case Performance Optimization: A Solution to the Signal Mismatch Problem. IEEE Transactions on Signal Processing, 51(2), pp.313-324. Available at: http://www.ece.ualberta.ca/~vorobyov/RobBeamformer.pdf [Accessed 15 March 2018].

Zhao, S., Xiao, X. et al. (2015). Robust Speech Recognition Using Beamforming with Adaptive Microphone Gains and Multichannel Noise Reduction. In: 2015 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, AZ, USA, pp.460-467.

Taghizadeh, M., Garner, P., Bourlard, H., Abutalebi, H. and Asaei, A. (2011). An integrated
framework for multi-channel multi-source localization and voice activity detection. 2011
Joint Workshop on Hands-free Speech Communication and Microphone Arrays.

Tourbabin, V., Malka, I. and Tzirkel-Hancock, E. (2017). Performance of fixed in-car microphone array beamformer under variations in car noise. 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA).

Krishnaveni, V., Kesavamurthy, T. and Aparna, B. (2013). Beamforming for Direction-of-Arrival (DOA) Estimation: A Survey. International Journal of Computer Applications, 61(11), January 2013. Available at: http://research.ijcaonline.org/volume61/number11/pxc3884758.pdf [Accessed 15 March 2018].
