The Effect of Vehicle Noise On Automatic Speech Recognition Systems

Downloaded from SAE International by Vellore Inst of Technology, Thursday, December 07, 2017

The Effect of Vehicle Noise on Automatic Speech Recognition Systems
2017-01-1864
Published 06/05/2017

Joshua Wheeler
Ford Motor Company

CITATION: Wheeler, J., "The Effect of Vehicle Noise on Automatic Speech Recognition Systems," SAE Technical Paper 2017-01-1864, 2017, doi:10.4271/2017-01-1864.

Copyright 2017 SAE International

Abstract

The performance of a vehicle's Automatic Speech Recognition (ASR) system is dependent on the signal to noise ratio (SNR) in the cabin at the time a user voices their command. HVAC noise and environmental noise (such as road and wind noise) in particular provide high amplitudes of broadband frequency content that lower the SNR within the vehicle cabin and mask the user's speech. Managing this noise is key to building a vehicle that meets the customer's expectations for ASR performance. However, a speech recognition engineer is not likely to be the same person responsible for designing the tires, suspension, air ducts and vents, sound package and exterior body shape that define the amount of noise present in the cabin. If objective relationships are drawn between the vehicle level performance of the ASR system and the vehicle or system level performance of the individual noise, vibration and harshness (NVH) attributes, a partnership between the groups can be brokered: compatible targets can be set, and hardware selected that meets both groups' goals. This paper examines the NVH attributes and performance metrics that relate to vehicle level ASR performance, and finds that strong relationships and statistical trends can be drawn between the Sentence Error Rate (SER%) and standard NVH metrics for a given road surface or HVAC configuration. The paper also establishes that the Articulation Index (AI%) should be the preferred metric to relate cabin noise to ASR performance in the presence of any other kind of steady state noise.

Introduction

Automatic Speech Recognition (ASR) and Hands-Free Communication (HFC) capabilities have become prominent in the automotive industry, with over 50% of new vehicle sales equipped with some level of ASR system. With the common use of mobile personal assistants and smartphones with Bluetooth capability, customer expectations for built-in ASR and HFC systems have increased significantly. As shown schematically in Figure 1, the main systems that impact a customer's satisfaction with their system can be broken into two halves: the core automatic speech recognition performance portion of the system, and the usability & functionality portion. The success of the latter can be quantified by using the objective metrics of Word Error Rate (WER%) and Sentence Error Rate (SER%), which quantify the percentage of individual words or full commands from the user that are not successfully interpreted and executed by the ASR system. The voice recognition software/hardware and elements of the vehicle design need to be engineered in concert to ensure satisfactory performance of the core ASR system and to meet the customer's expectations. It is not enough to design these disparate factors in a vacuum, independent of their influence on the ASR system and on one another.

Figure 1. The systems and subsystems which are responsible for the customer's satisfaction with their ASR and HFC system.

The focus of this paper is the effect of background noise on the core ASR system performance, which is represented in the lower right hand corner of Figure 1.

Cabin Noise and the Masking Effect on Speech

The performance of these ASR and HFC systems is highly dependent on the level of background or masking noise that competes with the speech engine's ability to correctly convert the driver's voice to actionable commands. Environmental noises, like those generated by the road and wind, provide some level of broadband frequency content that can distort the commands issued by the driver. Road noise sound pressure levels are mostly dominated by low frequency
content that does not significantly contribute in the frequency range where the current generation of automotive speech recognition technology is focused (about 250Hz to 8kHz). However, excessive tire tread sizzle, tire cavitation noise, and transient impact sounds will cause higher frequency issues for the ASR engine. High levels of wind noise caused by aspirations or aerodynamic properties of the exterior body design will cause broadband excitation that also influences the high frequency range where engineers are trying to preserve the clarity of speech. But HVAC noise provides the greatest level of masking noise in the car in the frequency range of concern for speech intelligibility, as high volumes of air are quickly pushed through resonant ducts and distributed through narrow panel and defroster openings, generating sound as the climate-controlled air is circulated. Figure 2 shows an example of the frequency content in a voiced command spoken to the ASR system, and the competing frequency content of a common road noise and HVAC noise masking level. It is evident how the vehicle cabin noise from these sources covers up the user's speech.

Figure 2. Spectrogram showing frequency content of a voiced command from the user, a common road noise, and common HVAC noise from a sedan. The background noise (particularly the HVAC noise) almost completely masks the voiced command in the frequency range of interest.

NVH engineers work alongside design and release engineers in the body organization to design hardware that satisfies a number of different, and often opposing, attribute needs and requirements. The design of a tire will be scrutinized for how much airborne noise the tread pattern generates when moving rapidly over a coarse road surface, but will also be evaluated for how it affects the stopping distance or dynamic performance feel. The climate control ducting system will be important to an NVH engineer to ensure that unpleasant, tonal resonant frequencies are not heard when a user operates their air conditioner, but the design features that ensure quiet performance may be in opposition to another team that demands that a sufficient volume of air is distributed to cool the customer quickly. These kinds of design trade-off discussions are common in the pre-program phase, where customer satisfaction and dissatisfaction measures are used to determine the direction of the design. It is important for the ASR teams to be active players in these conversations, since their system's performance is intrinsically linked to the cabin noise levels that are byproducts of the component level performance of this hardware.

The relationships that this paper establishes help guide these discussions with two goals in mind. The first goal is to ensure that the NVH and ASR target levels are compatible. If the ASR system performance SER% targets are more aggressive than what the NVH team is proposing to deliver for environmental or HVAC noise levels, the teams need to stop and align on expectations before compatible hardware is discussed. The second goal is that when a road NVH engineer notes that a proposed tire design improves their performance to target by a certain decibel level (and customer satisfaction measure), the ASR engineer can say that the NVH improvement also improves their SER% by a similar known amount, thus also improving the ASR customer satisfaction measure. The two teams can now work together to defend the content proposal on common grounds. The following sections work to draw these comparisons and propose formulas for this common language to be used in those discussions.

Test Methodology

In order to support this investigation, data was collected on 15 vehicles from the OEM in its test lab anechoic chambers (for HVAC sources) and on its proving ground ride roads (for road and wind sources). Standard surfaces, speeds, HVAC modes and blower settings were selected so that NVH and ASR metrics would be evaluated using common conditions. The test cases discussed in this paper are the following:

Brushed concrete surface
Smooth road surface
Coarse road surface
Defrost HVAC noise
Panel vent recirculated air HVAC noise

In order to generate the objective ASR performance these different noise sources would create, the noise recordings were then mixed with clean speech utterances and processed using batch speech recognition. Figure 3 shows the process of mixing noise and speech, discussed further in an SAE paper by Huber et al. [1]. Cabin impulse response recordings from the driver's mouth to the hands-free microphone were also taken on these 15 vehicles in an anechoic chamber to support the analysis.

Figure 3. Block diagram of the Synthetic Speech Mixing process.

The potential metrics selected to quantify the NVH performance come from a set of existing acoustic and psychoacoustic measures that are standard in NVH target setting, or bandlimited sound pressure level measures that are designed to show sensitivity to speech intelligibility. The steady state noise level tested is calculated for each of these potential metrics, and compared to the ASR SER% performance:

Articulation Index (AI%)
Loudness (Sones)
20-1000Hz RMS SPL (dBA) - capturing boom and rumble regions of the frequency spectrum
630-3150Hz RMS SPL (dBA) - disregarding low frequencies but capturing more of the voice intelligibility spectrum
300-3400Hz RMS SPL (dBA) - mimicking narrowband Bluetooth voice communication frequencies
150-6800Hz RMS SPL (dBA) - mimicking wideband Bluetooth voice communication frequencies
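
As a rough illustration of the bandlimited SPL metrics listed above, the sketch below sums band levels energetically over a chosen frequency range. It assumes the cabin noise is already available as A-weighted one-third-octave band levels; the function name `bandlimited_spl`, the dictionary layout and the example levels are hypothetical and are not taken from the paper's measurement procedure.

```python
import math


def bandlimited_spl(band_levels_db, f_lo_hz, f_hi_hz):
    """Energetically sum the band levels whose center frequency lies in
    [f_lo_hz, f_hi_hz] and return the combined level in dB."""
    selected = [
        level
        for center_hz, level in band_levels_db.items()
        if f_lo_hz <= center_hz <= f_hi_hz
    ]
    if not selected:
        raise ValueError("no bands fall inside the requested range")
    # Decibel levels add on a power basis, not arithmetically.
    power_sum = sum(10.0 ** (level / 10.0) for level in selected)
    return 10.0 * math.log10(power_sum)


# Hypothetical A-weighted band levels (center frequency in Hz -> dBA).
example_levels = {500.0: 58.0, 630.0: 60.0, 800.0: 60.0, 4000.0: 55.0}

# Only the 630 Hz and 800 Hz bands fall inside the 630-3150Hz range.
voice_band = bandlimited_spl(example_levels, 630.0, 3150.0)
print(f"630-3150Hz level: {voice_band:.1f} dBA")
```

Changing the two frequency limits gives the other ranges in the list (20-1000Hz, 300-3400Hz, 150-6800Hz) from the same band data.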

It is also important to note that for the comparisons made below, the noise level is measured at the driver's outboard ear (DOE). This is a common measurement location for NVH testing, and the point in the car at which NVH targets and performance are quantified. The ASR performance uses the noise measured at the vehicle microphone location, since where that microphone is positioned can change the SNR in the vehicle as it moves closer to, or further away from, the driver's mouth. This microphone is often placed in the overhead console of the vehicle, but occasionally closer to the driver above the sun visor. This inconsistency will somewhat negatively affect the statistical correlation, but is necessary because the NVH value represented for the cabin noise level needs to share the same strategy as the NVH metric that it is being compared to.

HVAC Defroster Noise

For HVAC defroster noise, a statistical relationship can be established between Loudness or AI% data from the 15 cars at various defroster blower speed levels, and the resultant ASR performance. These are both metrics that the NVH groups are familiar with, and can use to establish performance and target links.

Figure 4. Defroster Loudness (Sones) level and Articulation Index (AI%) vs. SER% performance.

As shown in Figure 4, an exponential trend line can be used to establish the relationship between the Loudness level and the SER% performance. It is worth pointing out again that since the Loudness is calculated at the DOE microphone, but the ASR performance is determined at the vehicle microphone, poor performing samples at the high end of the scale may be due to airflow buffeting on the vehicle microphone, which the DOE performance will not reflect.

HVAC Panel Vent Noise

For HVAC panel vent noise, a statistical relationship can be established between Loudness or AI% data from the 15 cars at various blower speed levels, and the resultant ASR performance. These are both metrics that the NVH groups are familiar with, and can use to establish performance and target links.

Figure 5. Panel vent Loudness (Sones) level and Articulation Index (AI%) vs. SER% performance.

As shown in Figure 5, an exponential trend line can be used to establish the relationship between the Loudness level and the SER% performance. As with the defroster noise data, the noise levels are recorded at the DOE microphone, so issues with HVAC buffeting at the hands-free microphone are not considered.

Brushed Concrete Road/Wind Noise

For highway-like brushed concrete speeds and noise, a statistical relationship can be established between the ASR performance and bandlimited SPL. The frequency range that correlates best from those evaluated is 630-3150Hz, which contains many of the frequencies responsible for voice intelligibility. However, AI% performance also displays a good trend between data from the 15 cars at various levels of brushed concrete road/wind noise, and the resultant ASR performance. The bandlimited SPL metric was slightly better from a statistics perspective, but the goal of creating a relationship between the two attributes is accomplished with AI%.

Figure 6. Brushed concrete Articulation Index (AI%) and bandlimited SPL (630-3150Hz, dBA) vs. SER% performance.

As shown in Figure 6, an exponential or linear trend line can be used to establish the relationship between the Articulation Index (%) level and the SER% performance.

Coarse Road/Wind Noise

For city-like coarse road speeds and noise, a statistical relationship can be established between the Articulation Index (%) and the ASR performance. Bandlimited 20-1000Hz RMS SPL (dBA) also displays a good trend between data from the 15 cars at various levels of coarse road noise, and the resultant ASR performance. The AI% metric provided the best correlation from a statistics perspective, but the goal of creating a relationship between the two attributes is accomplished with the measure of bandlimited SPL.
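
The exponential and linear trend lines described in these sections can be sketched with ordinary least squares. The example below is a minimal illustration, not the author's fitting procedure: the exponential fit is done as a straight-line fit to ln(SER%), which is one common way to produce such a trend line, and the data points and function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) via a straight-line fit to ln(y); needs y > 0."""
    b, ln_a = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(ln_a), b


def r_squared(ys, predicted):
    """Coefficient of determination of the predictions against ys."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical (made-up) noise metric values and sentence error rates.
loudness_sones = [10.0, 15.0, 20.0, 25.0, 30.0]
ser_percent = [2.0, 3.5, 6.0, 10.5, 18.0]

a, b = fit_exponential(loudness_sones, ser_percent)
exp_pred = [a * math.exp(b * x) for x in loudness_sones]
slope, intercept = fit_line(loudness_sones, ser_percent)
lin_pred = [slope * x + intercept for x in loudness_sones]

print(f"exponential R^2: {r_squared(ser_percent, exp_pred):.3f}")
print(f"linear R^2:      {r_squared(ser_percent, lin_pred):.3f}")
```

Comparing the two R^2 values for the same metric is one simple way to decide which trend shape to carry into target-setting discussions. Note that fitting in the log domain weights low SER% points more heavily than a direct nonlinear fit would.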
Figure 7. Coarse road Articulation Index (AI%) and bandlimited SPL (20-1000Hz, dBA) vs. SER% performance.

As shown in Figure 7, an exponential or linear trend line can be used to establish the relationship between the bandlimited SPL level and the SER% performance.

All Steady State Noises Combined

Apart from correlating ASR performance to existing NVH metrics, it is also advantageous to establish what kind of acoustic or psychoacoustic metric best relates ASR performance to the sound level at the vehicle microphone, regardless of the source of steady state background noise encountered.

Figure 9. Articulation Index (%) vs. SER% performance for all stationary noise sources considered together.

There are non-standard noise sources that also have an effect on ASR performance and may eventually need to be considered, like rain noise, cabin exterior noise, windows down noise, etc. Establishing a metric that relates all stationary noise to the vehicle mic ASR performance will be important to understand when these sources are considered. For this evaluation, all road and HVAC sound recordings are grouped together at the vehicle-specific microphone location, instead of the DOE microphone. The results from every source, speed and surface are graphed together to evaluate which of the previously considered metrics best correlates to ASR performance.

As shown in Figure 9, of the available acoustic metrics evaluated, the Articulation Index provides the best statistical correlation when all stationary noise sources are considered at once. This should not be surprising, as the AI% metric is constructed to give a measure of the intelligibility of speech in a noisy environment.

Summary/Conclusions

In this paper, ASR performance (represented by SER%) is successfully linked to NVH metrics for major road/wind and HVAC noise levels. This establishes a groundwork from which NVH and ASR engineers can relate design characteristics and attribute performance when ensuring pre-program targets are compatible, and select hardware that will meet both teams' needs. The paper also shows that, regardless of the steady state noise source identified, a relationship can be drawn between the Articulation Index (%) metric and ASR performance at the vehicle microphone. Therefore the effect of any noise source outside of the standard NVH metrics can also be considered on the grounds of a commonly used and understood psychoacoustic metric.

References

1. Huber, J., Rangarajan, R., Ji, A., Charette, F. et al., "Validation of In-Vehicle Speech Recognition Using Synthetic Mixing," SAE Int. J. Passeng. Cars - Electron. Electr. Syst. 10(1):2017,

Contact Information
Josh Wheeler
[email protected]

Definitions/Abbreviations

ASR - Automatic speech recognition
DOE - Driver's outboard ear
HFC - Hands free calling
HVAC - Heating, ventilation, air conditioning
NVH - Noise, vibration, and harshness
SER - Sentence error rate
SNR - Signal to noise ratio
SPL - Sound pressure level
WER - Word error rate
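
To make the WER and SER abbreviations concrete, the sketch below computes both from pairs of reference and recognized commands. It is a simplified illustration under stated assumptions: word errors are counted with a standard word-level edit distance, and a sentence counts as an error if its transcript differs at all, whereas the paper's SER% also reflects whether the command was executed. The function names and example commands are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in recognizer output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


def error_rates(pairs):
    """Return (WER%, SER%) over a list of (reference, hypothesis) pairs."""
    total_errors = total_words = wrong_sentences = 0
    for reference, hypothesis in pairs:
        errors, n_words = word_errors(reference, hypothesis)
        total_errors += errors
        total_words += n_words
        if errors:
            wrong_sentences += 1
    return (100.0 * total_errors / total_words,
            100.0 * wrong_sentences / len(pairs))


# Hypothetical commands: one recognized correctly, one with a wrong word.
pairs = [
    ("play radio", "play radio"),
    ("call john smith", "call jon smith"),
]
wer, ser = error_rates(pairs)
print(f"WER = {wer:.1f}%, SER = {ser:.1f}%")
```

In this toy case one of five reference words is wrong and one of two commands fails, which shows why SER% is usually the harsher of the two measures.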

The Engineering Meetings Board has approved this paper for publication. It has successfully completed SAE's peer review process under the supervision of the session organizer. The process
requires a minimum of three (3) reviews by industry experts.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or
otherwise, without the prior written permission of SAE International.

Positions and opinions advanced in this paper are those of the author(s) and not necessarily those of SAE International. The author is solely responsible for the content of the paper.

ISSN 0148-7191
