Extra Paper
Abstract
The approach to speech recognition presented in this paper is used to create a system for
automatic recognition of user commands for a graphical editor. The automatic speech
recognition system serves as the recognition module of a plug-in for the graphical editor. The
proposed system has a limited dictionary, is speaker-dependent, and recognizes isolated speech
given by the user in the form of short voice commands. The command list contains 20 commands
in the Ukrainian language, whose names correspond to the names of the pictograms in the
graphical editor. The user's voice command, transmitted through a microphone, serves as the
input; it is processed, recognized, and presented as a command for the graphical editor. The
system processes user commands in real time. Isolated command words are used as commands.
The stages of voice command recognition are as follows: analysis of the analog signal, its
transformation into a digital signal, formation of a filter bank, and comparison of the processed
command with a template. The Fourier transform is used to analyze the sound wave, and the
Hamming window function is applied to reduce spectral leakage. Features are extracted from
voice commands with the Mel-frequency cepstral coefficients (MFCC) algorithm, and voice
commands are matched against templates using the Dynamic Time Warping (DTW) method. The
use of MFCC and DTW is justified by the fact that the vocabulary is limited and the commands
are short. The accuracy of command recognition was evaluated for various speakers; the average
recognition accuracy is 93%.
Keywords
Automated speech recognition, dynamic time warping (DTW), Mel-frequency cepstral coefficients (MFCC).
1. Introduction
Speech is the most natural form of human communication, and therefore an interface based on the
analysis of speech information is a promising direction for the development of intelligent control
systems. One of the current unsolved problems in information and measurement systems is the
construction of automatic speech signal recognition systems that are invariant to the speaker [1]. Its
solution would make it possible to expand the range of users of such systems and significantly
increase the efficiency of information exchange in man-machine systems. Speech analysis covers a
wide range of tasks. Traditionally, they are divided into three subclasses: identification,
classification, and diagnostic tasks. Identification tasks include speaker verification and speaker
identification. Classification tasks include keyword recognition, continuous speech recognition, and
semantic speech analysis. Diagnostic tasks include determining the psychophysical state of the
speaker.
COLINS-2023: 7th International Conference on Computational Linguistics and Intelligent Systems, April 20–21, 2023, Lviv, Ukraine
EMAIL: [email protected] (K. Yalova); [email protected] (M. Babenko); [email protected] (K. Yashyna)
ORCID: 0000-0002-2687-5863 (K. Yalova); 0000-0003-1013-9383 (M. Babenko); 0000-0002-8817-8609 (K. Yashyna)
©️ 2023 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
Speech interfaces can be used in systems of various purposes: voice control for people with
disabilities, answering machines, automatic call processing, and implementation of "smart home"
commands. However, despite the rapidly increasing computing power, the creation of speech
recognition systems remains an extremely difficult problem. This is due to both its interdisciplinary
nature (it is necessary to have knowledge of linguistics, digital signal processing, acoustics, pattern
recognition, etc.), and the high computational complexity of the developed algorithms [2]. The latter
imposes significant limitations on automatic speech recognition systems – on the volume of the
processed dictionary, the speed of receiving an answer, and its accuracy. The task of optimizing the
quality and level of recognition in automatic speech recognition systems is a relevant scientific and
practical task, the solution of which allows improving the quality of the developed SILK (speech,
image, language, knowledge) interfaces of software systems and applications.
The purpose of the article is to present the results of applying the Dynamic Time Warping
(DTW) method, the fast Fourier transform (FFT), and Mel-frequency cepstral coefficients (MFCC)
to the problem of developing a system for automatic speech signal recognition. In the course of
achieving this goal, the authors accomplished the following tasks:
- the peculiarities of applying the DTW algorithm to speech signal recognition were analyzed;
the fast Fourier transform was applied to the analysis of the input signal, and MFCC were used to
construct the input feature vector;
- a model of an automatic speech recognition system was designed with the following
characteristics: the system is command-based and speaker-dependent, with the speaker's command
phrase as the structural unit, which sets a command for working with text within the text editor;
- a speech recognition software module for text editors was developed in the form of a plug-in,
which allows a user to evaluate the quality of the proposed solutions and establish the recognition
error.
3. Results
Speech recognition is the process of transforming a speech signal into digital information [5]. An
automatic speech recognition system is an information system that converts an incoming speech
signal into a recognized message. The message can be presented either as the text of that message
or immediately transformed into a form convenient for further processing [6].
Most automatic speech recognition (ASR) systems consist of an analog signal analysis and
processing stage and a recognition stage. During the analysis of the analog speech signal,
properties are extracted that are then used in the recognition stage to determine what was said.
Automatic speech recognition systems are classified according to the following characteristics:
dictionary size (a limited set of words or a large dictionary), speaker dependence (speaker-dependent
or speaker-independent), speech type (continuous or isolated), purpose (dictation systems, command
systems), the recognition algorithm used, the type of structural unit (phrases, words, phonemes,
etc.), and the principles of selection of structural units [7].
The general scheme of the speech signal recognition process is as follows: receiving the acoustic
signal from the user's microphone, digitizing the sound signal, obtaining the characteristics of the
signal, and comparing the feature vector of the input signal with templates.
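The analysis stage of this scheme (framing the digitized signal, applying a Hamming window to reduce spectral leakage, and taking the FFT) can be sketched as follows; the frame length, step, and FFT size are illustrative assumptions rather than values taken from the paper:

```python
import numpy as np

def frame_spectra(signal, sample_rate=44100, frame_ms=25, step_ms=10, n_fft=2048):
    """Split a digitized signal into overlapping frames, apply a Hamming
    window to each frame, and return the one-sided FFT magnitude spectra."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 25 ms -> 1102 samples
    step = int(sample_rate * step_ms / 1000)         # 10 ms -> 441 samples
    window = np.hamming(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * window
        # Zero-padded to n_fft; rfft returns n_fft // 2 + 1 frequency bins
        spectra.append(np.abs(np.fft.rfft(frame, n_fft)))
    return np.array(spectra)

# Example: 0.5 s of a synthetic 440 Hz tone sampled at 44.1 kHz
t = np.arange(0, 0.5, 1 / 44100)
spec = frame_spectra(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (48, 1025): 48 frames, 1025 frequency bins
```

Each row of `spec` is one short-time spectrum; the stacked rows form the spectrogram that is then passed through the Mel filter bank.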
After calculating the energy of the filter bank, it is necessary to calculate their logarithm, since
humans do not hear loudness on a linear scale. This operation makes Mel coefficients more similar to
human perception of sound. The last step is to calculate the Discrete Cosine Transform (DCT) of the
energy of the logarithms of the filter bank:
$$X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right]. \quad (8)$$
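The two operations described above, taking the logarithm of the filter-bank energies and applying the DCT of formula (8), can be sketched as follows; the numbers of frames, Mel filters, and kept coefficients are illustrative assumptions:

```python
import numpy as np

def mfcc_from_fbank(fbank_energies, n_ceps=13):
    """Log of Mel filter-bank energies (loudness is perceived on a
    logarithmic scale) followed by the DCT of formula (8); the first
    n_ceps coefficients form the MFCC feature vector."""
    log_e = np.log(fbank_energies + 1e-10)  # small offset avoids log(0)
    N = log_e.shape[-1]
    n = np.arange(N)
    # DCT basis: cos[(pi / N) * (n + 1/2) * k], one row per coefficient k
    basis = np.cos(np.pi / N * (n + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    return log_e @ basis.T

# 48 frames of 26 Mel filter-bank energies (assumed sizes)
energies = np.random.default_rng(0).random((48, 26)) + 0.1
mfcc = mfcc_from_fbank(energies)
print(mfcc.shape)  # (48, 13)
```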
In the process of speech recognition, the most difficult step is comparing two speech elements
that also vary in duration; for this reason, quite a lot of such comparison procedures and methods
exist.
To calculate local deviations between elements of two sequences, the Euclidean distance between
the values of the two elements can be used [13,14]. As a result, the deviation matrix d is obtained:
$$d(p, q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}. \quad (9)$$
Next, it is necessary to calculate the matrix of minimum distances between the sequences. Its
elements are calculated according to the following formula:
$$md_{ij} = d_{ij} + \min(md_{i-1,j-1}, md_{i-1,j}, md_{i,j-1}), \quad (10)$$
where $md_{ij}$ is the minimum accumulated distance between the i-th element of sequence P and the
j-th element of sequence Q.
After that, we find the minimum-cost path in the obtained matrix from the element
$md_{nm}$ to $md_{00}$, following these rules [15-16]:
1. The path moves only backward through the matrix: indices i and j never increase.
2. Each index is decremented by at most one per step.
3. The path starts in the lower right corner and ends in the upper left corner of the matrix.
Based on the obtained path, we estimate the global deformation:
$$GC = \frac{1}{k}\sum_{i=1}^{k} w_i, \quad (11)$$
where $w_i$ are the elements of the minimum deformation path and $k$ is the number of elements of
the deformation path.
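Formulas (9)-(11) can be combined into one DTW routine. The sketch below assumes the path weights $w_i$ in (11) are the local distances d along the backtracked path:

```python
import numpy as np

def dtw_cost(P, Q):
    """Global deformation GC between two feature sequences: local
    Euclidean distances (9), accumulated minimum-distance matrix (10),
    backtracking under the path rules, and averaging per formula (11)."""
    n, m = len(P), len(Q)
    d = np.array([[np.linalg.norm(p - q) for q in Q] for p in P])   # (9)
    md = np.full((n, m), np.inf)
    md[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            md[i, j] = d[i, j] + min(                               # (10)
                md[i - 1, j - 1] if i and j else np.inf,
                md[i - 1, j] if i else np.inf,
                md[i, j - 1] if j else np.inf)
    # Backtrack from md[n-1, m-1] to md[0, 0]; indices never increase
    i, j = n - 1, m - 1
    path = [(i, j)]
    while i or j:
        steps = []
        if i and j:
            steps.append((md[i - 1, j - 1], i - 1, j - 1))
        if i:
            steps.append((md[i - 1, j], i - 1, j))
        if j:
            steps.append((md[i, j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    w = [d[i, j] for i, j in path]
    return sum(w) / len(w)                                          # (11)

P = np.array([[0.0], [1.0], [2.0]])
Q = np.array([[0.0], [1.0], [1.0], [2.0]])
print(dtw_cost(P, Q))  # 0.0: Q is P with one repeated element
```

The zero cost for P and Q illustrates why DTW suits speech: the repeated element in Q (a "stretched" segment) is absorbed by the warping instead of being penalized.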
4. Implementation
To evaluate the quality of the proposed solutions regarding the speech signal recognition process, a
command-based, speaker-dependent automatic speech signal recognition system was designed that
converts the speaker's input speech signal into a text formatting command within the text editor.
C# was selected as the programming language. For the software implementation, an object-oriented
approach to the analysis of the data domain and to the construction of the architecture of the
internal classes of the system was used. Testing of the program code was carried out manually.
The developed system provides the following functions: formation of a dictionary of speaker
commands, training of the system for a specific speaker, and execution of recognized commands.
The automatic speech recognition system is used as a software module in a graphic editor plug-in,
which makes the functionality of the graphic editor available through voice control commands. The
developed voice recognition system has a limited command dictionary and is aimed at recognizing
20 isolated command words. The developed application has the following functional requirements:
- the possibility of training the system for a specific speaker;
- recognition of received voice commands;
- conversion of voice commands into commands of the graphic editor.
The scheme of application of DTW and MFCC, used during the development of the system of
automatic recognition of voice commands, is presented in Figure 3.
The use of an object-oriented approach made it possible to present the program code in the form of
a set of classes available for repeated use. The architecture of the developed application is presented
in the form of a class diagram in Figure 4.
When the Start Training button is pressed, the user needs to say the command, after which the
training results window opens, showing which command the spoken signal is most similar to. If
necessary, the command can be added to the command dictionary with the "Save as template" button.
The software application was tested in the following mode: 20 test commands, 10 speakers, and 10
requests to pronounce each command; the sampling frequency of the speech signals is 44.1 kHz, the
resolution of a speech sample is 16 bits, and the number of channels is 2. The templates of the test
commands coincide with the names of the commands presented on the Microsoft Word quick access
panel in the form of icon buttons. In order to establish the adequacy of the proposed solutions and
estimate the speech signal recognition error, a transaction log was developed, which recorded a
description of each speaker command recognition operation. Analysis of the transaction log data
allowed calculating the average value of recognition accuracy for each command. Recognition
accuracy values can be determined as [8,17]:
$$Accuracy = \frac{N_{correct}}{N_{total}} \cdot 100\%, \quad (12)$$
where $N_{correct}$ is the number of correctly recognized pronunciations of a command and
$N_{total}$ is the total number of pronunciations of that command.
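Computing per-command accuracy from such a transaction log can be sketched as follows, assuming each log entry records the spoken command and the recognized command (the data layout is hypothetical):

```python
from collections import defaultdict

def accuracy_per_command(log):
    """Per-command recognition accuracy from (spoken, recognized) pairs:
    correctly recognized attempts / total attempts * 100%."""
    totals, correct = defaultdict(int), defaultdict(int)
    for spoken, recognized in log:
        totals[spoken] += 1
        correct[spoken] += (spoken == recognized)
    return {cmd: 100.0 * correct[cmd] / totals[cmd] for cmd in totals}

# 10 attempts at "Bold", 9 recognized correctly
log = [("Bold", "Bold")] * 9 + [("Bold", "Italic")]
print(accuracy_per_command(log))  # {'Bold': 90.0}
```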
Table 1 shows the average results of recognition level estimation for the described conditions of
the experiment.
Table 1
The results of the conducted experiment
Command Accuracy Rate
Align left 95%
Align right 93%
Justify 98%
Center 96%
Bold 93%
Strikethrough 90%
Italic 97%
Bullets 92%
Multilevel list 91%
Superscript 91%
Numbering 93%
Cancel input 98%
Display all signs 94%
Repeat the input 96%
Subscript 89%
Underline 91%
Page break 95%
Save 99%
Increase font size 94%
Decrease font size 94%
Based on the data presented in Table 1, the average level of command recognition accuracy can be
determined. The average recognition error does not exceed 7%. It should be noted that the main
factors of erroneous recognition are the user's diction and the similarity of voice commands [18],
for example, "Superscript" and "Subscript". Comparison of the obtained recognition result with the
recognition accuracy of other systems using DTW and MFCC allows drawing conclusions about the
adequacy and effectiveness of the proposed design solutions.
5. Conclusions
Automatic speech recognition allows computers, machines and digital devices to understand
natural speech and perform appropriate actions. The generalized speech recognition algorithm
includes the steps of:
1. Receiving an analog signal.
2. Transformation of the received signal into a digital one.
3. Spectrogram formation using the FFT and the Hamming window.
4. Computation of the filter bank for MFCC.
5. Application of DTW to compare the received command with a template.
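At recognition time, the steps above reduce to nearest-template search: the command whose stored template yields the smallest DTW cost wins. A minimal sketch with a simplified one-dimensional DTW and toy feature tracks (all names and values are illustrative):

```python
import numpy as np

def simple_dtw(a, b):
    """Accumulated DTW cost between two 1-D feature sequences
    (cost matrix only; no backtracking needed for classification)."""
    n, m = len(a), len(b)
    md = np.full((n + 1, m + 1), np.inf)
    md[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            md[i, j] = abs(a[i - 1] - b[j - 1]) + min(
                md[i - 1, j - 1], md[i - 1, j], md[i, j - 1])
    return md[n, m]

def recognize(features, templates):
    """Return the name of the template with the smallest DTW cost."""
    return min(templates, key=lambda name: simple_dtw(features, templates[name]))

templates = {"Bold": [0, 2, 4, 2, 0], "Save": [3, 3, 1, 1, 3]}  # toy tracks
print(recognize([0, 2, 2, 4, 2, 0], templates))  # Bold
```

In a real system the template values would be per-frame MFCC vectors rather than scalars, but the nearest-template decision rule is the same.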
There are various techniques, methodologies, and algorithms that are used at each stage of the
speech recognition process. Combining different approaches makes it possible to achieve the desired
level of recognition accuracy. The paper presents the results of the development of an automatic
speech recognition system using DTW and MFCC. DTW, FFT, and MFCC are well-known methods
that are commonly used in speech recognition, but even systems implemented with these methods
have different levels of recognition accuracy. Recognition accuracy varies significantly, for example,
from 46% [19] to 96.4% [20], or even 99.5% for a dictionary of 8 isolated words [21], and largely
depends on the implementation of the recognition system, the volume of the dictionary, the input
signal transmission method, and the language used for recognition. The proposed system is
speaker-dependent, has a limited dictionary, and is designed to recognize 20 short user commands
spoken by the user in Ukrainian through a microphone. The developed system is used as a program
module of a plug-in for a graphic editor. The use of such a plug-in allows the user to control the
functionality of the graphic editor through voice commands. The language of the software
implementation is C#. The software was developed using an object-oriented approach as a Windows
Forms desktop application, which sets options and records the results of the experiments, and as a
plug-in for the graphical editor.
A feature of the developed automatic speech recognition system is that it works with the user in
real time, unlike other developments where an audio recording of a certain format must be
transmitted to the input. At the same time, the system response time (recognition time) does not
exceed 2 seconds on average.
The DTW speech command recognition algorithm is an effective method for accounting for
temporal variations when comparing related time series in automatic speech recognition and can be
effectively applied in systems with a limited dictionary, as presented in this paper. Using (12), the
accuracy of command recognition was estimated; the average accuracy over all commands and
recognition variants was approximately 93%. Comparing the obtained level of accuracy with the
results of other authors is not indicative, since the systems use different dictionaries, different
commands or isolated words, and different natural languages. However, the obtained recognition
level indicates high efficiency of the proposed solutions and demonstrates the promise of combining
the DTW, FFT, and MFCC approaches for speech recognition in the Ukrainian language. The
methods, principles, and algorithms used in the research, as well as the results of testing the
developed system, make it possible to state that the results of the research are valid and
reproducible.
6. References
[1] C. Agarwal, P. Chakraborty, S. Barai, V. Goyal, Quantitative analysis of feature extraction
techniques for isolated word recognition, Advances in computing and data sciences (2019) 618–
627. doi: 10.1007/978-981-13-9942-8_58.
[2] I. Sultana, N. K. Pannu, Automatic speech recognition system, International journal of advance
research, ideas and innovations in technology 4 (2018) 277–279.
[3] D. Pandey, K. K. Singh. Implementation of DTW algorithm for voice recognition using VHDL,
in: Proceedings of the International Conference on Inventive Systems and Control, ICISC ’17,
Coimbatore, 2017, pp. 1– 4. doi: 10.1109/ICISC.2017.8068638.
[4] M. Sood, S. Jain, Speech recognition employing MFCC and Dynamic time warping algorithm,
Innovations in information and communication technologies (2021) 235–242. doi: 10.1007/978-
3-030-66218-9_27.
[5] H. Pawar, N. Gaikwad, A. Kulkarni, A study of techniques and processes involved in speech
recognition system, International journal of engineering and technology 7 (2020) 1905–1911.
[6] S. K. Ali, Z. M. Mahdi, Arabic voice system to help illiterate or blind for using computer,
Journal of physics 1804 (2021) 1–11. doi:10.1088/1742-6596/1804/1/012137.
[7] L. Lerato, T. Niesler, Feature trajectory dynamic time warping for clustering of speech segments,
Journal on Audio, Speech and Music 6 (2019) 1–9. doi: 10.1186/s13636-019-0149-9.
[8] H. F. C. Chuctaya, R. N. M. Mercado, J. J. G. Gaona, Isolated automatic speech recognition of
quechua numbers using MFCC, DTW and KNN, International journal of advanced computer
science and applications 9 (2018) 24–29. doi: 10.14569/IJACSA.2018.091003.
[9] S. Lokesh, M. R. Devi, Speech recognition system using enhanced Mel frequency cepstral
coefficients with windowing and framing method, Cluster Computing 22 (2019) 1669–1679. doi:
10.1007/s10586-017-1447-6.
[10] I. D. Jokić, S. S. Jokić, V.D. Delić, Z.H. Perić, One solution of extension of Mel-frequency
cepstral coefficients feature vector for automatic speaker recognition, Information, technology
and control 49 (2020) 224–236. doi: 10.5755/j01.itc.49.2.22258.
[11] A. Awad, H. Omar, Y. Ahmed, Y. Farghaly, Speech Recognition System Using MFCC and
DTW 4 (2018).
[12] A. S. Haq, C. Setianingsih, M. Nasrun, M. A. Murti, Speech recognition implementation using
MFCC and DTW algorithm for home automation, Proceeding of the Electrical Engineering
Computer Science and Informatics 7 (2020) 78–85. doi: 10.11591/eecsi.v7.2041.
[13] B. Kurniadhani, S. Hadiyoso, S. Aulia, R. Magdalena, FPGA-based implementation of speech
recognition for robocar control using MFCC, TELKOMNIKA 17(4) (2019) 1914–1922. doi:
10.12928/telkomnika.v17i4.12615.
[14] Y. Permanasari, E. H. Harahap, E. P. Ali, Speech recognition using Dynamic Time Warping
(DTW), Journal of Physics 1366 (2019) 1–6 doi:10.1088/1742-6596/1366/1/012091.
[15] R. G. Kanke, R. M. Gaikwad, M. R. Baheti, Enhanced Marathi Speech Recognition Using
Double Delta MFCC and DTW, International Journal of Digital Technologies 2(1) (2023) 49–
58.
[16] I. Wibawa, I. Darmawan, Implementation of audio recognition using mel frequency cepstrum
coefficient and dynamic time warping in wirama praharsini, Journal of Physics 1722(2021) 1–8.
doi: 10.1088/1742-6596/1722/1/012014.
[17] B. Paul, R. Paul, S. Bera, S. Phadikar, Isolated Bangla spoken digit and word recognition using
MFCC and DTW, Engineering mathematics and computing 1042 (2022) 235–246. doi:
10.1007/978-981-19-2300-5_16.
[18] R. Harshavardhini, P. Jahnavi, S.K. Zaiba Afrin, S. Harika, Y. Annapa, N. Naga Swathi, MFCC
and DTW based speech recognition, International research journal of engineering and technology
5 (2018) 1937–1940.
[19] I. Khine, C. Su, Speech recognition system using MFCC and DTW, International Journal of
Electrical, Electronics and Data Communication 6 (2018) 29–34.
[20] S. Riyaz, B. L. Bhavani, S. V. Kumar, Automatic Speaker Recognition System in Urdu using
MFCC and HMM, International Journal of Recent Technology and Engineering 7 (2019) 109–
113.
[21] N. Adnene, B. Sabri, B. Mohammed, Design and implementation of an automatic speech
recognition based voice control system, EasyChair Preprint 5116 (2021) 1–7.