Multithreaded Java Approach To Speaker Recognition: Radosław Weychan, Tomasz Marciniak, Adam Dąbrowski

This document discusses using multithreading in Java to improve the speed of speaker recognition algorithms. It presents two approaches: 1) Using multithreading to reduce the total time needed for training and testing speaker models on large databases. 2) Using multithreading to reduce the time required for a single speaker recognition, such as in access systems. The authors implemented a multithreaded Gaussian mixture model approach in Java and measured reductions in computation time of over 5 times on 8-core CPUs. They make their implementation available on GitHub.


SIGNAL PROCESSING: Algorithms, Architectures, Arrangements, and Applications (SPA 2016)

The Institute of Electrical and Electronics Engineers Inc.
September 21-23, 2016, Poznań, Poland

Multithreaded Java approach to speaker recognition

Radosław Weychan, Tomasz Marciniak, Adam Dąbrowski
Division of Signal Processing and Electronic Systems
Chair of Control and Systems Engineering
Poznań University of Technology
Poznań, Poland
[email protected]

Abstract: In this paper an analysis of a multithreaded approach to speaker recognition is presented. Two cases have been investigated: the use of multithreading, first, to reduce the total time of training and testing (when processing a speaker database) and, second, to reduce the time of a single speaker recognition (e.g., in voice-based access systems). We have investigated the processing time as a function of the number of threads in order to find the optimal number for an 8-core CPU. The obtained results show that the computation time can be reduced even more than 5 times for this kind of processor. The prepared implementation is available in a GitHub source code repository.

Keywords: speaker identification, Java, multithreading

I. INTRODUCTION

Speaker recognition is a subject described in the literature with various applications, e.g., general-purpose access control, high-level customization, and speaker diarisation.

The main idea of speaker recognition is to extract individual features from the speaker's voice and to generate a model which guarantees the further ability to distinguish individual speakers. The most common features extracted from voice are mel-frequency cepstral coefficients (MFCC) [1], computed according to the following formula:

    \mathrm{MFCC}(j) = \sum_{i=1}^{M} \log(m_i) \cos\left( \frac{\pi j (i - 0.5)}{M} \right), \quad j = 1, \ldots, N    (1)

where:
N - the number of cepstral features,
M - the number of mel-scale filters in the filterbank,
m_i - the mel-scale spectral magnitudes.

Typically, 13 cepstral coefficients are extracted from a speech frame of length from 10 to 30 ms (depending on the sampling rate of the analog-to-digital conversion).

Having a set of those features, collected from a recording of appropriate length (at least 20 seconds of speech [2]), a modeling technique can be applied. The basis of most modern modeling techniques in the field of speaker recognition is the Gaussian mixture model (GMM) [2]. The algorithm models a set of numbers (vectors of particular MFCCs in this case) with the use of a weighted mixture of Gaussians and the expectation-maximization (EM) technique. The final form of the formula representing the GMM model is as follows:

    p(x \mid \lambda) = \sum_{i=1}^{K} w_i \, g(x \mid \mu_i, \Sigma_i)    (2)

where:
x - a set of MFCCs,
\lambda - the final model,
g(x \mid \mu_i, \Sigma_i) - the Gaussian component densities (Gaussians),
K - the number of Gaussians,
w_i - the i-th weight,
\mu_i - the i-th mean,
\Sigma_i - the i-th covariance matrix.

The number of Gaussians depends on the application. In the standard GMM algorithm it is usually set between 16 and 64 components [2].

The recognition of a particular unlabeled voice sample lies in the computation of the log-likelihood of the extracted features (MFCCs) under each model in the previously prepared database. The model with the highest obtained value is assumed to indicate the recognized speaker.

There are also other modeling techniques: the Gaussian mixture model with a universal background model (GMM-UBM) [3], the support vector machine (SVM) [4], joint factor analysis (JFA) and i-vectors [5], but in most cases the GMM is the basis for them.

II. SPEECH PROCESSING AND MODELING TOOLKITS

Although the number of papers and experiments in the field of speaker recognition is rather large, there are only a few tools which can be used for developing a speaker recognition system. One of the most popular is the Matlab environment with toolboxes like VoiceBox [6] (vector quantization [7], Gaussian mixture models) or the MSR Identity Toolbox [8] (GMM-UBM, i-vectors), but it is a closed environment for scientific purposes. There are also C++ toolboxes like HTK [9] (hidden Markov models [9]) or Alize [10] (GMM-UBM), but they are dedicated to scientific purposes and are developed more as general systems than as libraries which can be used in various scenarios. More flexible in this case is Python's scikit-learn [11] package, which includes a very good implementation of the GMM, but it can be used only on machines which are able to run Python. In the case of Java, which can be run on most modern devices, there is the WEKA [12] toolkit, but it is hard to use from the practical point of view (no direct access to the speaker model parameters when using GMM).
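As a concrete illustration of formula (2) and the log-likelihood scoring described above, the following is a minimal Java sketch. It is not taken from the paper's code: all names are ours, and diagonal covariance matrices are assumed (a common simplification in speaker recognition). The log-sum-exp trick is used for numerical stability.

```java
// Sketch of GMM scoring per Eq. (2), assuming diagonal covariances.
// Class and method names are illustrative, not the authors' code.
public final class GmmScore {

    // Log-density of one diagonal Gaussian component g(x | mu, var).
    static double logGaussian(double[] x, double[] mu, double[] var) {
        double logDet = 0.0, quad = 0.0;
        for (int d = 0; d < x.length; d++) {
            logDet += Math.log(var[d]);
            double diff = x[d] - mu[d];
            quad += diff * diff / var[d];
        }
        return -0.5 * (x.length * Math.log(2 * Math.PI) + logDet + quad);
    }

    // log p(x | lambda) = log sum_i w_i g(x | mu_i, Sigma_i),
    // evaluated with the log-sum-exp trick to avoid underflow.
    public static double logLikelihood(double[] x, double[] w,
                                       double[][] mu, double[][] var) {
        double[] terms = new double[w.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < w.length; i++) {
            terms[i] = Math.log(w[i]) + logGaussian(x, mu[i], var[i]);
            if (terms[i] > max) max = terms[i];
        }
        double sum = 0.0;
        for (double t : terms) sum += Math.exp(t - max);
        return max + Math.log(sum);
    }
}
```

For recognition, such a score would be accumulated over the MFCC vectors of the test utterance under every model in the database, and the highest-scoring model taken as the recognized speaker.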
Another fact is that almost none of the presented toolkits makes use of multithreading, whereas modern CPUs have up to 8 or, in some cases, even more cores. The only package which uses multithreading is the MSR Identity Toolbox, but it cannot be used outside Matlab. The number of papers related to parallel implementations of speaker recognition is very limited. The presented solutions are based on ineffective methods like VQ [13] and were not shared to be used by others [14].

In this paper we present a multithreaded approach to speaker recognition with the use of the general GMM modeling algorithm. The main idea of this paper is to measure the reduction of the time of the recognition process when using multithreading, as a function of the number of processors and threads. The proposed approach was used in two scenarios: as a technique for reducing the total time of the training/testing part of speaker recognition when using a large number of input files, and as a technique for reducing the time of a single speaker recognition, e.g., in security and access systems. We used the Java programming language because it can be used on most modern platforms, including small computers, smartphones, and tablets, as the implementations of the algorithms are independent of the processor type. The second reason is that Java has a very extensive multithreading library. It is also very fast in computations. ANSI C and C++ are faster in general, but code written in these languages is CPU-dependent, in contrast to Java.

The provided implementation of the GMM is based on the implementation available in the scikit-learn Python package, as it is very intuitive to use, and also on our previous solutions presented in [15, 16, 17, 18, 19]. The experiments showed that the training time was lower in the case of the Python implementation, but it uses a lot of C++ code in the time-consuming maximization parts of K-means and GMM. In the testing part (computing the log-likelihood in this case), where no C++ parts are used, Java is much faster. The detailed data will be presented in the following sections.

III. CONCURRENCY IN JAVA

The methods, classes, and interfaces related to concurrency are provided by the java.util.concurrent package. The basic way of creating a new thread is to put the task code inside the run() method of a class which implements the Runnable interface:

class MyRunnableClass implements Runnable
{
    public void run() {
        // the task code
    }
}

In order to run the thread, a new instance of the class must be created and set as a parameter of a new Thread object:

Runnable r = new MyRunnableClass();
Thread newThread = new Thread(r);
newThread.start();

As long as the threads do not share any resources and are not dependent on each other, they behave like independent processes. The system automatically assigns time slots for them. The situation is much more complicated when the threads process the same data source, or return data which is used by other threads, so they must be synchronized.

There are many solutions for synchronization between threads. Critical sections can be blocked for other threads with the use of the lock() method of the ReentrantLock class, or with the use of the synchronized keyword. Primitive data can be declared volatile, which informs the compiler and the Java virtual machine that it may be modified by more than one thread at a time. Having this information, the Java virtual machine can handle the access safely.

The low-level concurrency methods can, in most cases, be replaced by high-level mechanisms like queues with producers and consumers. The producer threads put elements (results) into the queue, and the consumer threads take them and process them further (sending, displaying, etc.). The queue allows for safe data exchange between the threads. In the java.util.concurrent package the automatically synchronized queue is the blocking queue. It blocks a thread attempting to put an element if the queue is full, or attempting to remove an element if the queue is empty. Blocking queues are used for coordinating many threads: specific threads put results into the queue, while other (dedicated) threads take them (removing them from the queue) and continue to process them. The queue automatically controls the flow of the work. If one set of threads works slower than the second set, the latter must wait for the first one to finish its job. There are a few kinds of blocking queues in Java, like ArrayBlockingQueue, LinkedBlockingQueue, PriorityBlockingQueue, DelayQueue, and SynchronousQueue. As the speaker recognition task does not require a special order of data acquisition in the training and testing parts, is not time-triggered, and needs no priority for training/testing data, ArrayBlockingQueue was chosen in this case as the best solution. It is a bounded FIFO blocking queue backed by an array. The length of the array must be set in advance, in contrast to LinkedBlockingQueue, but this prevents high memory consumption.

IV. SOFTWARE DESCRIPTION

The main part of the research was to develop the code related to all steps of speaker recognition: reading unprocessed wav files, computing the MFCC set from the speech, and modeling the coefficients with the use of the GMM-EM algorithm initialized by K-means++. The written software uses the FFT function from the JTransforms library. It was written in pure Java; thus it can be used on all machines, even tablets and smartphones. The producer-consumer scenario was chosen as the basis for further improvements due to its functionality and simplicity.

The presented software contains three independent parts: two related to offline training and testing with the use of speaker databases, and an approach to fast single-speaker recognition which can be used in a real-time access-system scenario.
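The producer-consumer pattern with a bounded ArrayBlockingQueue, as described in Section III, can be sketched as follows. This is our own minimal illustration, not the project's classes; the "poison pill" end marker used to shut the consumer down is an assumption of this sketch.

```java
// Sketch of a bounded producer-consumer pipeline: one producer fills
// the blocking queue with work items (here, file names), one consumer
// drains it. Names and the poison-pill convention are illustrative.
public final class QueueDemo {
    static final String POISON = "<end>";  // signals "no more work"

    public static void process(String[] files,
                               java.util.concurrent.BlockingQueue<String> queue,
                               java.util.List<String> results) {
        Thread producer = new Thread(() -> {
            try {
                for (String f : files) queue.put(f);  // blocks when the queue is full
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String f = queue.take();          // blocks when the queue is empty
                    if (f.equals(POISON)) break;
                    synchronized (results) {          // shared result list must be synchronized
                        results.add("model:" + f);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With `new java.util.concurrent.ArrayBlockingQueue<String>(20)` as the queue, this mirrors the bounded-capacity behavior discussed above; adding further consumer threads follows the same shape, with one poison pill per consumer.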

The multithreaded offline speaker training runs one thread for consecutively filling the queue with training files, and 2 or more threads for taking file handles from the queue and processing them (reading the wav file, computing the MFCCs, and modeling them with the use of the GMM). Saving all data (speaker ID and GMM parameters) is done with the use of serialization. The database (an ArrayList<> of speaker models) is shared among the threads and must be synchronized individually. The queue and the cooperating consumer and producer threads are presented in Fig. 1.

Fig. 1. Scheme of multithreaded speaker training

The multithreaded offline speaker recognition testing works similarly to the previous part. It is presented in Fig. 2. One thread is used for consecutively filling the queue with testing files, and 2 or more threads for taking file handles from the queue, processing them (computing the MFCCs and the log-likelihood under each model the saved database consists of), and saving the result. The access to the result must be synchronized.

Fig. 2. Scheme of multithreaded speaker testing

The last solution, for multithreaded real-time testing, is suitable only for testing a single sentence. One thread is used for consecutively filling the queue with the models to be tested, and 2 or more threads for computing the log-likelihood. The result variable is shared between the threads and the access to it must be synchronized. The workflow is presented in Fig. 3.

Fig. 3. Scheme of multithreaded speaker testing solution for single sentence testing

Fig. 4. Class diagram of speaker training project

Figure 4 presents the class diagram of the project. The dotted lines represent dependency relationships, while the solid line represents an association relationship. The core of the MFCC, GMM, and K-Means classes are the final classes Matrixes and Statistics, containing static methods for manipulating arrays and basic algorithms. They were placed at the top because of their equal level of significance (processing of the audio signal) for the speaker recognition task. SpeakerModel is only a container for data (speaker ID, means, variances, weights) with the ability to compute the log-likelihood of given data under its own model. FileEnumerationTask is a producer thread enumerating files and putting them into the queue declared in the SpeakerTrainingTest class (which includes the main method), while TrainingTask contains the steps for computing the speaker model (consumer task).
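The shared, synchronized result variable used in the testing workflows above (Figs. 2 and 3) can be sketched as a small best-so-far holder. The class and method names here are illustrative, not taken from the project:

```java
// Sketch of a shared recognition result updated by several scoring
// threads. synchronized keeps the compare-and-update of the two
// fields atomic, as required when the result is shared.
public final class RecognitionResult {
    private String bestSpeakerId = null;
    private double bestLogLikelihood = Double.NEGATIVE_INFINITY;

    // Called by each scoring thread with one model's score.
    public synchronized void offer(String speakerId, double logLikelihood) {
        if (logLikelihood > bestLogLikelihood) {
            bestLogLikelihood = logLikelihood;
            bestSpeakerId = speakerId;
        }
    }

    // Read after all scoring threads have joined.
    public synchronized String best() { return bestSpeakerId; }
}
```

Each consumer thread scores the models it takes from the queue and calls offer(); after all threads finish, best() yields the recognized speaker.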
As the number of dependencies between the main classes is very small, the code can be used in a number of various solutions in a very simple way. The software was developed in the Eclipse MARS IDE for Java Developers.

V. THE ANALYSIS OF PROCESSING TIME FOR TRAINING AND TESTING PART

The correctness of the implementation of the machine learning and parameterization techniques was tested with the use of the TIMIT database, which contains 10 short various utterances of 630 speakers recorded with a sampling rate of 16 kSps. In this scenario (without multithreading) 6 files of each speaker were used in the training part, and another 4 in the testing part. The number of Gaussians was set to 32, while the number of MFCCs was set to 13. The resolution of the FFT depended on the sampling rate and was equal to 256. The overall recognition rate was 96.4 %, which proves the correctness of the implementation. As the recognition rate is not the aim of the paper, the subsequent experiments do not provide this parameter.

In order to select the best processing parameters, a set of experiments was performed in which the processing time was measured as a function of the number of threads. It has to be noticed that the length of the queue does not influence the result, but it should be lower than or equal to the number of consumer threads. In most cases it was set to 20, but in the case of 100 threads it also had to be set to 100.

All tests were also performed with the use of the TIMIT speaker database. It has to be noted that the average duration of one TIMIT file is equal to 3.07 s. All files were processed separately, which means that 1 file of each speaker was chosen for the training step, and 9 for testing. The processing parameters were the same as in the first test (32 GMM components, fs = 16 kSps, 13 MFCCs, 256-point FFT).

The experiments were performed on a PC equipped with an 8-core CPU (Intel Core i7 @ 2.93 GHz). It has to be noted that the operating system constantly performs many operations which interrupt each other. It is impossible to switch all of them off, thus the presented results may vary each time, especially when the number of running threads is greater than the number of cores in the CPU. Although the difference may not be significant, only the general tendency has to be taken into consideration.

The first experiment was related to the training part, where the total time of computing the speaker models was measured. Each experiment was performed with a different number of threads, from 1 to 100. The results are presented in Fig. 5. Without multithreading, the total computation time was relatively high: almost 1050 s (about 17 minutes). As the number of threads increases, the total time of the training part decreases even more than 5 times. In the book [20] the optimal number of threads (for an application without communication with external sources) was described as

    N_{\mathrm{threads}} = N_{\mathrm{cores}} + 1    (3)

It was confirmed that a number of threads higher than 9 does not significantly decrease the computation time when the number of CPU cores is 8.

Fig. 5. Influence of number of threads on total training time (offline case)

In the same scenario, the time of processing a single utterance of average length 3.07 s was measured. The results are presented in Fig. 6. It has to be noted that there is no possibility of manual assignment of a thread to a CPU core. It is done automatically by the Java virtual machine and the operating system. Thus, in every case with more than 1 thread, additional time for the system's task management (CPU time slots) is necessary. This is the reason for the increase of the time of processing a single sentence. It has nothing in common with the total time of training, as in Fig. 6 the time was measured individually for each processed file irrespective of the number of threads. The processing time of a single file was about 1.7 s, which is less than the average duration of a single sentence. The average training time of the Python implementation was about 0.9 s but, as was previously stressed, it uses C++ for the time-consuming parts of the K-means-based GMM.

Fig. 6. Influence of number of threads on processing time of single file (offline case)

The time of processing a single file increases in general, as the Java virtual machine assigns various tasks to the available time slots of the CPU.
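Formula (3) can be applied at run time; the following is our own sketch (not the paper's code) of deriving the pool size from the hardware the JVM reports:

```java
// Sketch of formula (3): for CPU-bound work without communication
// with external sources, use N_cores + 1 threads.
public final class PoolSize {
    public static int optimalThreads() {
        // availableProcessors() reports the logical cores visible to the JVM
        return Runtime.getRuntime().availableProcessors() + 1;
    }
}
```

On the 8-core machine used in the experiments this yields 9 threads, matching the saturation point observed in Fig. 5.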
In the case of 100 threads, the processing of a single (average) sentence took over 32 seconds, while the total training time was at the same level as for 9 threads, for which the time of processing a single sentence was significantly lower: about 3 s.

The second experiment was related to the testing part, where 5670 files of 630 speakers were compared against the database of 630 speakers. The results are presented in Fig. 7. The general tendency is very similar to the result of the previous experiment. The function saturates around the 8th thread; the total testing time is also more than 5 times lower in comparison to the single-thread case.

The processing time of an utterance of average duration 3.07 s is presented in Fig. 8. The number of operations performed for a single tested sentence seems to be smaller in this case (only the computation of the MFCCs and the log-likelihood), but the number of comparisons can be very high, depending on the size of the speaker database. In this case, 630 comparisons for a single file still took less time than the average sentence duration (in the case of a single thread). With the use of the multithreading approach, the total time of the testing part can also be decreased more than 5 times. The general tendency of increasing processing time can be noted, which is a consequence of the same process described for the previous experiment.

In comparison to the Python implementation of computing the probability model, the Java approach is much faster: 1.7 s (similar to the training part) instead of 3 s, which makes the Python implementation hard to use in real-time scenarios, in which the speaker databases can be even larger than 630 speakers.

Fig. 7. Influence of number of threads on total testing time (offline case)

Fig. 8. Influence of number of threads on processing time of single file (offline case)

The last experiment differed from the previous ones: the queue was filled with the speaker models instead of files. This queue-related approach is suitable for single file/sentence processing. In other cases (training/testing with the use of all input sentences, as in the previous experiments), other synchronization techniques have to be applied. The queue may be filled again with models already processed when the current recognition has not finished and the next file has started to be processed; in this case the number of threads may also increase unexpectedly. In this more complex testing case, all 5670 files were tested, but the solution for synchronization was not optimal in the authors' opinion and needs to be revised. Thus, only the results for processing a single file are presented in Fig. 9.

In the previous experiment the time of testing a single file against 630 speakers was about 1.7 s. The use of a queue for storing the models decreased the recognition time up to 3.5 times. The best result was obtained for a number of threads equal to the number of CPU cores.

Fig. 9. Influence of number of threads on processing time

VI. CONCLUSIONS

Multithreading can be used in speaker recognition in two ways: to speed up the general training and testing parts and, moreover, to speed up a single recognition, e.g., in security applications. In the first case, the total time of training and testing has been reduced more than 5 times with the use of an 8-core CPU and 8-9 threads. In the second case, the acceleration was about 3.5
and allowed for real-time recognition with the use of speaker databases containing even thousands of speakers.

The presented software is available in the GitHub source code repository [21].

This work was prepared within the DS-2016 project.

REFERENCES

[1] F. Zheng, G. Zhang, S. Zhanjiang, "Comparison of different implementations of MFCC," Journal of Computer Science and Technology, vol. 16, no. 6, 2001, pp. 582-589.
[2] D. Reynolds, "Robust text-independent speaker identification using Gaussian mixture models," IEEE Trans. Speech Audio Proc., vol. 3, no. 1, 1995, pp. 72-83.
[3] D. Reynolds, T. Quatieri, R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, no. 10, 2000, pp. 19-41.
[4] W. M. Campbell, D. E. Sturim, D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, no. 5, 2006, pp. 308-311.
[5] P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, 2007, pp. 1435-1447.
[6] M. Brooks, "Voicebox: Speech processing toolbox for Matlab." [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[7] C. M. Bishop, Pattern Recognition and Machine Learning. Springer Science & Business Media, 2006.
[8] S. O. Sadjadi, M. Slaney, L. Heck, "MSR Identity Toolbox v1.0: A Matlab toolbox for speaker recognition research," Speech and Language Processing Technical Committee Newsletter, 2013.
[9] Cambridge University Engineering Department, "HTK: The Hidden Markov Model Toolkit," 2004. [Online]. Available: http://htk.eng.cam.ac.uk/
[10] A. Larcher, J.-F. Bonastre, B. G. Fauve, K.-A. Lee, C. Levy, H. Li, J. S. Mason, J.-Y. Parfait, "Alize 3.0: open source toolkit for state-of-the-art speaker recognition." [Online]. Available: http://mistral.univ-avignon.fr/index_en.html
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, 2011, pp. 2825-2830.
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, vol. 11, no. 1, 2009.
[13] R. Soganci, F. Gurgen, H. Tupcuoglu, "Parallel Implementation of a VQ-Based Text-Independent Speaker Identification," Advances in Information Systems, 2005, pp. 291-300.
[14] T. Herbig, F. Gerl, W. Minker, Self-Learning Speaker Identification: A System for Enhanced Speech Recognition. Springer, 2011.
[15] R. Weychan, T. Marciniak, A. Dabrowski, "Implementation aspects of speaker recognition using Python language and Raspberry Pi platform," IEEE Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, 2015, pp. 162-167.
[16] T. Marciniak, R. Weychan, A. Krzykowska, "Speaker recognition based on telephone quality short Polish sequences with removed silence," Przeglad Elektrotechniczny, vol. 88, no. 6, 2012, pp. 42-46.
[17] A. Dąbrowski, T. Marciniak, A. Krzykowska, R. Weychan, "Influence of silence removal on speaker recognition based on short Polish sequences," IEEE Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, 2011, pp. 159-163.
[18] T. Marciniak, R. Weychan, Sz. Drgas, A. Dąbrowski, A. Krzykowska, "Speaker recognition based on short Polish sequences," IEEE Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), 2010, pp. 95-98.
[19] R. Weychan, T. Marciniak, A. Stankiewicz, A. Dabrowski, "Real Time Recognition of Speakers from Internet Audio Stream," Foundations of Computing and Decision Sciences, vol. 40, no. 3, September 2015, DOI: 10.1515/fcds-2015-0014, pp. 223-233.
[20] B. Goetz et al., Java Concurrency in Practice. Addison-Wesley Professional, 2006.
[21] Source code repository for the multithreaded speaker recognition project. [Online]. Available: https://github.com/audiodsp/Java-multithreaded-GMM-speaker-recognition