SpokenLanguages2 - Report
Personal Information
Approach Used
SUMMARY
I used a very simple, yet highly effective, approach. Basically, I trained 176 Gaussian Mixture Models with 2048 mixtures and 62 features (one GMM for each language). Finally, I used Logistic Regression to calibrate the individual language scores.
In terms of steps:
(1) Use sox to convert the MP3s to WAVs at 16 kHz, 16-bit, 1 channel
(2) Use sphinx_fe (from CMU Sphinx, BSD license) to extract 13 Mel Frequency Cepstral Coefficient (MFCC) features
(3) Add shifted delta coefficients (SDC) using the 7-1-3-7 parametrization (sketched after this list)
(4) At this point, every WAV is reduced to a 999 x 62 feature array. The 999 comes from the fact that the MFCCs are computed every 10 ms and one frame is lost in the conversion process. The 62 comes from 13 MFCC features per frame + 7 x 7 SDC features per 10 ms frame
(5) Collect every feature vector from every WAV by language, thereby obtaining a ((376 * 999) x 62) feature matrix for each language
(6) Train a Gaussian Mixture Model with 2048 mixtures on each language feature matrix, using https://github.com/juandavm/em4gmm (GPL license). The training was done using the standard EM algorithm, with a 0.001 tolerance for the log-likelihood increase at each step
(7) Finally, calibrate the individual language predictions using Logistic Regression (sklearn.linear_model.LogisticRegression(C=1e5, random_state=0))
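To make steps (1)-(4) concrete, here is a minimal Python sketch of the per-file feature pipeline. The sox flags match the conversion described above; the sphinx_fe flags, the reading of its output, and the SDC edge handling are assumptions on my part, and the function names are illustrative:

    import subprocess
    import numpy as np

    def convert_and_extract(mp3_path, wav_path, mfc_path):
        # (1) MP3 -> 16 kHz, 16-bit, mono WAV
        subprocess.check_call(
            ["sox", mp3_path, "-r", "16000", "-b", "16", "-c", "1", wav_path])
        # (2) 13 MFCCs every 10 ms via sphinx_fe (exact flags are an assumption)
        subprocess.check_call(
            ["sphinx_fe", "-i", wav_path, "-o", mfc_path, "-mswav", "yes"])

    def sdc(mfcc, N=7, d=1, P=3, k=7):
        # (3) Shifted delta coefficients, N-d-P-k = 7-1-3-7: for each frame t,
        # stack k delta blocks spaced P frames apart, each delta spanning
        # +/- d frames over the first N cepstral coefficients -> N*k = 49 dims.
        T = mfcc.shape[0]
        c = mfcc[:, :N]
        out = np.zeros((T, N * k))
        for t in range(T):
            for i in range(k):
                hi = min(t + i * P + d, T - 1)  # clamp at the clip edges
                lo = max(t + i * P - d, 0)      # (edge handling is assumed)
                out[t, i * N:(i + 1) * N] = c[hi] - c[lo]
        return out

    # (4) With mfcc as the (999, 13) array read back from the sphinx_fe
    # output, stacking gives the 999 x 62 per-WAV feature array:
    # features = np.hstack([mfcc, sdc(mfcc)])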
APPROACHES CONSIDERED
From the start, I considered a GMM-based approach, based on its decent performance in the first iteration of the contest, SpokenLanguages. The huge difference between that contest and this one was in terms of the data provided:
(1) There were far more samples per language for training (376 vs 120)
(2) From what I could tell, the samples were from male speakers only, which helped immensely
From the get-go, GMM performed admirably, even with only 13 features and 64 mixtures. From that point, it was a matter of training larger models fast enough (which I would say was the biggest challenge for me in this competition).
First, there is an init step, detailed in compute_general_dict, which computes language→id mappings and so on.
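As a small, hedged illustration (the real compute_general_dict is not reproduced here, so its inputs are an assumption), the mapping step might look like:

    def compute_general_dict(languages):
        # Build language <-> integer id mappings from the sorted
        # list of the 176 language names.
        lang_to_id = {lang: i for i, lang in enumerate(sorted(languages))}
        id_to_lang = {i: lang for lang, i in lang_to_id.items()}
        return lang_to_id, id_to_lang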
There is a helper method, predict_first_K, which predicts the results for the first K samples from the testing dataset, in alphabetical order. This is useful for replicating the results.
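A minimal sketch of what such a helper could look like, assuming a flat test directory and the per-file predict_mp3 function described below:

    import os

    def predict_first_K(test_dir, K):
        # Predict the first K test samples in alphabetical order,
        # so partial runs are reproducible.
        samples = sorted(os.listdir(test_dir))[:K]
        return [predict_mp3(os.path.join(test_dir, s)) for s in samples]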
Finally, in order to train the model, train_models is used. It is similar to predict_mp3, with the caveat that gmmtrain is used instead of gmmclass, and, obviously, the logistic regression is fitted (instead of calling predict_proba).
(1) In the first step, features are extracted as in the predict stage, and a feature matrix is assembled for each language (by vertically stacking the feature matrices for all samples in the language, i.e. the 999 x 62 matrices).
(2) "gmmtrain -d FEATURE_MATRIX.gz -m 2048_62_emm_LANG -s 0.001 -t 4" is called for each LANG (sketched after this list)
(3) Once all models are trained, prediction is done (exactly as in the predict stage) for the training
dataset
(4) Finally, a sklearn.linear_model.LogisticRegression(C=1e5, random_state=0) is fitted over the corresponding language probabilities. What this does is calibrate the predictions of each individual language model (i.e. toning down overly optimistic language models).
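To make steps (2)-(4) concrete, here is a hedged Python sketch. The gmmtrain command line is the one quoted above (I assume the feature-matrix path also varies per LANG); the score-matrix shapes and helper names are assumptions rather than my actual code:

    import subprocess
    from sklearn.linear_model import LogisticRegression

    def train_language_gmms(languages):
        # (2) One 2048-mixture, 62-feature GMM per language via em4gmm.
        for lang in languages:
            subprocess.check_call(
                ["gmmtrain", "-d", "FEATURE_MATRIX_%s.gz" % lang,
                 "-m", "2048_62_emm_%s" % lang, "-s", "0.001", "-t", "4"])

    def calibrate(train_scores, train_labels):
        # (3)-(4) train_scores: an (n_samples, 176) matrix of per-language
        # GMM scores on the training set; train_labels: true language ids.
        lr = LogisticRegression(C=1e5, random_state=0)
        lr.fit(train_scores, train_labels)
        return lr

    # At prediction time, lr.predict_proba(test_scores) yields the
    # calibrated per-language probabilities.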
OPEN SOURCE RESOURCES AND TOOLS USED
I made heavy use of Python and related Python machine learning libraries. Specifically:
● Python
● Numpy, pandas, sklearn for data processing
● multiprocessing
● joblib for model serialization
● sox for converting from MP3 to WAV
I used sphinx_fe for extracting the 13 MFCC features, and em4gmm for the speed boost over sklearn's GMMs.
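For instance, serializing the fitted calibration model with joblib is a one-liner (the file name is illustrative; lr is the fitted LogisticRegression from the sketch above):

    import joblib

    joblib.dump(lr, "calibration_lr.joblib")  # joblib.load(...) restores it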
The main advantage of the above approach is its simplicity (only one model type plus one calibration model are involved) and its applicability to pretty much any good dataset (good meaning lots of samples and same-sex speakers).
The biggest disadvantage is the time requirement for prediction (which is actually significantly slower than training). It took one c4.8xlarge machine 16 hours to predict the results for the 12,320 test samples, and 4 machines the same amount of time for the training set prediction (which I needed in order to calibrate the predictions using logistic regression).
COMMENTS ON LIBRARIES
I used a fairly standard Python stack for machine learning. The two libraries mentioned above (sphinx_fe and em4gmm) were crucial for getting quality features and training the models quickly enough.
The only tuning was of the LogisticRegression C parameter, namely C=1e5. This was actually the first value I tried, and it performed the best. It should be noted that changing this parameter would not meaningfully affect the result.
It should be said from the get-go that the approach I used is definitely not the state of the art. That distinction goes to i-vectors trained from a universal background model GMM (UBM-GMM), with a probabilistic LDA (PLDA) fusion mechanism. However, the main advantage of that method over the one I chose is higher robustness to noise, which is basically non-existent in the dataset for this competition. I doubt that using the i-vector approach could have improved on the result significantly.