Weather System

This document describes the process of preparing data, training, and testing for a weather system speech recognition project using Sphinx software. It involves recording speech data from multiple speakers saying city names, segmenting the recordings into individual files, and preparing various files needed by Sphinx including audio files, dictionaries, filler files, phone files, test/train file IDs and transcriptions, a corpus file, and a language model file generated from the corpus. Over 30 minutes of recording is required from each speaker to collect a large dataset for the speaker independent system.

Uploaded by

rida fatima
Copyright
© Attribution Non-Commercial (BY-NC)
Introduction
This document describes data preparation, training and testing for the Weather System, which uses the Sphinx software as its speech toolkit. The system is intended to be speaker independent, so a large amount of data has to be prepared. The Weather System is a large vocabulary speech system. It contains speech from a number of speakers and therefore has to deal with a huge amount of data.

Preparing Data
Define Dataset
The first step is to determine the vocabulary size of the system; the dataset for training and testing is then prepared on that basis. In our system the dataset consists of the names of 36 cities of Punjab, spoken in isolated form. We concentrate on 19 of these cities because their weather data is easily available on the internet. A speaker independent system requires a huge amount of data, so a large number of speakers are needed to record the city names. All 36 city names are written on 30 different lists in random order: the first 15 lists contain 19 districts each and the next 15 lists contain the next 16 districts each. A single speaker therefore produces 525 city wave files in total (15 × 19 = 285 plus 15 × 16 = 240), of which 285 are presently processed. At least 30 minutes of recording is required from each speaker.
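The randomised prompt lists described above could be generated along the following lines. This is only a sketch: the function name `make_prompt_lists`, the fixed seed and the five placeholder city codes are assumptions made for illustration, not part of the original procedure.

```python
import random

def make_prompt_lists(cities, n_lists, seed=0):
    """Return n_lists copies of `cities`, each shuffled independently."""
    rng = random.Random(seed)  # fixed seed so the same lists can be regenerated
    lists = []
    for _ in range(n_lists):
        order = list(cities)   # copy, so the input list is left untouched
        rng.shuffle(order)
        lists.append(order)
    return lists

# Placeholder codes only; the real first group holds 19 city names.
group_a = ["att", "bah", "bha", "cha", "fsl"]
prompt_lists = make_prompt_lists(group_a, n_lists=15)

# One wave file per city per list: 15 lists * 5 cities in this toy example.
total_recordings = sum(len(lst) for lst in prompt_lists)
```

With 19 names in the first group and 16 in the second, the same count gives the 285 and 240 files per speaker mentioned in the text.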

Recorded Data
Recording is carried out simultaneously on a mobile phone and on a microphone, so a laptop, a mobile phone and a microphone are needed. The recording software PRAAT is used, and recording is done at 8 kHz. A noise-free environment is required because noise corrupts the recorded data and makes it useless. Speech files should be stored with the .wav extension.
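Before segmentation it can be useful to confirm that each recording really was saved at 8 kHz. The sketch below uses Python's standard `wave` module. The helper names `sample_rate` and `has_expected_rate` are assumptions made for illustration, and the check only reads the WAV header; it does not assess noise.

```python
import wave

EXPECTED_RATE_HZ = 8000  # sampling rate stated in the text

def sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()

def has_expected_rate(path, expected=EXPECTED_RATE_HZ):
    """True if the file at `path` was saved at the expected sampling rate."""
    return sample_rate(path) == expected
```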

Segmentation of Recorded Data


After the recorded data is saved as wav files, segmentation is needed because each wav file contains one list and each list contains 16 to 19 cities. The lists have to be segmented so that each city is stored in its own file. Most importantly, the naming rules for the segmented data have to be set before segmentation. Because the system is city based and a large number of speakers are required for recording, I recommend naming the wav files in a way that is easy to understand, e.g.

Sp1_m_att1.wav
Sp1_m_bah1.wav
. . .

Sp1 means speaker number 1, m means that the speaker is male, att means District Attock, and 1 means list 1.
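The naming rule can be checked mechanically. The sketch below builds and parses names of the form shown above. The `Utterance` type, the function names and the regular expression are assumptions made for illustration; in particular, the pattern assumes the gender code is `m` or `f` and the city code consists of letters only.

```python
import re
from typing import NamedTuple

class Utterance(NamedTuple):
    speaker: int   # speaker number, e.g. 1 for Sp1
    gender: str    # "m" or "f"
    city: str      # city code, e.g. "att" for Attock
    list_no: int   # number of the list the word was read from

# Speaker, gender, city code, list number; the .wav extension is optional
# so that file IDs without an extension can be parsed too.
_NAME_RE = re.compile(r"^sp(\d+)_([mf])_([a-z]+)(\d+)(?:\.wav)?$", re.IGNORECASE)

def parse_name(filename):
    """Split a name such as Sp1_m_att1.wav into its four fields."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"name does not follow the convention: {filename!r}")
    speaker, gender, city, list_no = match.groups()
    return Utterance(int(speaker), gender.lower(), city.lower(), int(list_no))

def build_name(speaker, gender, city, list_no):
    """Compose a file name from its four fields."""
    return f"Sp{speaker}_{gender}_{city}{list_no}.wav"
```

Rejecting names that do not match the pattern early makes it easier to catch segmentation mistakes before the file lists are generated.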

Preparing Files
Now we need to prepare certain files for Sphinx, which is open source speech recognition software. A speech recognition system requires two types of models: an acoustic model and a language model. The acoustic model is created from audio files and their text transcriptions. The language model is built from the training data. For this purpose 10% of the data is placed in the testing set and the remaining 90% in the training set. The following files are needed for Sphinx:
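One simple way to obtain a 90/10 division is to send every tenth utterance to the test set. This is a sketch only: the function `split_train_test` and the every-tenth rule are assumptions, and the original project may have chosen its test files differently (the file ID examples later in this document place list 1 recordings in the test folder).

```python
def split_train_test(file_ids, test_every=10):
    """Send every `test_every`-th ID to the test set, the rest to training."""
    train, test = [], []
    # Sorting first makes the split reproducible regardless of input order.
    for index, file_id in enumerate(sorted(file_ids)):
        if index % test_every == 0:
            test.append(file_id)
        else:
            train.append(file_id)
    return train, test
```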

Audio files
Audio files are created after segmentation of the recorded data.

Dictionary File
The dictionary file is named an4.dic. It maps each word to its phone sequence. All the words defined in the dataset have to be placed in the dictionary file.

ATTAK	A TT A K
BAHAAVALPUR	B A H AA V A L P U R
B_HAKAR	B_H A K A R
T_SHAKVAAL	T_SH A K V AA L
. . .

Each word is separated from its pronunciation by a tab. The first half of a line is the word defined in the dataset and the second half consists of its phonemes separated by spaces.
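The dictionary layout described above (word, tab, space-separated phones) can be produced from a simple mapping. In this sketch the four entries come from the example in the text, while the names `LEXICON`, `dictionary_lines` and `write_dictionary` are assumptions made for illustration.

```python
# Word -> phone sequence, taken from the four example entries in the text.
LEXICON = {
    "ATTAK": ["A", "TT", "A", "K"],
    "BAHAAVALPUR": ["B", "A", "H", "AA", "V", "A", "L", "P", "U", "R"],
    "B_HAKAR": ["B_H", "A", "K", "A", "R"],
    "T_SHAKVAAL": ["T_SH", "A", "K", "V", "AA", "L"],
}

def dictionary_lines(lexicon):
    """Format each entry as: word, a tab, then phones separated by spaces."""
    return [f"{word}\t{' '.join(phones)}" for word, phones in lexicon.items()]

def write_dictionary(lexicon, path="an4.dic"):
    """Write the formatted entries to `path`, one entry per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(dictionary_lines(lexicon)) + "\n")
```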

Filler File
The filler file is named an4.filler. This file contains the silences that occur in our speech. The silence at the start of an utterance is <s>, the silence at the end of an utterance is </s>, and the silence in the middle of an utterance is <sil>.

<s>	SIL
</s>	SIL
<sil>	SIL

In each of these three entries the silence word is separated from SIL by a tab.

Phone File
The phone file is named an4.phone. It lists every phoneme used by the words defined in the dataset, with no repetition.

A TT K B H AA V L P U R B_H T_SH F . . .
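The phone list can be derived from the dictionary pronunciations rather than typed by hand. The sketch below collects each phone once, in order of first appearance. The function name `unique_phones` is an assumption made for illustration; the four pronunciations are those given for ATTAK, BAHAAVALPUR, B_HAKAR and T_SHAKVAAL.

```python
def unique_phones(pronunciations):
    """Collect every phone once, in order of first appearance."""
    seen = []
    for phones in pronunciations:
        for phone in phones.split():
            if phone not in seen:
                seen.append(phone)
    return seen

PRONUNCIATIONS = [
    "A TT A K",
    "B A H AA V A L P U R",
    "B_H A K A R",
    "T_SH A K V AA L",
]

phones = unique_phones(PRONUNCIATIONS)
# phones == ["A", "TT", "K", "B", "H", "AA", "V", "L", "P", "U", "R", "B_H", "T_SH"]
```

For these four words the result matches the beginning of the phone list shown above.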

Test File IDs


The test file IDs file is named an4_test.fileids. This file contains the file IDs of the files in the test section of your data, e.g.

test/sp1_f_att1
test/sp1_f_bah1
test/sp1_f_bha1
test/sp1_f_cha1
test/sp1_f_fsl1
test/sp1_f_gujr1
test/sp1_f_guj1
. . .

The wave file naming format has been described earlier. The prefix test/ shows that these wave files are located in the test folder.

Test Transcription File


The test transcription file is named an4_test.transcription. This file contains the transcriptions of the wave files listed in the an4_test.fileids file.

<s> ATTAK </s> (sp1_f_att1)
<s> BAHAAVALPUR </s> (sp1_f_bah1)
<s> B_HAKAR </s> (sp1_f_bha1)
<s> T_SHAKVAAL </s> (sp1_f_cha1)
. . .

Because sp1_f_att1 contains the word ATTAK, ATTAK is placed between the starting and ending silences.
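Since each file ID already encodes the city, the transcription lines can be generated from the file IDs. This is a sketch under stated assumptions: the `CITY_WORDS` table covers only the four codes that can be read off the examples above, and the function name `transcription_line` is hypothetical.

```python
import re

# City code -> dictionary word, read off the examples in the text.
CITY_WORDS = {
    "att": "ATTAK",
    "bah": "BAHAAVALPUR",
    "bha": "B_HAKAR",
    "cha": "T_SHAKVAAL",
}

def transcription_line(file_id, city_words=CITY_WORDS):
    """Turn a file ID such as test/sp1_f_att1 into its transcription line."""
    utt_id = file_id.split("/")[-1]                          # drop the folder prefix
    city_code = re.sub(r"\d+$", "", utt_id.split("_")[-1])   # "att1" -> "att"
    return f"<s> {city_words[city_code]} </s> ({utt_id})"
```

The same function works for training IDs, for example `transcription_line("train/sp1_f_att2")`.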

Train File IDs


The train file IDs file is named an4_train.fileids. This file contains the file IDs of the files in the train section of your data, e.g.

train/sp1_f_att2
train/sp1_f_att3
train/sp1_f_att4
train/sp1_f_att5
train/sp1_f_att6
. . .

Train Transcription File


The train transcription file is named an4_train.transcription. This file contains the transcriptions of the wave files listed in the an4_train.fileids file.

<s> ATTAK </s> (sp1_f_att2)
<s> ATTAK </s> (sp1_f_att3)
<s> ATTAK </s> (sp1_f_att4)
<s> ATTAK </s> (sp1_f_att5)
. . .

Corpus
The corpus is the file made from the an4_train.transcription file by removing the wave file names at the end of each line, as shown below.

<s> ATTAK </s>
<s> ATTAK </s>
<s> ATTAK </s>
<s> ATTAK </s>
<s> ATTAK </s>
. . .

This can be done in the PSPad software.
1. Open the an4_train.transcription file in PSPad.
2. Press Ctrl+H, then enable the regular expression option.
3. In the Find field write \(S.*\)
4. Click OK.
5. Rename the file as Corpus.txt.
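The same clean-up can be scripted instead of being done by hand in an editor. The sketch below removes the trailing parenthesised file name from each transcription line. The function names and the regular expression are assumptions made for illustration, not the PSPad pattern quoted above.

```python
import re

# A parenthesised utterance ID at the end of the line, plus surrounding spaces.
_UTT_ID_RE = re.compile(r"\s*\([^()]*\)\s*$")

def strip_utterance_id(line):
    """Remove the trailing '(file_id)' from one transcription line."""
    return _UTT_ID_RE.sub("", line.rstrip("\n"))

def transcription_to_corpus(lines):
    """Convert transcription lines to corpus lines, skipping blank lines."""
    return [strip_utterance_id(line) for line in lines if line.strip()]
```

Writing the returned lines to Corpus.txt, one per line, gives the file expected by the language model step.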

Language Model
This file is named an4.lm. The language model is created from Corpus.txt in the following way:
1. Download the CMU toolkit from the internet.
2. Run the following commands in your terminal:
   a. ./text2wfreq < Corpus.txt > a.wfreq
   b. ./wfreq2vocab < a.wfreq > a.vocab
   c. ./text2idngram -n 3 -vocab a.vocab < Corpus.txt > a.idngram
   d. ./idngram2lm -n 3 -vocab_type 2 -witten_bell -oov_fraction 0.5 -idngram a.idngram -vocab a.vocab -context training.ccs -arpa LanguageModel.arpa
3. A file named LanguageModel.arpa will be created.
4. Copy LanguageModel.arpa to the lm3g2dmp folder and run the following command:
   lm3g2dmp LanguageModel.arpa .\
5. A file named LanguageModel.arpa.DMP will be created.
