Weather System
Introduction
This document describes the data preparation, training, and testing of the Weather System. The system uses the CMU Sphinx toolkit for speech recognition. It is intended to be speaker independent, so a large amount of data from a number of speakers has to be prepared. The Weather System is a large-vocabulary speech system and will deal with a huge amount of data from multiple speakers.
Preparing Data
Define Dataset
The first step is to determine the size of the system's vocabulary; based on this, the dataset for training and testing of the system is prepared. In the case of our system, the dataset consists of the names of 36 cities of Punjab, spoken in isolated form. We concentrate our attention on 19 cities because their weather data is easily available on the internet. A speaker-independent system requires a huge amount of data, so a large number of speakers is needed to record the city names. All the city names are written on 30 different lists in random order: the first 15 lists contain 19 districts and the next 15 lists contain the remaining 16 districts. A single speaker therefore creates a total of 525 wave files (15 × 19 + 15 × 16), of which 285 have been processed so far. At least 30 minutes of recording per speaker is required for this purpose.
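The list-building procedure above can be sketched in Python. The city names below are placeholders (not the real district names), and make_lists is a hypothetical helper used only to illustrate the counts:

```python
import random

# Placeholder city-name groups (hypothetical names, not the real districts)
group_a = [f"CITY_A{i}" for i in range(19)]  # districts on the first 15 lists
group_b = [f"CITY_B{i}" for i in range(16)]  # districts on the next 15 lists

def make_lists(seed=0):
    """Build 30 recording lists: 15 random orderings of each group."""
    rng = random.Random(seed)
    lists = []
    for group in (group_a, group_b):
        for _ in range(15):
            order = group[:]       # copy, so the master list stays sorted
            rng.shuffle(order)     # each list presents the names in random order
            lists.append(order)
    return lists

lists = make_lists()
# 15 lists of 19 names + 15 lists of 16 names = 525 prompts per speaker
total = sum(len(l) for l in lists)
```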
Recorded Data
Recording is carried out simultaneously on a mobile phone and on a microphone, so a laptop, a mobile phone, and a microphone are needed. The recording software PRAAT has been used, and recording is done at 8 kHz. A completely noiseless environment is required for recording, because noise disrupts the recorded data and makes it useless. Speech files should be stored with the .wav extension.
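The 8 kHz requirement can be checked programmatically. The sketch below uses Python's standard wave module; demo.wav and sample_rate are illustrative names only, not part of the system:

```python
import wave

def sample_rate(path):
    """Read the sampling rate of a .wav file using the stdlib wave module."""
    with wave.open(path, "rb") as w:
        return w.getframerate()

# Demo: write one second of 8 kHz mono silence, then verify its rate.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)     # mono
    w.setsampwidth(2)     # 16-bit samples
    w.setframerate(8000)  # 8 kHz, as used for recording
    w.writeframes(b"\x00\x00" * 8000)

rate = sample_rate("demo.wav")
```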
Sp1_m_att1.wav
Sp1_m_bah1.wav
. . .
Here Sp1 means speaker number 1, m means that the speaker's gender is male, att means District Attock, and the trailing 1 means list 1.
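This naming scheme can be parsed mechanically. A minimal sketch, assuming the pattern Sp&lt;speaker&gt;_&lt;gender&gt;_&lt;city&gt;&lt;list&gt;.wav; parse_name is a hypothetical helper, and f for female is an assumption not stated above:

```python
import re

# Hypothetical parser for the naming scheme Sp<N>_<gender>_<city><list>.wav
NAME_RE = re.compile(r"Sp(\d+)_([mf])_([a-z]+)(\d+)\.wav")

def parse_name(filename):
    """Split a recording file name into its speaker/gender/city/list parts."""
    m = NAME_RE.fullmatch(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    return {
        "speaker": int(m.group(1)),  # speaker number, e.g. 1
        "gender": m.group(2),        # m = male (f = female is an assumption)
        "city": m.group(3),          # abbreviated district name, e.g. att
        "list": int(m.group(4)),     # number of the list the word came from
    }

info = parse_name("Sp1_m_att1.wav")
```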
Preparing Files
Now we need to prepare certain files for Sphinx. Sphinx is open-source speech recognition software. A speech recognition system requires two types of models, i.e. an acoustic model and a language model. The acoustic model is created from audio files and their text transcriptions; the language model is created from training text. 10% of the data is placed in the testing set and the remaining 90% in the training set. The following files are needed for Sphinx.
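The 90/10 split can be sketched as follows; split_dataset is a hypothetical helper and the file names are placeholders:

```python
import random

def split_dataset(files, test_fraction=0.10, seed=0):
    """Shuffle the file names and split them ~90% training / ~10% testing."""
    rng = random.Random(seed)
    files = list(files)
    rng.shuffle(files)                                # avoid speaker/list order bias
    n_test = max(1, round(len(files) * test_fraction))
    return files[n_test:], files[:n_test]             # (train, test)

# Placeholder file names, following the naming scheme described earlier
all_files = [f"Sp1_m_att{i}.wav" for i in range(1, 101)]
train, test = split_dataset(all_files)
```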
Audio files
Audio files are created by segmenting the recorded data.
Dictionary File
The dictionary file is named an4.dic. It maps each word defined in the dataset to its phone sequence. The word and its phones are separated by a tab, and the phones are separated from each other by spaces:

ATTAK	A TT A K
BAHAAVALPUR	B A H AA V A L P U R
B_HAKAR	B_H A K A R
T_SHAKVAAL	T_SH A K V AA L
. . .
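Reading such tab-separated entries can be sketched as below; load_dictionary is a hypothetical helper, and the two entries are taken from the examples above:

```python
def load_dictionary(lines):
    """Parse an4.dic-style lines: WORD<TAB>phone phone phone ..."""
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, phones = line.split("\t", 1)  # word and phones separated by a tab
        entries[word] = phones.split()      # phones separated by spaces
    return entries

dic = load_dictionary([
    "ATTAK\tA TT A K",
    "B_HAKAR\tB_H A K A R",
])
```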
Filler File
The filler file is named an4.filler. This file contains the silence markers that are incorporated in our speech. The silence at the start of any utterance is <s>, the silence at the end of any utterance is </s>, and the silence in the middle of any utterance is <sil>:

<s>	SIL
</s>	SIL
<sil>	SIL

Each of these three filler words is separated from its SIL mapping by a tab.
Phone File
The phone file is named an4.phone. It lists all the phonemes of all the words defined in the dataset, one per line, in such a manner that there is no repetition:

A
TT
K
B
H
AA
V
L
P
U
R
B_H
T_SH
F
. . .
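The no-repetition listing can be generated directly from the dictionary. A minimal sketch, assuming the dictionary has already been parsed into word-to-phones form; extract_phones is a hypothetical helper:

```python
def extract_phones(dictionary):
    """Collect every phone used in the dictionary, without repetition,
    in first-seen order -- the contents of an4.phone."""
    seen = []
    for phones in dictionary.values():
        for p in phones:
            if p not in seen:   # keep each phone only once
                seen.append(p)
    return seen

phones = extract_phones({
    "ATTAK": ["A", "TT", "A", "K"],
    "B_HAKAR": ["B_H", "A", "K", "A", "R"],
})
```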
Corpus
Corpus is the file made from the an4_train.transcription file by removing the wave file name present at the end of each line, as shown below:

<s> ATTAK </s>
<s> ATTAK </s>
<s> ATTAK </s>
. . .

This can be done in the PSPad software:
1. Open the an4_train.transcription file in PSPad.
2. Press Ctrl+H, then select the regular expression option.
3. In the Find field, write \(S.*\)
4. Click OK.
5. Rename the file to Corpus.txt.
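As an alternative to the PSPad steps, the same stripping can be sketched in Python. The trailing file IDs in parentheses follow the usual Sphinx transcription convention; the exact IDs shown here are assumptions:

```python
import re

def transcription_to_corpus(lines):
    """Drop the trailing '(file_id)' from each transcription line,
    keeping only the '<s> WORD </s>' part."""
    out = []
    for line in lines:
        # Remove a final parenthesized file ID plus any surrounding whitespace
        out.append(re.sub(r"\s*\(.*\)\s*$", "", line.rstrip("\n")))
    return out

corpus = transcription_to_corpus([
    "<s> ATTAK </s> (sp1_m_att1)",
    "<s> BAHAAVALPUR </s> (sp1_m_bah1)",
])
```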
Language Model
This file is named an4.lm. The language model is created from Corpus.txt in the following way:
1. Download the CMU-Cambridge Statistical Language Modeling toolkit from the internet.
2. Run the following commands in your terminal:
   a. ./text2wfreq < Corpus.txt > a.wfreq
   b. ./wfreq2vocab < a.wfreq > a.vocab
   c. ./text2idngram -n 3 -vocab a.vocab < Corpus.txt > a.idngram
   d. ./idngram2lm -n 3 -vocab_type 2 -witten_bell -oov_fraction 0.5 -idngram a.idngram -vocab a.vocab -context training.ccs -arpa LanguageModel.arpa
3. A file named LanguageModel.arpa will be created.
4. Copy LanguageModel.arpa to the lm3g2dmp folder and run the following command:
5. lm3g2dmp LanguageModel.arpa .\
6. A file named LanguageModel.arpa.DMP will be created.