NLP Programming 01: Unigram Language Model
Language models assign a probability to each sentence. For several candidate outputs of a speech recognizer (W1 … W4):

P(W1) = 4.021 × 10⁻³
P(W2) = 8.932 × 10⁻⁴
P(W3) = 2.432 × 10⁻⁷
P(W4) = 9.124 × 10⁻²³
Incremental Computation
$$P(W) = \prod_{i=1}^{|W|+1} P(w_i \mid w_0 \ldots w_{i-1})$$

Maximum-likelihood estimation from counts:

$$P(w_i \mid w_1 \ldots w_{i-1}) = \frac{c(w_1 \ldots w_i)}{c(w_1 \ldots w_{i-1})}$$
Training data:
i live in osaka . </s>
i am a graduate student . </s>
my school is in nara . </s>

P(live | <s> i) = c(<s> i live) / c(<s> i) = 1 / 2 = 0.5
P(am | <s> i) = c(<s> i am) / c(<s> i) = 1 / 2 = 0.5
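The following is a minimal Python sketch (my own illustration, not part of the tutorial) that reproduces these count-based estimates on the toy corpus:

from collections import defaultdict

# Toy training corpus; <s>/</s> mark sentence boundaries.
training = [
    "<s> i live in osaka . </s>",
    "<s> i am a graduate student . </s>",
    "<s> my school is in nara . </s>",
]

# Count every sentence-initial prefix so that c(w_1 ... w_i) is available.
counts = defaultdict(int)
for sentence in training:
    words = sentence.split()
    for i in range(1, len(words) + 1):
        counts[tuple(words[:i])] += 1

def p_ml(context, word):
    # P(word | context) = c(context word) / c(context)
    return counts[tuple(context) + (word,)] / counts[tuple(context)]

print(p_ml(["<s>", "i"], "live"))  # 1 / 2 = 0.5
print(p_ml(["<s>", "i"], "am"))    # 1 / 2 = 0.5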
Training:
i live in osaka . </s>
i am a graduate student . </s>
my school is in nara . </s>

Test:
<s> i live in nara . </s>

The test sentence contains contexts that never occur in the training data, so the full-history ML estimate assigns it probability zero.
Unigram Model
$$P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i) = \frac{c(w_i)}{\sum_{\tilde{w}} c(\tilde{w})}$$
P(i) = 2/20 = 0.10
P(nara) = 1/20 = 0.05
P(</s>) = 3/20 = 0.15
(counts taken from the training data: i live in osaka . </s> / i am a graduate student . </s> / my school is in nara . </s>)

P(W = "i live in nara . </s>") = 0.10 × 0.05 × 0.10 × 0.05 × 0.15 × 0.15 = 5.625 × 10⁻⁷
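A short Python sketch (my own code, not from the slides) that reproduces the unigram probabilities and the sentence probability above:

from collections import Counter

training = [
    "i live in osaka . </s>",
    "i am a graduate student . </s>",
    "my school is in nara . </s>",
]

counts = Counter(w for line in training for w in line.split())
total = sum(counts.values())              # 20 tokens in total

def p_unigram(w):
    return counts[w] / total

print(p_unigram("i"))     # 2/20 = 0.10
print(p_unigram("nara"))  # 1/20 = 0.05
print(p_unigram("</s>"))  # 3/20 = 0.15

prob = 1.0
for w in "i live in nara . </s>".split():
    prob *= p_unigram(w)
print(prob)               # ≈ 5.625e-07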
$ ./my-program.py 0
$ ./my-program.py 0.5
Save some probability for unknown words (λ_unk = 1 − λ₁). Guess the total vocabulary size N, including unknowns.
$$P(w_i) = \lambda_1 P_{ML}(w_i) + (1 - \lambda_1)\frac{1}{N}$$
Total vocabulary size: N = 10⁶. Unknown word probability: λ_unk = 0.05 (λ₁ = 0.95).
P(nara) = 0.95 × 0.05 + 0.05 × (1/10⁶) = 0.04750005
P(i) = 0.95 × 0.10 + 0.05 × (1/10⁶) = 0.09500005
P(kyoto) = 0.95 × 0.00 + 0.05 × (1/10⁶) = 0.00000005
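The numbers above can be checked with a few lines of Python (a sketch; the helper name is mine):

lambda1 = 0.95
N = 10 ** 6

def p_smoothed(p_ml):
    # P(w) = lambda1 * P_ML(w) + (1 - lambda1) / N
    return lambda1 * p_ml + (1 - lambda1) / N

print(p_smoothed(0.05))  # P(nara)  ≈ 0.04750005
print(p_smoothed(0.10))  # P(i)     ≈ 0.09500005
print(p_smoothed(0.00))  # P(kyoto) ≈ 0.00000005 (unseen word)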
Experimental Setup

Training data (used to estimate the model):
i live in osaka
i am a graduate student
my school is in nara
...

Testing data (used to evaluate the model):
i live in nara
i am a student
i have lots of homework
...
Likelihood
Likelihood is the probability of some observed data (the test set W_test), given the model M:
$$P(W_{\text{test}} \mid M) = \prod_{w \in W_{\text{test}}} P(w \mid M)$$

Multiplying the probabilities of all the test sentences together quickly gives an extremely small number.
Log Likelihood
The likelihood is a product of very small numbers, which leads to underflow. Taking the log resolves this problem:

$$\log P(W_{\text{test}} \mid M) = \sum_{w \in W_{\text{test}}} \log P(w \mid M)$$
e.g. −20.58 + (−18.45) + … = −72.60
Calculating Logs
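The worked example on this slide did not survive extraction; a minimal sketch of how the logs might be computed with Python's standard math module:

import math

p = 5.625e-07                   # P(W) of the toy sentence above
print(math.log(p))              # natural log
print(math.log2(p))             # base-2 log, used for entropy
print(math.log(p, 10))          # base-10 log

# Summing logs instead of multiplying probabilities avoids underflow.
probs = [0.10, 0.05, 0.10, 0.05, 0.15, 0.15]
log_prob = sum(math.log2(x) for x in probs)
print(log_prob, 2 ** log_prob)  # same value as the direct product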
Entropy
Per-word entropy is the average negative log₂ probability of the test set:

$$H(W_{\text{test}} \mid M) = \frac{1}{|W_{\text{test}}|} \sum_{w \in W_{\text{test}}} -\log_2 P(w \mid M)$$

Example test data:
i live in nara
i am a student
my classes are hard
Summing −log₂ P over the words of this test data and dividing by the number of words (12*) gives the per-word entropy H.

* note: we can also count </s> in the number of words (in which case it is 15)
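As an illustration, a Python sketch of the per-word entropy computation; the probabilities here are placeholders, not the values from the slide:

import math

# Hypothetical P(w | M) for each word of the test data (including </s>).
test_word_probs = [0.09, 0.05, 0.10, 0.001, 0.14,
                   0.09, 0.02, 0.03, 0.0005, 0.14,
                   0.001, 0.0001, 0.002, 0.0001, 0.14]

H = sum(-math.log2(p) for p in test_word_probs) / len(test_word_probs)
print("per-word entropy:", H)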
Entropy H is also the average number of bits needed to encode information (Shannon's information theory).
Encoding: [figure showing each word encoded as a string of bits]
Perplexity
Perplexity is two to the power of the per-word entropy:

$$\text{PPL} = 2^H$$

(Mainly because it makes more impressive numbers.) For a uniform distribution, perplexity is equal to the vocabulary size:

$$V = 5 \qquad H = -\log_2 \frac{1}{5} = \log_2 5 \qquad \text{PPL} = 2^H = 2^{\log_2 5} = 5$$
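A quick numerical check of the uniform case (my own sketch):

import math

V = 5
p_uniform = 1 / V
H = -math.log2(p_uniform)   # per-word entropy = log2(5) ≈ 2.32
print(2 ** H)               # perplexity ≈ 5, the vocabulary size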
Coverage
The percentage of known words in the corpus. For example, for the corpus

a bird a cat a dog a </s>

if "dog" is an unknown word, the coverage is 7/8.
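A tiny sketch of the coverage calculation for this example (the known vocabulary is assumed, since the training data is not shown here):

test = "a bird a cat a dog a </s>".split()
known_vocab = {"a", "bird", "cat", "</s>"}   # "dog" was never seen in training

known = sum(1 for w in test if w in known_vocab)
print(known / len(test))                     # 7/8 = 0.875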
Exercise
train-unigram: creates a unigram model
test-unigram: reads a unigram model and calculates entropy and coverage for the test set
Test them on test/01-train-input.txt and test/01-test-input.txt. Train the model on data/wiki-en-train.word. Calculate entropy and coverage on data/wiki-en-test.word. Report your scores next week.
train-unigram Pseudo-Code
create a map counts
create a variable total_count = 0
for each line in the training_file
    split line into an array of words
    append "</s>" to the end of words
    for each word in words
        add 1 to counts[word]
        add 1 to total_count
open the model_file for writing
for each word, count in counts
    probability = counts[word] / total_count
    print word, probability to model_file
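One possible Python implementation of this pseudo-code (a sketch, not the reference solution; file handling and sorting are my own choices):

import sys
from collections import defaultdict

def train_unigram(training_file, model_file):
    counts = defaultdict(int)
    total_count = 0
    with open(training_file, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            words.append("</s>")           # also count the end-of-sentence symbol
            for word in words:
                counts[word] += 1
                total_count += 1
    with open(model_file, "w", encoding="utf-8") as out:
        for word, count in sorted(counts.items()):
            probability = count / total_count
            print(word, probability, file=out)

if __name__ == "__main__":
    train_unigram(sys.argv[1], sys.argv[2])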
test-unigram Pseudo-Code
λ₁ = 0.95, λ_unk = 1 − λ₁, V = 1000000, W = 0, H = 0
Load Model
create a map probabilities
for each line in model_file
    split line into w and P
    set probabilities[w] = P
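The rest of the pseudo-code (reading the test file and accumulating entropy and coverage) did not survive extraction; a sketch of the complete program, based on the interpolation, entropy, and coverage formulas from the earlier slides, might look like this:

import math
import sys

def test_unigram(model_file, test_file, lambda1=0.95, V=1000000):
    lambda_unk = 1 - lambda1

    # Load the model: one "word probability" pair per line.
    probabilities = {}
    with open(model_file, encoding="utf-8") as f:
        for line in f:
            w, p = line.split()
            probabilities[w] = float(p)

    W = 0      # number of test words
    H = 0.0    # accumulated negative log2 probability
    unk = 0    # number of unknown words
    with open(test_file, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            words.append("</s>")
            for w in words:
                W += 1
                p = lambda_unk / V
                if w in probabilities:
                    p += lambda1 * probabilities[w]
                else:
                    unk += 1
                H += -math.log2(p)

    print("entropy =", H / W)
    print("coverage =", (W - unk) / W)

if __name__ == "__main__":
    test_unigram(sys.argv[1], sys.argv[2])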
Thank You!