
NLP Programming Tutorial 1 Unigram Language Models

Graham Neubig, Nara Institute of Science and Technology (NAIST)


Language Model Basics


Why Language Models?

We have an English speech recognition system. For an input speech signal, which answer is better?

W1 = speech recognition system
W2 = speech cognition system
W3 = speck podcast histamine
W4 = (a transcription in Japanese)

Language models tell us the answer!



Probabilistic Language Models

Language models assign a probability to each sentence:

P(W1) = 4.021 × 10^-3
P(W2) = 8.932 × 10^-4
P(W3) = 2.432 × 10^-7
P(W4) = 9.124 × 10^-23

W1 = speech recognition system
W2 = speech cognition system
W3 = speck podcast histamine
W4 = (a transcription in Japanese)

We want P(W1) > P(W2) > P(W3) > P(W4)

(or P(W4) > P(W1), P(W2), P(W3) for Japanese?)


Calculating Sentence Probabilities

We want the probability of

W = speech recognition system

Represent this mathematically as:

P(|W| = 3, w_1 = "speech", w_2 = "recognition", w_3 = "system")

Using the chain rule, this becomes:

P(|W| = 3, w_1 = "speech", w_2 = "recognition", w_3 = "system") =
  P(w_1 = "speech" | w_0 = "<s>")
  × P(w_2 = "recognition" | w_0 = "<s>", w_1 = "speech")
  × P(w_3 = "system" | w_0 = "<s>", w_1 = "speech", w_2 = "recognition")
  × P(w_4 = "</s>" | w_0 = "<s>", w_1 = "speech", w_2 = "recognition", w_3 = "system")

NOTE: sentence start <s> and end </s> symbols are added

NOTE: P(w_0 = "<s>") = 1


Incremental Computation

The previous equation can be written as:

P(W) = ∏_{i=1}^{|W|+1} P(w_i | w_0 … w_{i-1})

How do we decide the probability

P(w_i | w_0 … w_{i-1}) ?


Maximum Likelihood Estimation

Count word strings in the corpus and take the fraction:

P(w_i | w_1 … w_{i-1}) = c(w_1 … w_i) / c(w_1 … w_{i-1})

Training corpus:
i live in osaka . </s>
i am a graduate student . </s>
my school is in nara . </s>

P(live | <s> i) = c(<s> i live) / c(<s> i) = 1 / 2 = 0.5
P(am | <s> i) = c(<s> i am) / c(<s> i) = 1 / 2 = 0.5
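As a quick illustration (a sketch, not part of the original slides; the helper name p_ml is my own), the counts c(·) can be collected over sentence prefixes:

from collections import defaultdict

# Toy corpus from the slide, with sentence-start/end symbols attached.
corpus = [
    "<s> i live in osaka . </s>",
    "<s> i am a graduate student . </s>",
    "<s> my school is in nara . </s>",
]

# c(w_1 ... w_i): count every sentence prefix.
counts = defaultdict(int)
for line in corpus:
    words = line.split()
    for i in range(1, len(words) + 1):
        counts[tuple(words[:i])] += 1

def p_ml(word, history):
    # P(word | history) = c(history + word) / c(history)
    return counts[history + (word,)] / counts[history]

print(p_ml("live", ("<s>", "i")))  # 1 / 2 = 0.5
print(p_ml("am", ("<s>", "i")))    # 1 / 2 = 0.5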


Problem with Full Estimation

Weak when counts are low.

Training:
i live in osaka . </s>
i am a graduate student . </s>
my school is in nara . </s>

Test:
<s> i live in nara . </s>

P(nara | <s> i live in) = 0/1 = 0
P(W = <s> i live in nara . </s>) = 0

Unigram Model

Do not use history:

P(w_i | w_1 … w_{i-1}) ≈ P(w_i) = c(w_i) / Σ_w' c(w')

Training corpus:
i live in osaka . </s>
i am a graduate student . </s>
my school is in nara . </s>

P(nara) = 1/20 = 0.05
P(i) = 2/20 = 0.10
P(</s>) = 3/20 = 0.15

P(W = i live in nara . </s>) = 0.10 × 0.05 × 0.10 × 0.05 × 0.15 × 0.15 = 5.625 × 10^-7
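A minimal sketch of the same computation in Python (the corpus literals are the toy sentences above; variable names are illustrative):

from collections import Counter

corpus = [
    "i live in osaka . </s>",
    "i am a graduate student . </s>",
    "my school is in nara . </s>",
]

# P(w) = c(w) / total token count
counts = Counter(w for line in corpus for w in line.split())
total = sum(counts.values())  # 20 tokens

p = {w: c / total for w, c in counts.items()}
print(p["nara"], p["i"], p["</s>"])  # 0.05 0.1 0.15

prob = 1.0
for w in "i live in nara . </s>".split():
    prob *= p.get(w, 0.0)  # unseen words get probability 0 -- the problem addressed below
print(prob)  # ≈ 5.625e-07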


"e 8areful of Integers/

Fi%ide t)o integers( you get an integer (rounded do)n!

$ ./my-program.py 0

8on%ert one integer to a float( and you )ill be @H

$ ./my-program.py 0.5

12
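The my-program.py behind this output is not shown on the slide; a plausible sketch of the pitfall, assuming it divides 1 by 2 (in Python 2, / on two ints already rounds down; in Python 3, // does):

print(1 // 2)        # 0   -- integer division, rounded down
print(float(1) / 2)  # 0.5 -- converting one operand to float is OK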


What about Unknown Words?!

Simple ML estimation doesn't work:

i live in osaka . </s>
i am a graduate student . </s>
my school is in nara . </s>

P(nara) = 1/20 = 0.05
P(i) = 2/20 = 0.10
P(kyoto) = 0/20 = 0

Often, unknown words are simply ignored (e.g., in ASR). A better way to solve this:
- Save some probability for unknown words (λ_unk = 1 - λ_1)
- Guess the total vocabulary size N, including unknowns

P(w_i) = λ_1 P_ML(w_i) + (1 - λ_1) × (1/N)


Unknown Word Example

Total vocabulary size: N = 10^6
Unknown word probability: λ_unk = 0.05 (λ_1 = 0.95)

P(w_i) = λ_1 P_ML(w_i) + (1 - λ_1) × (1/N)

P(nara) = 0.95 × 0.05 + 0.05 × (1/10^6) = 0.04750005
P(i) = 0.95 × 0.10 + 0.05 × (1/10^6) = 0.09500005
P(kyoto) = 0.95 × 0.00 + 0.05 × (1/10^6) = 0.00000005
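The same arithmetic as a short sketch (p_ml here just hard-codes the toy ML estimates from earlier):

lambda_1 = 0.95
N = 10 ** 6  # guessed total vocabulary size, including unknowns

p_ml = {"nara": 0.05, "i": 0.10}  # toy ML estimates; "kyoto" is unseen

def p(word):
    # P(w) = lambda_1 * P_ML(w) + (1 - lambda_1) * (1 / N)
    return lambda_1 * p_ml.get(word, 0.0) + (1 - lambda_1) / N

print(p("nara"))   # ≈ 0.04750005
print(p("i"))      # ≈ 0.09500005
print(p("kyoto"))  # ≈ 5e-08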


Evaluating Language Models



Experimental Setup

Use separate training and test sets:

Training Data:
i live in osaka
i am a graduate student
my school is in nara
...

Testing Data:
i live in nara
i am a student
i have lots of homework
...

Train the model on the training data, then test it on the testing data and measure model accuracy: likelihood, log likelihood, entropy, perplexity.

Likelihood

Likelihood is the probability of some observed data (the test set W_test), given the model M:

P(W_test | M) = ∏_{w ∈ W_test} P(w | M)

Test set:
i live in nara
i am a student
my classes are hard

P(w = "i live in nara" | M) = 2.52 × 10^-21
P(w = "i am a student" | M) = 3.48 × 10^-19
P(w = "my classes are hard" | M) = 2.15 × 10^-34

Product: 1.89 × 10^-73

Log Likelihood

Likelihood uses very small numbers → underflow. Taking the log resolves this problem:

log P(W_test | M) = Σ_{w ∈ W_test} log P(w | M)

Test set:
i live in nara
i am a student
my classes are hard

log P(w = "i live in nara" | M) = -20.58
log P(w = "i am a student" | M) = -18.45
log P(w = "my classes are hard" | M) = -33.67

Sum: -20.58 + -18.45 + -33.67 = -72.70
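To see why the log matters, compare the raw product with the log sum (sentence probabilities taken from the likelihood slide; on a realistically sized test set the raw product would underflow to 0.0):

import math

sentence_probs = [2.52e-21, 3.48e-19, 2.15e-34]

likelihood = 1.0
log_likelihood = 0.0
for prob in sentence_probs:
    likelihood *= prob                  # shrinks toward underflow
    log_likelihood += math.log10(prob)  # stays in a safe numeric range

print(likelihood)      # ≈ 1.89e-73
print(log_likelihood)  # ≈ -72.7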


Calculating Logs

Python's math package has a function for logs:

$ ./my-program.py
4.60517018599
2.0
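The program itself is not reproduced on the slide; one sketch that produces exactly this output (math.log with one argument is the natural log; with two arguments, the second is the base):

import math

print(math.log(100))      # 4.60517018599... (natural log)
print(math.log(100, 10))  # 2.0 (base-10 log)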


Entropy

Entropy H is the average negative log2 likelihood per word:

H(W_test | M) = (1/|W_test|) Σ_{w ∈ W_test} -log2 P(w | M)

Test set:
i live in nara
i am a student
my classes are hard

-log2 P(w = "i live in nara" | M) = 68.43
-log2 P(w = "i am a student" | M) = 61.32
-log2 P(w = "my classes are hard" | M) = 111.84
# of words = 12

H = (68.43 + 61.32 + 111.84) / 12 = 20.13

* Note: we can also count </s> in the # of words (in which case it is 15)
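As a sketch, the per-word entropy above is just this average (values hard-coded from the slide):

neg_log2_probs = [68.43, 61.32, 111.84]  # -log2 P(w|M) for each test sentence
num_words = 12                           # 15 if </s> is also counted

H = sum(neg_log2_probs) / num_words
print(H)  # ≈ 20.13 bits per word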


Entropy and Compression

Entropy H is also the average number of bits needed to encode information (Shannon's information theory).

Corpus: a bird a cat a dog a </s>

P(w = "a") = 0.5, so -log2 0.5 = 1 bit
P(w = "bird") = 0.125, so -log2 0.125 = 3 bits
P(w = "cat") = 0.125, so -log2 0.125 = 3 bits
P(w = "dog") = 0.125, so -log2 0.125 = 3 bits
P(w = "</s>") = 0.125, so -log2 0.125 = 3 bits

H = (1/|W_test|) Σ_{w ∈ W_test} -log2 P(w | M) = (4 × 1 + 4 × 3) / 8 = 2 bits per word

A matching prefix code (e.g., a → 0, bird → 100, cat → 101, dog → 110, </s> → 111) encodes the whole corpus in 16 bits: 0 100 0 101 0 110 0 111

Perplexity

Equal to two to the power of the per-word entropy:

PPL = 2^H

(Mainly because it makes more impressive numbers.)

For a uniform distribution, perplexity equals the vocabulary size:

V = 5, H = -log2(1/5) = log2 5

PPL = 2^H = 2^(log2 5) = 5
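A one-line check of the uniform case:

import math

H = -math.log2(1 / 5)  # per-word entropy of a uniform 5-word distribution
print(2 ** H)          # ≈ 5.0 -- the vocabulary size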


Coverage

The percentage of known words in the corpus.

Corpus: a bird a cat a dog a </s>

If "dog" is an unknown word, coverage is 7/8 *

* We often omit the sentence-final symbol, giving 6/7


Exercise

Write two programs:
- train-unigram: creates a unigram model
- test-unigram: reads a unigram model and calculates entropy and coverage for the test set

Test them on test/01-train-input.txt and test/01-test-input.txt.
Train the model on data/wiki-en-train.word.
Calculate entropy and coverage on data/wiki-en-test.word.
Report your scores next week.


train-unigram Pseudo-Code

create a map counts
create a variable total_count = 0
for each line in the training_file
    split line into an array of words
    append "</s>" to the end of words
    for each word in words
        add 1 to counts[word]
        add 1 to total_count
open the model_file for writing
for each word, count in counts
    probability = counts[word] / total_count
    print word, probability to model_file
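A direct Python 3 translation, offered as a sketch (the function name, file handling, and sorting are my choices, not part of the pseudo-code):

import sys
from collections import defaultdict

def train_unigram(training_file, model_file):
    counts = defaultdict(int)
    total_count = 0
    with open(training_file) as f:
        for line in f:
            words = line.split()
            words.append("</s>")  # count the sentence-end symbol too
            for word in words:
                counts[word] += 1
                total_count += 1
    with open(model_file, "w") as f:
        for word, count in sorted(counts.items()):
            probability = count / total_count
            print(word, probability, file=f)

if __name__ == "__main__":
    train_unigram(sys.argv[1], sys.argv[2])

Run as, e.g., python train-unigram.py test/01-train-input.txt model.txt (the model file name is arbitrary).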


test-unigram Pseudo-Code

λ_1 = 0.95, λ_unk = 1 - λ_1, V = 1000000, W = 0, H = 0

Load Model:
create a map probabilities
for each line in model_file
    split line into w and P
    set probabilities[w] = P

Test and Print:
for each line in test_file
    split line into an array of words
    append "</s>" to the end of words
    for each w in words
        add 1 to W
        set P = λ_unk / V
        if probabilities[w] exists
            set P += λ_1 × probabilities[w]
        else
            add 1 to unk
        add -log2 P to H
print "entropy = " + H/W
print "coverage = " + (W - unk)/W
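And a matching sketch for the tester, following the pseudo-code above (λ_1 = 0.95 and V = 1,000,000 as given):

import math
import sys

def test_unigram(model_file, test_file, lambda_1=0.95, V=1000000):
    lambda_unk = 1 - lambda_1

    # Load Model
    probabilities = {}
    with open(model_file) as f:
        for line in f:
            w, p = line.split()
            probabilities[w] = float(p)

    # Test and Print
    W = 0      # total words in the test set
    H = 0.0    # accumulated negative log2 probability
    unk = 0    # number of unknown words
    with open(test_file) as f:
        for line in f:
            words = line.split()
            words.append("</s>")
            for w in words:
                W += 1
                P = lambda_unk / V
                if w in probabilities:
                    P += lambda_1 * probabilities[w]
                else:
                    unk += 1
                H += -math.log2(P)

    print("entropy =", H / W)
    print("coverage =", (W - unk) / W)

if __name__ == "__main__":
    test_unigram(sys.argv[1], sys.argv[2])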


Thank You!

