This document discusses conditional random fields (CRFs) and how to use the MALLET toolkit to build CRF models for labeling sequence data. It begins with an overview of hidden Markov models and their limitations, then introduces CRFs as undirected graphical models that allow arbitrary overlapping features without independence assumptions. The document explains how MALLET's SimpleTagger implements CRFs and demonstrates training a model to label parts of speech using example sentence features.


Using MALLET for Conditional Random Fields

Matthew Michelson & Craig A. Knoblock, CSCI 548, Lecture 3

The road to CRFs


In the beginning: generative models, which model the joint probability of X and Y, P(X,Y). Markov assumption: the probability of the current state depends only on the previous state. Standard model: the Hidden Markov Model (HMM).

Markov Process
Let's say we're independent of time; then we can define aij = P(qt = Sj | qt-1 = Si) as a STATE TRANSITION from Si to Sj, with aij >= 0.

This conserves all of the probability mass; i.e., all outgoing probabilities from a state sum to 1: Σj aij = 1.

Markov Process

Two more terms to define: πi = P(q1 = Si) = probability that we start in state Si; bj(k) = P(k | qt = Sj) = probability of observation symbol k in state j. So, let's say the symbols are {A, B}; then we could have something like b1(A) = P(A | S1), i.e., what is the probability that we output A in state 1?

Hidden Markov Model


A Hidden Markov Model (HMM)


Set of states, set of ai,j, set of πi, set of bj(k)

Training: learn, from a set of observation sequences, the transition and emission probabilities.
Decoding: when testing, input comes in and is fit to the model's internal observations with some probability; output the best state-transition sequence to produce the input observations.
Hidden: you can observe the sequence of emissions, but you do not know what state the model is in. If 2 states output "yes", all I see is "yes"; I have no idea which state or set of states produced it!
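As a minimal sketch (not part of the slides), the four components above can be written down as plain Python data; the state names, symbols, and values here are just placeholders:

from typing import Dict, NamedTuple, Tuple

class HMM(NamedTuple):
    states: Tuple[str, ...]              # S, the set of states
    start: Dict[str, float]              # pi_i  = P(q1 = S_i)
    trans: Dict[Tuple[str, str], float]  # a_ij  = P(q_t = S_j | q_t-1 = S_i)
    emit: Dict[Tuple[str, str], float]   # b_j(k) = P(symbol k | q_t = S_j)

hmm = HMM(states=("S1", "S2"),
          start={"S1": 0.5, "S2": 0.5},
          trans={("S1", "S1"): 0.25, ("S1", "S2"): 0.75,
                 ("S2", "S1"): 0.30, ("S2", "S2"): 0.70},
          emit={("S1", "A"): 0.9, ("S1", "B"): 0.1,
                ("S2", "A"): 0.4, ("S2", "B"): 0.6})

# Sanity check: all outgoing transition probabilities from a state sum to 1
for i in hmm.states:
    assert abs(sum(hmm.trans[(i, j)] for j in hmm.states) - 1.0) < 1e-9

The assert at the end just re-checks the conservation-of-mass property from the previous slide.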

HMM - Example
Urn and Ball Model: each urn has a large number of balls in M distinct colors. Randomly pick an urn, pick out a colored ball, note its color, and repeat. S = set of states = set of urns. Transition probs = choice of next urn. bi(color) = prob. of drawing that colored ball from urn i.

Urn and Ball Problem

Urn and Ball Example


Let's say we have the following: 2 urns, 2 colors (Red, Blue)
a1,1 = 0.25, a1,2 = 0.75
a2,1 = 0.3, a2,2 = 0.7
b1(Red) = 0.9, b1(Blue) = 0.1
b2(Red) = 0.4, b2(Blue) = 0.6

Urn and Ball Example

Let's say it's perfectly random to start with either urn, i.e. π1 = π2 = 0.5. What is the most probable state sequence that produces {Red, Red, Blue}?

Urn and Ball Example

We will use the Viterbi algorithm to do this, recursively. Define δt(i) = max over q1,...,qt-1 of P[q1, q2, ..., qt = Si, O1, O2, ..., Ot | HMM model] (remember, qt = current state, the O's are observations). So, δt+1(j) = [maxi δt(i) * ai,j] * bj(Ot+1)

Urn and Ball Example

We need a first set of initialized values: δ1(i) = πi * bi(O1 = Red), for i = 1, 2
δ1(1) = π1 * b1(O1 = Red) = 0.5 * 0.9 = 0.45
δ1(2) = π2 * b2(O1 = Red) = 0.5 * 0.4 = 0.2

Urn and Ball Example

Now, recurse:
δ2(1) = max( {δ1(1)*a1,1, δ1(2)*a2,1} ) * b1(O2 = Red) = max( {0.45*0.25, 0.2*0.3} ) * 0.9 = 0.10125
δ2(2) = max( {δ1(1)*a1,2, δ1(2)*a2,2} ) * b2(O2 = Red) = max( {0.45*0.75, 0.2*0.7} ) * 0.4 = 0.135

Urn and Ball Example

Now, recurse:
δ3(1) = max( {δ2(1)*a1,1, δ2(2)*a2,1} ) * b1(O3 = Blue) = max( {0.10125*0.25, 0.135*0.3} ) * 0.1 = 0.00405
δ3(2) = max( {δ2(1)*a1,2, δ2(2)*a2,2} ) * b2(O3 = Blue) = max( {0.10125*0.75, 0.135*0.7} ) * 0.6 = 0.0567

Urn and Ball Example

So, we see that at each step the maximum is: δ3(2) = 0.0567, δ2(2) = 0.135, δ1(1) = 0.45. So, working backwards, we know the state transitions went Urn 2, Urn 2, Urn 1 (forwards: Urn 1 -> Urn 2 -> Urn 2). So, if we are given the observation {Red, Red, Blue}, we say that the most probable state-transition sequence is {Start in Urn 1/Red, go to Urn 2/Red, stay in Urn 2/Blue}.
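A minimal sketch (not from the slides) that re-runs this Viterbi calculation in plain Python, just to confirm the δ values and the recovered urn sequence:

# Viterbi decoding for the urn-and-ball example above (plain Python, no MALLET)
states = [1, 2]                                   # urn 1, urn 2
start = {1: 0.5, 2: 0.5}                          # pi_i
trans = {(1, 1): 0.25, (1, 2): 0.75,              # a_ij
         (2, 1): 0.30, (2, 2): 0.70}
emit = {(1, "Red"): 0.9, (1, "Blue"): 0.1,        # b_i(color)
        (2, "Red"): 0.4, (2, "Blue"): 0.6}
obs = ["Red", "Red", "Blue"]

# delta[t][j] = probability of the best state sequence ending in state j at time t
delta = [{j: start[j] * emit[(j, obs[0])] for j in states}]
back = [{}]
for t in range(1, len(obs)):
    delta.append({})
    back.append({})
    for j in states:
        best_i = max(states, key=lambda i: delta[t - 1][i] * trans[(i, j)])
        back[t][j] = best_i
        delta[t][j] = delta[t - 1][best_i] * trans[(best_i, j)] * emit[(j, obs[t])]

# Backtrack from the most probable final state
path = [max(states, key=lambda j: delta[-1][j])]
for t in range(len(obs) - 1, 0, -1):
    path.insert(0, back[t][path[0]])

print(delta)  # [{1: 0.45, 2: 0.2}, {1: 0.10125, 2: 0.135}, {1: 0.00405, 2: 0.0567}]
print(path)   # [1, 2, 2]  ->  start in Urn 1, go to Urn 2, stay in Urn 2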

HMM Issues
1. Independence Assumption: the current observation only depends on what state you are in right now. Or, to say it differently, the current output has no dependence on previous outputs. For our urn example, we couldn't model the fact that if urn 1 outputs a red ball, then urn 2 should decrease its probability of doing so.

HMM Issues
2. Multiple Features Issue: an HMM generates a set of probabilities given an observation. But what if you want to capture many features from an observation, and these features interact? E.g., the observation is "Doug". This is a noun, capitalized, and masculine. Now, what if the transition is into state = MAN? We know that state MAN probably depends on the observation's "noun" and "capitalized" features. But what if we have a state CITY too? Doesn't that depend on "noun" and "capitalized" as well? Transferring into MAN might require a masculine name; that transition strongly depends on the word having the feature "masculine".

HMM Issues
3. An abundance of training data for one state has no effect on the others.

Hidden Markov Model


[Diagram: a chain of hidden states Yi-1 -> Yi -> Yi+1 (transitions), each state emitting an observation Xi-1, Xi, Xi+1.]

P(X, Y) = Πi P(Xi | Yi) P(Yi | Yi-1)

But how do we model this?


[Diagram: the same chain Yi-1 -> Yi -> Yi+1, but the observation Xi now carries several DEPENDENT FEATURES at once: it is a noun, it is "Doug", and it is capitalized.]

Choice #1: Model all dependencies


[Diagram: each feature of the observation ("is Doug", capitalized, noun) given its own dependency edge to the states.]

Grows infeasible. Need LOTS of training data.

Choice #2: Ignore dependencies


[Diagram: the same features with their dependencies on one another simply dropped.]

Not really a solution.

Conditional Model

We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).

Allow arbitrary, non-independent features on the observation sequence X


Examine features, but don't generate them. (There is no directed transition from a state to an output.) We don't have to explicitly model their dependencies.

"Conditionally trained" means: given a set of observations (the input), what is the most likely set of labels (states, nodes in the graph) that the model has been trained to traverse given this input?

Maximum Entropy Markov Models (MEMMs)


Exponential model. Given a training set X with label sequence Y: train a model Λ that maximizes P(Y|X, Λ). For a new data sequence x, the predicted label sequence y maximizes P(y|x, Λ).

[Diagram: MEMM structure: the next state Yi+1 depends on the previous state Yi and the observation Xi+1.]

MEMMs (cont'd)

MEMMs have all the advantages of conditional models

Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (conservation of score mass). Subject to the Label Bias Problem.
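As a rough sketch of what "per-state exponential model" means (the notation below is this writeup's own, not copied from the slides), each transition is scored as

P(yi | yi-1, x) = exp( Σk λk fk(yi, yi-1, x) ) / Z(yi-1, x)

where Z(yi-1, x) sums the exponential over every possible next label yi. Because Z is computed separately at each state, each state must hand out exactly probability 1 to its successors; this is the per-state conservation of mass that produces the label bias problem described next.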

Label Bias Problem


Consider this MEMM:

P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)

Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
In the training data, label value 2 is the only label value observed after label value 1.
Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x.
However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
Per-state normalization does not allow the required expectation.

Another view of Label Bias

Conditional Random Fields (CRFs)


CRFs have all the advantages of MEMMs without the label bias problem

An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state; a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence

Undirected acyclic graph. Allows some transitions to vote more strongly than others, depending on the corresponding observations

Random Field: what it looks like

CRF: what it looks like

CRF: the guts

CRF: defined
We make feature functions to define features; they are not generated by the model (unlike the X's of an HMM)

CRF

Now we have Pr(label|obs.,model)


Find the most probable label sequence (y's), given an observation sequence (x's). No more independence assumption.

Conditionally trained for a whole label sequence given an input sequence (so long-range and multi-feature dependencies are reflected by this)

Example of a feature function (y's are labels, x's are input observations)
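The slide's feature-function figure is not reproduced here, so as an illustration (the feature, label names, and notation are invented for this sketch, not taken from the slides), a binary feature function over (previous label, current label, input sequence, position) might look like:

def f_cap_noun(y_prev, y_curr, x, i):
    # Fires (returns 1) when the current word is capitalized and labeled "noun"
    return 1 if x[i][0].isupper() and y_curr == "noun" else 0

A linear-chain CRF then scores a whole label sequence y for an input x as P(y|x) = (1/Z(x)) * exp( Σi Σk λk fk(yi-1, yi, x, i) ), with a single normalizer Z(x) over all complete label sequences, in contrast to the per-state Z of the MEMM above.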

MALLET

Machine learning toolkit specifically for language tasks. Developed at UMass by Andrew McCallum and his group. For our purposes, we will use the SimpleTagger class, which implements Conditional Random Fields.

Getting MALLET to work


1. Install Cygwin (HW Instructions)
2. Install Ant (HW Inst.)
3. Install MALLET (HW Inst.)
4. Train/Test/Label with SimpleTagger

SimpleTagger

Training
Each line is of the form: <feature1> <feature2> ... <featureN> <label>

Let's start with an example of a sentence: "Los Angeles is a great city!" We want to find all nouns, like the example in http://mallet.cs.umass.edu/index.php/SimpleTagger_example

Training CRFs
Let's say we have some tools that can identify features:
Colors: a list of colors
Regex: apostrophe finder
Regex: capitalized
Stop-Words: common tokens such as a, the, etc. (not "etc." the word)

The red bear's favorite color is green?

(Feature types: STOPWORD, CAPITALIZED, APOS, COLOR)

Training CRFs
GOAL: Find NOUNS
LABELED INPUTS:
The SW CAP not-noun
red COLOR not-noun
bear's APOS noun

Note: In SimpleTagger, the default ignore label is "O" (used in the HW).

The red bear's favorite color is green?

(Feature types: STOPWORD, CAPITALIZED, APOS, COLOR)
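Putting the pieces together, a plausible TrainingData.txt for this sentence might look like the lines below; the feature names and the noun/not-noun labels are this example's own choices (SimpleTagger only requires that each line list features first and the label last):

The SW CAP not-noun
red COLOR not-noun
bear's APOS noun
favorite not-noun
color noun
is SW not-noun
green COLOR not-noun
? not-noun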

Train SimpleTagger

java -cp "class;lib/mallet-deps.jar" edu.umass.cs.mallet.base.fst.SimpleTagger --train true --model-file SAVEDMODEL TrainingData.txt

Labeling with SimpleTagger


Once you have a trained model, you can re-use it to label new data!

java -cp "class;lib/mallet-deps.jar" edu.umass.cs.mallet.base.fst.SimpleTagger --include-input true --model-file SAVEDMODEL NotLabeledText.txt > LabeledOutput.txt
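For illustration (the contents here are made up, following the feature-then-label convention above but with the label column left off, since that is what the model is asked to predict), NotLabeledText.txt might look like:

Seattle CAP
is SW
a SW
rainy
city

SimpleTagger should then write a predicted label for each token (and, with --include-input true, echo the input features) into LabeledOutput.txt.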

CRFs and MALLET


Have fun!
