Mallet Tutorial

This document provides an overview of MALLET (Machine Learning for Language Toolkit), an open-source Java-based machine learning library for natural language processing. It was created by researchers at the University of Massachusetts and is maintained by David Mimno. MALLET can be used for text classification, sequence tagging, and topic modeling. It represents documents as feature vectors and provides tools for classification, training models from labeled instances, and evaluating models.


License:

CC BY 4.0

Machine Learning with MALLET

http://mallet.cs.umass.edu/mallet-tutorial.pdf

David Mimno
Department of Information Science,
Cornell University

Outline

About MALLET
Representing Data
Classification
Sequence Tagging
Topic Modeling

About MALLET

Who?
Andrew McCallum (most of the work)
Charles Sutton, Aron Culotta, Greg Druck, Kedar Bellare, Gaurav Chandalia
Fernando Pereira, others at Penn

Who am I?
Chief maintainer of MALLET
Primary author of MALLET topic modeling package

Why?
Motivation: text classification and information extraction
Commercial machine learning (Just Research, WhizBang)
Analysis and indexing of academic publications: Cora, Rexa

What?
Text focus: data is discrete rather than continuous, even when values could be continuous:
double value = 3.0

How?
Command line scripts:
bin/mallet [command] --[option] [value]
Text User Interface ("tui") classes
Direct Java API
http://mallet.cs.umass.edu/api
Most of this talk

History
Version 0.4: c. 2004
Classes in edu.umass.cs.mallet.base.*
Version 2.0: c. 2008
Classes in cc.mallet.*
Major changes to finite state transducer package
bin/mallet vs. specialized scripts
Java 1.5 generics

Learning More
http://mallet.cs.umass.edu
"Quick Start" guides, focused on command line processing
Developers' guides, with Java examples
[email protected] mailing list
Low volume, but can be bursty

Representing Data

Models for Text Data
Generative models (Multinomials)
    Naïve Bayes
    Hidden Markov Models (HMMs)
    Latent Dirichlet Topic Models
Discriminative Regression Models
    MaxEnt/Logistic regression
    Conditional Random Fields (CRFs)

Representations
Transform text documents to vectors x1, x2, ...
Retain meaning of vector indices
Ideally sparsely

[Figure: the document "Call me Ishmael." mapped to a vector xi with entries such as 1.0, 0.0, 0.0, 6.0, 0.0, 3.0]

Representations
Elements of vector are called feature values
Example: Feature at row 345 is number of times "dog" appears in document

[Figure: vector xi with entries such as 1.0, 0.0, 0.0, 6.0, 0.0, 3.0]

Documents to Vectors
Document: Call me Ishmael.
Tokens: Call, me, Ishmael
Tokens (lowercased): call, me, ishmael
Features (sequence): 473, 3591, 17
Alphabet: 17 = ishmael, 473 = call, 3591 = me
Features (bag): 17 -> 1.0, 473 -> 1.0, 3591 -> 1.0
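The document-to-vector steps above can be sketched in plain Java. This is only an illustration of the idea, not MALLET code: the class name, the tokenizing regular expression, and the way indices are assigned (0, 1, 2, ... in order of first appearance, rather than 473, 3591, 17) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; it does not use MALLET's own classes.
class DocumentToVectorSketch {
    // Hypothetical stand-in for an alphabet: token string -> integer index.
    private final Map<String, Integer> alphabet = new LinkedHashMap<>();

    // Split on runs of non-letters, which also drops punctuation.
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        for (String piece : document.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Lowercase each token and map it to its index, adding unseen tokens as they appear.
    int[] toFeatureSequence(List<String> tokens) {
        int[] features = new int[tokens.size()];
        for (int i = 0; i < features.length; i++) {
            String key = tokens.get(i).toLowerCase();
            Integer index = alphabet.get(key);
            if (index == null) {
                index = alphabet.size();
                alphabet.put(key, index);
            }
            features[i] = index;
        }
        return features;
    }

    // Lose order and count duplicates: feature index -> count.
    static Map<Integer, Double> toBag(int[] featureSequence) {
        Map<Integer, Double> bag = new TreeMap<>();
        for (int feature : featureSequence) bag.merge(feature, 1.0, Double::sum);
        return bag;
    }
}
```

The sequence form keeps word order (one index per token), while the bag form keeps only counts, which is why the bag is the sparse representation handed to classifiers.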

Instances
Email message, web page, sentence, journal abstract
Name: What is it called?
Data: What is the input?
Target/Label: What is the output?
Source: What did it originally look like?

Instances
Fields: Name, Data, Target, Source

Classes that appear in the data field:
String
TokenSequence: ArrayList<Token>
FeatureSequence: int[]
FeatureVector: int -> double map

cc.mallet.types

Alphabets
17 = ishmael, 473 = call, 3591 = me

TObjectIntHashMap map
ArrayList entries

int lookupIndex(Object o, boolean shouldAdd)
Object lookupObject(int index)

void stopGrowth()    // do not add entries for new Objects; default is to allow growth
void startGrowth()

cc.mallet.types, gnu.trove
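A two-way mapping of this kind can be sketched with standard collections. The class below is a simplified stand-in, not MALLET's implementation: it uses java.util.HashMap instead of gnu.trove, and returning -1 for an object that is absent and not added is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the idea behind an alphabet; not MALLET's implementation.
class AlphabetSketch {
    private final Map<Object, Integer> map = new HashMap<>();   // object -> index
    private final List<Object> entries = new ArrayList<>();     // index -> object
    private boolean growthStopped = false;

    // Returns the index of o, adding it if allowed; -1 if it is absent and was not added.
    int lookupIndex(Object o, boolean shouldAdd) {
        Integer index = map.get(o);
        if (index != null) return index;
        if (!shouldAdd || growthStopped) return -1;
        map.put(o, entries.size());
        entries.add(o);
        return entries.size() - 1;
    }

    Object lookupObject(int index) { return entries.get(index); }

    void stopGrowth() { growthStopped = true; }

    void startGrowth() { growthStopped = false; }

    int size() { return entries.size(); }
}
```

Stopping growth is useful once training is finished: words seen only in test data then map to no feature instead of silently enlarging the vocabulary.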

Creating Instances
Instance constructor method:
new Instance(data, target, name, source)

Iterators:
Iterator<Instance>
FileIterator(File[], ...)
CsvIterator(FileReader, Pattern)
ArrayIterator(Object[])

cc.mallet.pipe.iterator

Creating Instances
FileIterator
Each instance in its own file
Label from dir name:
/data/bad/
/data/good/

cc.mallet.pipe.iterator

Creating Instances
CsvIterator
Each instance on its own line

1001    Melville    Call me Ishmael. Some years ago ...
1002    Dickens     It was the best of times, it was ...

^([^\t]+)\t([^\t]+)\t(.*)
Name, label, data from regular expression groups.
"CSV" is a lousy name. LineRegexIterator?

cc.mallet.pipe.iterator
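To see how the three capture groups pick out name, label, and data, here is a small sketch using only java.util.regex. The class and method names are made up for illustration, and the sample line assumes the fields are separated by tabs, as the pattern requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineRegexSketch {
    // The pattern from the slide: three tab-separated fields.
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    // Returns {name, label, data}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("1001\tMelville\tCall me Ishmael. Some years ago");
        System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
    }
}
```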

Instance Pipelines
Sequential transformations of instance fields (usually Data)
Pass an ArrayList<Pipe> to SerialPipes

// "data" is a String
CharSequence2TokenSequence
// tokenize with regexp
TokenSequenceLowercase
// modify each token's text
TokenSequenceRemoveStopwords
// drop some tokens
TokenSequence2FeatureSequence
// convert token Strings to ints
FeatureSequence2FeatureVector
// lose order, count duplicates

cc.mallet.pipe
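The idea of chaining transformations can be sketched without MALLET as a list of functions applied in order. Everything here is a stand-in: the class name, the use of UnaryOperator in place of Pipe, and the two-word stoplist are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

class SerialPipeSketch {
    // Apply each step in order; every step maps a token list to a token list.
    static List<String> run(List<UnaryOperator<List<String>>> steps, List<String> tokens) {
        List<String> current = tokens;
        for (UnaryOperator<List<String>> step : steps) current = step.apply(current);
        return current;
    }

    // Modify each token's text.
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase());
        return out;
    }

    // Drop some tokens. Hypothetical two-word stoplist, for illustration only.
    static List<String> removeStopwords(List<String> tokens) {
        List<String> stop = Arrays.asList("the", "a");
        List<String> out = new ArrayList<>();
        for (String t : tokens) if (!stop.contains(t)) out.add(t);
        return out;
    }

    // Build the two-step pipeline and run it on the given tokens.
    static List<String> demo(List<String> tokens) {
        List<UnaryOperator<List<String>>> steps = new ArrayList<>();
        steps.add(SerialPipeSketch::lowercase);
        steps.add(SerialPipeSketch::removeStopwords);
        return run(steps, tokens);
    }
}
```

Order matters: lowercasing has to run before stopword removal here, because the stoplist only contains lowercase forms.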

Instance Pipelines
A small number of pipes modify the "target" field
There are now two alphabets: data and label

// "target" is a String
Target2Label
// convert String to int
// "target" is now a Label

Alphabet > LabelAlphabet

cc.mallet.pipe, cc.mallet.types

Label objects
Weights on a fixed set of classes
For training data, weight for correct label is 1.0, all others 0.0

implements Labeling
int getBestIndex()
Label getBestLabel()

You cannot create a Label, they are only produced by LabelAlphabet

cc.mallet.types

InstanceLists
A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet

InstanceList instances =
    new InstanceList(pipe);
instances.addThruPipe(iterator);

cc.mallet.types

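Putting it all together

import java.util.ArrayList;
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.FileIterator;
import cc.mallet.types.InstanceList;

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

// Label first, then tokenize, index, and count the data field
pipeList.add(new Target2Label());
pipeList.add(new CharSequence2TokenSequence());
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new FeatureSequence2FeatureVector());

InstanceList instances =
    new InstanceList(new SerialPipes(pipeList));

instances.addThruPipe(new FileIterator(. . .));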

Persistent Storage
Most MALLET classes use Java serialization to store models and data

ObjectOutputStream oos =
    new ObjectOutputStream(. . .);
oos.writeObject(instances);
oos.close();

Pipes, data objects, labelings, etc. all need to implement Serializable.
Be sure to include custom classes in classpath, or you get a StreamCorruptedException

java.io
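A round trip through Java serialization looks roughly like the sketch below. It writes to an in-memory byte array rather than a file so that it runs anywhere; the class name and the choice to wrap checked exceptions in unchecked ones are assumptions of this example, and any Serializable object (here an ArrayList) stands in for an InstanceList.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationSketch {
    // Serialize any Serializable object to a byte array.
    static byte[] write(Serializable object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read the object back; its class must be available on the classpath.
    static Object read(byte[] data) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class not on classpath", e);
        }
    }
}
```

To store to disk instead, the same ObjectOutputStream can wrap a FileOutputStream.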

Review
What are the four main fields in an Instance?
What are two ways to generate Instances?
How do we modify the value of Instance fields?
Name some classes that appear in the "data" field.

Classification

Classifier objects
Classifiers map from instances to distributions over a fixed set of classes
MaxEnt, Naïve Bayes, Decision Trees

[Figure: given data "watery", which class (NN, JJ, PRP, VB, CC) is best?]

// classify() returns a Classification, which wraps the Labeling
Labeling labeling =
    classifier.classify(instance).getLabeling();
Label l = labeling.getBestLabel();
System.out.print(instance + "\t");
System.out.println(l);

cc.mallet.classify

Training Classifier objects
Each type of classifier has one or more ClassifierTrainer classes

ClassifierTrainer trainer =
    new MaxEntTrainer();
Classifier classifier =
    trainer.train(instances);

cc.mallet.classify

Training Classifier objects
Some classifiers require numerical optimization of an objective function.

log P(Labels | Data) =
    log f(label1, data1, w) +
    log f(label2, data2, w) +
    log f(label3, data3, w) + ...

Maximize w.r.t. w!

cc.mallet.optimize

Parameters w
Association between feature, class label
How many parameters for K classes and N features?

feature      class    weight
action       NN        0.13
action       VB       -0.1
action       JJ       -0.21
SUFF-tion    NN        1.3
SUFF-tion    VB       -2.1
SUFF-tion    JJ       -1.7
SUFF-on      NN        0.01
SUFF-on      VB       -0.02
Training Classifier objects
interface Optimizer
    boolean optimize()
Limited-memory BFGS, Conjugate gradient

interface Optimizable
    interface ByValue
    interface ByValueGradient
Specific objective functions

cc.mallet.optimize

Training Classifier objects
MaxEntOptimizableByLabelLikelihood

For Optimizable interface:
double[] getParameters()
void setParameters(double[] parameters)

Log likelihood and its first derivative:
double getValue()
void getValueGradient(double[] buffer)

cc.mallet.classify
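The division of labor between an objective (value plus gradient) and an optimizer can be sketched as follows. The interface here is hypothetical, modeled loosely on the four methods listed above rather than copied from MALLET, and plain fixed-step gradient ascent is swapped in for L-BFGS to keep the example short. The toy objective is a concave quadratic, not a real log likelihood.

```java
class GradientAscentSketch {
    // Hypothetical interface modeled loosely on the methods listed above; not MALLET's own.
    interface ValueGradient {
        double[] getParameters();
        void setParameters(double[] parameters);
        double getValue();
        void getValueGradient(double[] buffer);
    }

    // Toy concave objective: value = -sum_i (w_i - target_i)^2, maximized at w = target.
    static class Quadratic implements ValueGradient {
        private final double[] target;
        private double[] w;

        Quadratic(double[] target) {
            this.target = target.clone();
            this.w = new double[target.length];
        }

        public double[] getParameters() { return w.clone(); }

        public void setParameters(double[] parameters) { w = parameters.clone(); }

        public double getValue() {
            double v = 0.0;
            for (int i = 0; i < w.length; i++) v -= (w[i] - target[i]) * (w[i] - target[i]);
            return v;
        }

        public void getValueGradient(double[] buffer) {
            for (int i = 0; i < w.length; i++) buffer[i] = -2.0 * (w[i] - target[i]);
        }
    }

    // Plain fixed-step gradient ascent, a much simpler substitute for L-BFGS.
    static void optimize(ValueGradient f, double stepSize, int iterations) {
        double[] gradient = new double[f.getParameters().length];
        for (int it = 0; it < iterations; it++) {
            double[] w = f.getParameters();
            f.getValueGradient(gradient);
            for (int i = 0; i < w.length; i++) w[i] += stepSize * gradient[i];
            f.setParameters(w);
        }
    }
}
```

The optimizer never looks inside the objective; it only asks for parameters, a value, and a gradient, which is what lets one optimizer serve many different models.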

Evaluation of Classifiers
Create random test/train splits

InstanceList[] instanceLists =
    instances.split(new Randoms(),
                    new double[] {0.9, 0.1, 0.0});

90% training
10% testing
0% validation

cc.mallet.types
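A proportional random split can be sketched with a shuffle followed by cuts at the cumulative proportions. This is a generic illustration on a java.util.List, not the InstanceList implementation; the class name and the rounding rule are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SplitSketch {
    // Shuffle a copy of the items, then cut it into consecutive parts by proportion.
    static <T> List<List<T>> split(List<T> items, Random random, double[] proportions) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, random);
        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int start = 0;
        for (double p : proportions) {
            cumulative += p;
            int end = (int) Math.round(cumulative * shuffled.size());
            end = Math.min(Math.max(end, start), shuffled.size());   // keep cuts in order and in range
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }
}
```

Passing a seeded Random makes the split reproducible, which helps when comparing two classifiers on the same test set.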

Evaluation of Classifiers
The Trial class stores the results of classifications on an InstanceList (testing or training)

Trial(Classifier c, InstanceList list)
double getAccuracy()
double getAverageRank()
double getF1(int/Label/Object)
double getPrecision(...)
double getRecall(...)

cc.mallet.classify
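For reference, the standard definitions behind these metrics can be written out directly on arrays of true and predicted label indices. This is a self-contained sketch with invented names, not the Trial class; returning 0.0 when a denominator is zero is a convention chosen for this example.

```java
class TrialSketch {
    // Fraction of positions where the predicted label equals the true label.
    static double accuracy(int[] truth, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < truth.length; i++) if (truth[i] == predicted[i]) correct++;
        return (double) correct / truth.length;
    }

    // Of the instances predicted as `label`, the fraction that truly are `label`.
    static double precision(int[] truth, int[] predicted, int label) {
        int predictedCount = 0, hits = 0;
        for (int i = 0; i < truth.length; i++) {
            if (predicted[i] == label) {
                predictedCount++;
                if (truth[i] == label) hits++;
            }
        }
        return predictedCount == 0 ? 0.0 : (double) hits / predictedCount;
    }

    // Of the instances that truly are `label`, the fraction predicted as `label`.
    static double recall(int[] truth, int[] predicted, int label) {
        int trueCount = 0, hits = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] == label) {
                trueCount++;
                if (predicted[i] == label) hits++;
            }
        }
        return trueCount == 0 ? 0.0 : (double) hits / trueCount;
    }

    // Harmonic mean of precision and recall.
    static double f1(int[] truth, int[] predicted, int label) {
        double p = precision(truth, predicted, label);
        double r = recall(truth, predicted, label);
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```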

Review
I have invented a new classifier: David regression.
What class should I implement to classify instances?
What class should I implement to train a David regression classifier?
I want to train using ByValueGradient. What mathematical functions do I need to code up, and what class should I put them in?
How would I check whether my new classifier works better than Naïve Bayes?

Sequence Tagging

Sequence Tagging
Data occurs in sequences
Categorical labels for each position
Labels are correlated

DET  NN   VBS    VBG
the  dog  likes  running

??   ??   ??     ??
the  dog  likes  running

Sequence Tagging
Classification: n-way
Sequence Tagging: n^T-way

[Figure: lattice with candidate labels NN, JJ, PRP, VB, CC at each position of "or red dogs on blue trees"]
Avoiding Exponential Blowup
Markov property
Dynamic programming

Example: DET JJ NN VB
This one, given this one, is independent of these

[Figure: lattice of candidate labels (NN, JJ, PRP, VB, CC) over "or red dogs on blue trees", processed one position at a time: "red dogs on blue trees", then "dogs on blue trees"]

[Photo: Andrei Markov]
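The dynamic program that the Markov property makes possible is usually the Viterbi algorithm. The sketch below is a generic textbook version on arrays of log-scores, with invented names and caller-supplied scores; it is not taken from MALLET. It finds the best of n^T label sequences in O(T · n²) time.

```java
class ViterbiSketch {
    // logStart[s]: score of starting in state s.
    // logTrans[prev][s]: score of moving from prev to s.
    // logEmit[t][s]: score of the observation at position t under state s.
    // All scores are hypothetical log-values supplied by the caller.
    static int[] bestPath(double[] logStart, double[][] logTrans, double[][] logEmit) {
        int T = logEmit.length, n = logStart.length;
        double[][] best = new double[T][n];   // best score of any path ending in state s at position t
        int[][] back = new int[T][n];         // predecessor state on that best path
        for (int s = 0; s < n; s++) best[0][s] = logStart[s] + logEmit[0][s];
        for (int t = 1; t < T; t++) {
            for (int s = 0; s < n; s++) {
                best[t][s] = Double.NEGATIVE_INFINITY;
                for (int prev = 0; prev < n; prev++) {
                    double score = best[t - 1][prev] + logTrans[prev][s] + logEmit[t][s];
                    if (score > best[t][s]) {
                        best[t][s] = score;
                        back[t][s] = prev;
                    }
                }
            }
        }
        // Pick the best final state, then follow back-pointers to the start.
        int[] path = new int[T];
        for (int s = 1; s < n; s++) {
            if (best[T - 1][s] > best[T - 1][path[T - 1]]) path[T - 1] = s;
        }
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }
}
```

Because each position only needs the best scores from the previous position, earlier columns of the lattice can be summarized and set aside, which is the shrinking lattice the slides illustrate.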

Hidden Markov Models and Conditional Random Fields
Hidden Markov Model: fully generative
P(Labels | Data) = P(Data, Labels) / P(Data)
Conditional Random Field: conditional
P(Labels | Data)

Hidden Markov Models and Conditional Random Fields
Hidden Markov Model: simple (independent) output space
    "NSF-funded"
Conditional Random Field: arbitrarily complicated outputs
    "NSF-funded", CAPITALIZED, HYPHENATED, ENDS-WITH-ed, ENDS-WITH-d

Hidden Markov Models and Conditional Random Fields
Hidden Markov Model: simple (independent) output space
    FeatureSequence: int[]
Conditional Random Field: arbitrarily complicated outputs
    FeatureVectorSequence: FeatureVector[]

Importing Data
SimpleTagger format: one word per line, with instances delimited by a blank line

Call VB
me PPN
Ishmael NNP
. .

Some JJ
years NNS

With additional features on each line, before the label:

Call SUFF-ll VB
me TWO_LETTERS PPN
Ishmael BIBLICAL_NAME NNP
. PUNCTUATION .

Some CAPITALIZED JJ
years TIME SUFF-s NNS

Importing Data
LineGroupIterator

SimpleTaggerSentence2TokenSequence()
//String to Tokens, handles labels

[Pipes that modify tokens]

TokenSequence2FeatureVectorSequence()
//Token objects to FeatureVectors

cc.mallet.pipe, cc.mallet.pipe.iterator

Importing Data
//Ishmael
TokenTextCharSuffix("C2=", 2)
//Ishmael C2=el
RegexMatches("CAP", Pattern.compile("\\p{Lu}.*"))    // must match entire string
//Ishmael C2=el CAP
LexiconMembership("NAME", new File("names"), false)  // file has one name per line; last argument: ignore case?
//Ishmael C2=el CAP NAME

cc.mallet.pipe.tsf

Sliding window features
a red dog on a blue tree

Features for "dog":
red@-1
a@-2
on@1
a@-2_&_red@-1

Importing Data
int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 };      // previous position
conjunctions[1] = new int[] { 1 };       // next position
conjunctions[2] = new int[] { -2, -1 };  // previous two

OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 on@1

TokenTextCharSuffix("C1=", 1)
OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 a@-2_&_C1=d@-1

cc.mallet.pipe.tsf
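How an offsets array turns into feature names can be sketched as follows. This is an illustrative reimplementation on plain strings, not the OffsetConjunctions pipe: the class name is invented, and skipping any conjunction that reaches outside the sentence is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetConjunctionSketch {
    // Build features for the token at `position`: one feature per offset group,
    // joining word@offset parts with "_&_". Groups that fall outside the sentence are skipped.
    static List<String> features(String[] tokens, int position, int[][] conjunctions) {
        List<String> result = new ArrayList<>();
        for (int[] offsets : conjunctions) {
            StringBuilder name = new StringBuilder();
            boolean inRange = true;
            for (int offset : offsets) {
                int index = position + offset;
                if (index < 0 || index >= tokens.length) {
                    inRange = false;
                    break;
                }
                if (name.length() > 0) name.append("_&_");
                name.append(tokens[index]).append('@').append(offset);
            }
            if (inRange) result.add(name.toString());
        }
        return result;
    }
}
```

Called with the sentence "a red dog on a blue tree", position 2 ("dog"), and the three offset groups above, it yields red@-1, on@1, and a@-2_&_red@-1.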

Finite State Transducers
Finite state machine over two alphabets (observed, hidden)

Hidden:   DET  NN   VBS
Observed: the  dog

P(DET)
P(the | DET)
P(NN | DET)
P(dog | NN)
P(VBS | NN)
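For an HMM, the factors listed above multiply together into the joint probability of a word sequence and its labels. The sketch below illustrates that product with hash maps; the class name, the string key format, and any probabilities a caller puts into the tables are hypothetical, and a missing entry is treated as probability 0.

```java
import java.util.HashMap;
import java.util.Map;

class HmmJointSketch {
    final Map<String, Double> start = new HashMap<>();       // label -> P(label) at the first position
    final Map<String, Double> transition = new HashMap<>();  // "prev>next" -> P(next | prev)
    final Map<String, Double> emission = new HashMap<>();    // "label>word" -> P(word | label)

    // P(words, labels) = P(l_1) P(w_1 | l_1) * product over t > 1 of P(l_t | l_{t-1}) P(w_t | l_t).
    // Missing entries count as probability 0 (a disallowed transition or emission).
    double jointProbability(String[] words, String[] labels) {
        double p = start.getOrDefault(labels[0], 0.0);
        for (int t = 0; t < words.length; t++) {
            if (t > 0) p *= transition.getOrDefault(labels[t - 1] + ">" + labels[t], 0.0);
            p *= emission.getOrDefault(labels[t] + ">" + words[t], 0.0);
        }
        return p;
    }
}
```

Leaving a transition out of the table has the same effect as setting its probability to zero, which also removes a parameter from the model.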

How many parameters?
Determines efficiency of training
Too many leads to overfitting

Trick: Don't allow certain transitions
P(VBS | DET) = 0

[Figure: three copies of the label/word lattice DET NN VBS over "the dog runs"]

Finite State Transducers
abstract class Transducer
    CRF
    HMM

abstract class TransducerTrainer
    CRFTrainerByLabelLikelihood
    HMMTrainerByLikelihood

cc.mallet.fst

Finite State Transducers
[Figure: DET NN VBS over "the dog runs"]
First order: one weight for every pair of labels and observations.

CRF crf = new CRF(pipe, null);
crf.addFullyConnectedStates();
// or
crf.addStatesForLabelsConnectedAsIn(instances);

cc.mallet.fst

Finite State Transducers
[Figure: DET NN VBS over "the dog runs"]
"three-quarter" order: one weight for every pair of labels and observations.

crf.addStatesForThreeQuarterLabelsConnectedAsIn(instances);

cc.mallet.fst

Finite State Transducers
[Figure: DET NN VBS over "the dog runs"]
Second order: one weight for every triplet of labels and observations.

crf.addStatesForBiLabelsConnectedAsIn(instances);

cc.mallet.fst

Finite State Transducers
[Figure: DET NN VBS over "the dog runs"]
"Half" order: equivalent to independent classifiers, except some transitions may be illegal.

crf.addStatesForHalfLabelsConnectedAsIn(instances);

cc.mallet.fst

Training a transducer
CRF crf = new CRF(pipe, null);
crf.addStatesForLabelsConnectedAsIn(trainingInstances);

CRFTrainerByLabelLikelihood trainer =
    new CRFTrainerByLabelLikelihood(crf);

trainer.train(trainingInstances);

cc.mallet.fst

Evaluating a transducer
CRFTrainerByLabelLikelihood trainer =
    new CRFTrainerByLabelLikelihood(transducer);

TransducerEvaluator evaluator =
    new TokenAccuracyEvaluator(testing, "testing");

trainer.addEvaluator(evaluator);

trainer.train(trainingInstances);

cc.mallet.fst

Applying a transducer
Sequence output = transducer.transduce(input);

// print each input token with its predicted label, e.g. the/DET
for (int index = 0; index < input.size(); index++) {
    System.out.print(input.get(index) + "/");
    System.out.print(output.get(index) + " ");
}

cc.mallet.fst

Review
How do you add new features to TokenSequences?
What are three factors that affect the number of parameters in a model?

Topic Modeling

Topics: Semantic Groups
News Article
Topics: Sports, Negotiation
Words: strike, team, player, deadline, union, game

Series Yankees Sox Red World League game Boston team games baseball Mets Game series won Clemens Braves Yankee teams

players League owners league baseball union commissioner Baseball Association labor Commissioner Football major teams Selig agreement strike team bargaining

Training a Topic Model


ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

cc.mallet.topics

Evaluating a Topic Model
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

MarginalProbEstimator evaluator =
    lda.getProbEstimator();

double logLikelihood =
    evaluator.evaluateLeftToRight(testing, 10, false, null);

cc.mallet.topics

Inferring topics for new documents
ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

TopicInferencer inferencer =
    lda.getInferencer();

double[] topicProbs =
    inferencer.getSampledDistribution(instance, 100, 10, 10);

cc.mallet.topics

More than words
Text collections mix free text and structured data

David Mimno
Andrew McCallum
UAI
2008

"Topic models conditioned on arbitrary features using Dirichlet-multinomial regression."

Dirichlet-multinomial Regression (DMR)
The corpus specifies a vector of real-valued features (x) for each document, of length F.
Each topic has an F-length vector of parameters.

Topic parameters for feature "published in JMLR"

 2.27   kernel, kernels, rational kernels, string kernels, fisher kernel
 1.74   bounds, vc dimension, bound, upper bound, lower bounds
 1.41   reinforcement learning, learning, reinforcement
 1.40   blind source separation, source separation, separation, channel
 1.37   nearest neighbor, boosting, nearest neighbors, adaboost
-1.12   agent, agents, multi agent, autonomous agents
-1.21   strategies, strategy, adaptation, adaptive, driven
-1.23   retrieval, information retrieval, query, query expansion
-1.36   web, web pages, web page, world wide web, web sites
-1.44   user, users, user interface, interactive, interface

Feature parameters for RL topic

 2.99   Sridhar Mahadevan
 2.88   ICML
 2.56   Kenji Doya
 2.45   ECML
 2.19   Machine Learning Journal
-1.38   ACL
-1.47   CVPR
-1.54   IEEE Trans. PAMI
-1.64   COLING
-3.76   <default>

Topic parameters for feature "published in UAI"

 2.88   bayesian networks, bayesian network, belief networks
 2.26   qualitative, reasoning, qualitative reasoning, qualitative simulation
 2.25   probability, probabilities, probability distributions,
 2.25   uncertainty, symbolic, sketch, primal sketch, uncertain, connectionist
 2.11   reasoning, logic, default reasoning, nonmonotonic reasoning
-1.29   shape, deformable, shapes, contour, active contour
-1.36   digital libraries, digital library, digital, library
-1.37   workshop report, invited talk, international conference, report
-1.50   descriptions, description, top, bottom, top bottom
-1.50   nearest neighbor, boosting, nearest neighbors, adaboost

Feature parameters for Bayes nets topic

 2.88   UAI
 2.41   Mary-Anne Williams
 2.23   Ashraf M. Abdelbar
 2.15   Philippe Smets
 2.04   Loopy Belief Propagation for Approximate Inference (Murphy, Weiss, and Jordan, UAI, 1999)
-1.16   Probabilistic Semantics for Nonmonotonic Reasoning (Pearl, KR, 1989)
-1.38   COLING
-1.50   Neural Networks
-2.24   ICRA
-3.36   <default>

Dirichlet-multinomial Regression
Arbitrary observed features of documents
Target contains FeatureVector

DMRTopicModel dmr =
    new DMRTopicModel(numTopics);

dmr.addInstances(training);
dmr.estimate();

dmr.writeParameters(new File("dmr.parameters"));

Polylingual Topic Modeling
Topics exist in more languages than you could possibly learn
Topically comparable documents are much easier to get than translation sets
Translation dictionaries
    cover pairs, not sets of languages
    miss technical vocabulary
    aren't available for low-resource languages

Topics from European Parliament Proceedings

Topics from Wikipedia

Aligned instance lists
dog, cat, pig
chien, chat
hund, schwein

Polylingual Topics
InstanceList[] training =
    new InstanceList[] { english, german, arabic, mahican };

PolylingualTopicModel pltm =
    new PolylingualTopicModel(numTopics);

pltm.addInstances(training);

MALLET hands-on tutorial
http://mallet.cs.umass.edu/mallet-handson.tar.gz
