Mallet Tutorial
CC BY 4.0
http://mallet.cs.umass.edu/mallet-tutorial.pdf
David Mimno
Department of Information Science, Cornell University

Outline
About MALLET
Representing Data
Classification
Sequence Tagging
Topic Modeling

About MALLET

Who?
Andrew McCallum (most of the work)
Charles Sutton, Aron Culotta, Greg Druck, Kedar Bellare, Gaurav Chandalia
Fernando Pereira, others at Penn

Who am I?
Chief maintainer of MALLET
Primary author of MALLET topic modeling package

Why?
Motivation: text classification and information extraction
Commercial machine learning (Just Research, WhizBang)
Analysis and indexing of academic publications: Cora, Rexa

What?
Text focus: data is discrete rather than continuous, even when values could be continuous:
    double value = 3.0

How?
Command line scripts:
    bin/mallet [command] --[option] [value]
Text User Interface (tui) classes

History
Version 0.4: c. 2004
Classes in edu.umass.cs.mallet.base.*

Learning More
http://mallet.cs.umass.edu
Quick Start guides, focused on command line processing
Developer's guides, with Java examples

Representing Data

Representations
Transform text documents to vectors x1, x2, ...
Retain meaning of vector indices
Ideally sparsely
Document: "Call me Ishmael."
xi = [1.0, 0.0, 0.0, 6.0, 0.0, 3.0]

Representations
Elements of vector are called feature values
Example: Feature at row 345 is number of times "dog" appears in document
xi = [1.0, 0.0, 0.0, 6.0, 0.0, 3.0]

Documents to Vectors
Document: "Call me Ishmael."
Tokens: Call, me, Ishmael
Tokens (lowercased): call, me, ishmael
Alphabet: 17 = ishmael, 473 = call, 3591 = me
Features (sequence): 473, 3591, 17
Features (bag): 17 -> 1.0, 473 -> 1.0, 3591 -> 1.0
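
The steps on this slide can be sketched in plain Java. This is an illustration only, not MALLET code: the class DocumentToVectorSketch and its methods are hypothetical names, the tokenizer simply splits on non-letters, and the alphabet is assumed to already contain every token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of MALLET.
final class DocumentToVectorSketch {

    // Document -> lowercased tokens (split on runs of non-letter characters).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Tokens -> feature sequence: one integer ID per token, order kept.
    // Assumes every token is already present in the alphabet.
    static int[] toFeatureSequence(List<String> tokens, Map<String, Integer> alphabet) {
        int[] ids = new int[tokens.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = alphabet.get(tokens.get(i));
        }
        return ids;
    }

    // Feature sequence -> bag: order is lost, duplicates are counted.
    static Map<Integer, Double> toFeatureVector(int[] ids) {
        Map<Integer, Double> counts = new TreeMap<>();
        for (int id : ids) {
            counts.merge(id, 1.0, Double::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> alphabet = new HashMap<>();
        alphabet.put("ishmael", 17);
        alphabet.put("call", 473);
        alphabet.put("me", 3591);

        int[] ids = toFeatureSequence(tokenize("Call me Ishmael."), alphabet);
        System.out.println(Arrays.toString(ids));   // [473, 3591, 17]
        System.out.println(toFeatureVector(ids));   // {17=1.0, 473=1.0, 3591=1.0}
    }
}
```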

Instances
Email message, web page, sentence, journal abstract
Name: What is it called?
Data: What is the input?
Target/Label: What is the output?
Source: What did it originally look like?

Instances
Name
Data
Target
Source
Types seen in these fields:
    String
    TokenSequence: ArrayList<Token>
    FeatureSequence: int[]
    FeatureVector: int -> double map
cc.mallet.types

Alphabets
17 = ishmael, 473 = call, 3591 = me
TObjectIntHashMap map
ArrayList entries
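
The two fields on the slide suggest a two-way lookup: a hash map from entry to index, and a list from index back to entry. A minimal sketch of that idea follows, using java.util.HashMap in place of TObjectIntHashMap. AlphabetSketch is a hypothetical class written for this example and makes no claim to match the real Alphabet implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical two-way index for illustration; not the MALLET Alphabet class.
final class AlphabetSketch {
    private final Map<Object, Integer> map = new HashMap<>();  // entry -> index
    private final List<Object> entries = new ArrayList<>();    // index -> entry

    // Returns the index of an entry. Unknown entries get the next free
    // index when addIfAbsent is true, and -1 otherwise.
    int lookupIndex(Object entry, boolean addIfAbsent) {
        Integer index = map.get(entry);
        if (index != null) {
            return index;
        }
        if (!addIfAbsent) {
            return -1;
        }
        entries.add(entry);
        map.put(entry, entries.size() - 1);
        return entries.size() - 1;
    }

    // Returns the entry stored at an index.
    Object lookupObject(int index) {
        return entries.get(index);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        AlphabetSketch alphabet = new AlphabetSketch();
        for (String word : new String[] {"call", "me", "ishmael", "call"}) {
            System.out.println(word + " -> " + alphabet.lookupIndex(word, true));
        }
    }
}
```

Indices are assigned in order of first appearance, so the same entry always maps to the same vector position as long as the same alphabet object is reused.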

Creating Instances
Instance constructor method
Iterators (Iterator<Instance>):
    FileIterator(File[], )
    CsvIterator(FileReader, Pattern)
    ArrayIterator(Object[])
cc.mallet.pipe.iterator

Creating Instances
FileIterator
Each instance in its own file
Label from dir name: /data/bad/, /data/good/
cc.mallet.pipe.iterator

Creating Instances
CsvIterator
Each instance on its own line
    1001. Melville
    1002. Dickens
    ^([^\t]+)\t([^\t]+)\t(.*)
Name, label, data from regular expression groups.
"CSV" is a lousy name. LineRegexIterator?
cc.mallet.pipe.iterator
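
To show how the three regular expression groups yield name, label and data, here is a small standalone sketch using only java.util.regex. LineRegexSketch and the tab-separated sample line are made up for this example; the pattern is the one on the slide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of MALLET.
final class LineRegexSketch {
    // Three tab-separated fields: name, label, then the rest of the line as data.
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    // Returns {name, label, data}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        // Sample line invented for this sketch.
        String[] fields = parse("1001\tMelville\tCall me Ishmael.");
        System.out.println("name=" + fields[0] + " label=" + fields[1] + " data=" + fields[2]);
    }
}
```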

Instance Pipelines
Sequential transformations of instance fields (usually Data)
Pass an ArrayList<Pipe> to SerialPipes
    // data is a String
    CharSequence2TokenSequence      // tokenize with regexp
    TokenSequenceLowercase          // modify each token's text
    TokenSequenceRemoveStopwords    // drop some tokens
    TokenSequence2FeatureSequence   // convert token Strings to ints
    FeatureSequence2FeatureVector   // lose order, count duplicates
cc.mallet.pipe

Instance Pipelines
A small number of pipes modify the target field
There are now two alphabets: data and label
    // target is a String
    Target2Label    // convert String to int
    // target is now a Label
cc.mallet.pipe, cc.mallet.types

Label objects
Weights on a fixed set of classes
For training data, weight for correct label is 1.0, all others 0.0
Label implements Labeling
    int getBestIndex()
    Label getBestLabel()
You cannot create a Label, they
cc.mallet.types

InstanceLists
A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet
    InstanceList instances =
        new InstanceList(pipe);

    instances.addThruPipe(iterator);
cc.mallet.types

    // Build the pipe list, wrap it in SerialPipes, then load instances through it.
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    pipeList.add(new Target2Label());
    pipeList.add(new CharSequence2TokenSequence());
    pipeList.add(new TokenSequence2FeatureSequence());
    pipeList.add(new FeatureSequence2FeatureVector());

    InstanceList instances =
        new InstanceList(new SerialPipes(pipeList));

    instances.addThruPipe(new FileIterator(. . .));

Persistent Storage
Most MALLET classes use Java serialization to store models and data
    // file is a placeholder for the destination File
    ObjectOutputStream oos =
        new ObjectOutputStream(new FileOutputStream(file));
    oos.writeObject(instances);
    oos.close();
java.io

Review
What are the four main fields in an Instance?
What are two ways to generate Instances?
How do we modify the value of Instance fields?
Name some classes that appear in the data field.

Classification

Classifier objects
Classifiers map from instances to distributions over a fixed set of classes
MaxEnt, Naïve Bayes, Decision Trees
[Figure: given the data "watery", which class (NN, JJ, PRP, VB, CC) is best?]
    // classify() returns a Classification; its Labeling holds the class weights
    Labeling labeling =
        classifier.classify(instance).getLabeling();

    Label l = labeling.getBestLabel();

    System.out.print(instance + "\t");
    System.out.println(l);
cc.mallet.classify

Maximize w.r.t. w!
cc.mallet.classify, cc.mallet.optimize

Parameters w
Association between feature, class label
How many parameters for K classes and N features?
    Feature     Class   Weight
    action      NN       0.13
    action      VB      -0.1
    action      JJ      -0.21
    SUFF-tion   NN       1.3
    SUFF-tion   VB      -2.1
    SUFF-tion   JJ      -1.7
    SUFF-on     NN       0.01
    SUFF-on     VB      -0.02
MaxEntOptimizableByLabelLikelihood
double[] getParameters()
void setParameters(double[] parameters)
double getValue()
void getValueGradient(double[] buffer)
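
For orientation, the quantities behind getValue() and getValueGradient() can be written out for a standard maximum entropy classifier. This is the textbook form, stated here as an assumption; it ignores any prior or regularization term the actual class may add. Here $w_k$ is the weight vector for class $k$, $x_i$ the feature vector of instance $i$, and $y_i$ its correct label.

```latex
\begin{align}
  p(y \mid x; w) &= \frac{\exp(w_y \cdot x)}{\sum_{y'} \exp(w_{y'} \cdot x)} \\
  \mathcal{L}(w) &= \sum_i \log p(y_i \mid x_i; w) \\
  \frac{\partial \mathcal{L}}{\partial w_{k,j}}
    &= \sum_i x_{i,j} \left( \mathbb{1}[y_i = k] - p(k \mid x_i; w) \right)
\end{align}
```

The value is the log-likelihood of the training labels, and each gradient entry is the observed count of feature $j$ with class $k$ minus its expected count under the current model.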

Evaluation of Classifiers
Create random test/train splits
    InstanceList[] instanceLists =
        instances.split(new Randoms(),
                        new double[] {0.9, 0.1, 0.0});
90% training, 10% testing, 0% validation
cc.mallet.types
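
The idea behind a proportional random split can be sketched without MALLET: shuffle, then cut the list at cumulative proportions. SplitSketch is a hypothetical class for this example and is not how InstanceList.split is necessarily implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of MALLET.
final class SplitSketch {

    // Shuffles a copy of items and cuts it into parts sized by the given
    // non-negative proportions (normalized by their sum).
    static <T> List<List<T>> split(List<T> items, Random random, double[] proportions) {
        double total = 0.0;
        for (double p : proportions) {
            total += p;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("proportions must sum to a positive value");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, random);

        List<List<T>> parts = new ArrayList<>();
        int start = 0;
        double cumulative = 0.0;
        for (int i = 0; i < proportions.length; i++) {
            cumulative += proportions[i];
            // The last part takes whatever remains, so no item is dropped.
            int end = (i == proportions.length - 1)
                    ? shuffled.size()
                    : (int) Math.round(shuffled.size() * cumulative / total);
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            items.add(i);
        }
        List<List<Integer>> parts = split(items, new Random(), new double[] {0.9, 0.1, 0.0});
        System.out.println(parts.get(0).size() + " training, "
                + parts.get(1).size() + " testing, "
                + parts.get(2).size() + " validation");
    }
}
```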

Evaluation of Classifiers
The Trial class stores the results of classifications on an InstanceList (testing or training)
cc.mallet.classify
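
One statistic that can be computed from stored classification results is accuracy: the fraction of instances whose best predicted label equals the true label. The sketch below shows that calculation on plain string labels. AccuracySketch is a hypothetical class for this example and does not reproduce the Trial API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of MALLET.
final class AccuracySketch {

    // Fraction of positions where the predicted label equals the gold label.
    static double accuracy(List<String> predicted, List<String> gold) {
        if (predicted.size() != gold.size() || gold.isEmpty()) {
            throw new IllegalArgumentException("lists must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (predicted.get(i).equals(gold.get(i))) {
                correct++;
            }
        }
        return (double) correct / gold.size();
    }

    public static void main(String[] args) {
        List<String> predicted = Arrays.asList("NN", "VB", "JJ", "NN");
        List<String> gold = Arrays.asList("NN", "VB", "NN", "NN");
        System.out.println(accuracy(predicted, gold)); // 0.75
    }
}
```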

Review
I have invented a new classifier: David regression.
What class should I implement to classify instances?
What class should I implement to train a David regression classifier?
I want to train using ByValueGradient. What mathematical functions do I need to code up, and what class should I put them in?
How would I check whether my new classifier works better than Naïve Bayes?

Sequence Tagging

Sequence Tagging
Data occurs in sequences
Categorical labels for each position
Labels are correlated
    ??    ??    ??      ??
    the   dog   likes   running

Sequence Tagging
Classification: n-way
Sequence Tagging: n^T-way
[Figure: one column of candidate labels (NN, JJ, PRP, VB, CC) for each of the six tokens "or red dogs on blue trees"]
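
As a worked instance of the n^T count, take the figure's five candidate labels and six tokens, so $n = 5$ and $T = 6$:

```latex
\[
  n^{T} = 5^{6} = 15625
\]
```

A single classification chooses among 5 outcomes, while tagging the whole six-token sequence chooses among 15625 possible label sequences.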
Andrei Markov
P(Labels | Data) =
P(Data, Labels) / P(Data)

Conditional Random Field: conditional P(Labels | Data)
Conditional Random Field: arbitrarily complicated outputs
Example token "NSF-funded": CAPITALIZED, HYPHENATED, ENDS-WITH-ed, ENDS-WITH-d
FeatureSequence: int[]
FeatureVectorSequence: FeatureVector[]

Importing Data
SimpleTagger format: one word per line, with instances delimited by a blank line
    Call VB
    me PPN
    Ishmael NNP
    . .

    Some JJ
    years NNS
Each line may also carry features between the word and the label:
    Call SUFF-ll VB
    me TWO_LETTERS PPN
    Ishmael BIBLICAL_NAME NNP
    . PUNCTUATION .

    Some CAPITALIZED JJ
    years TIME SUFF-s NNS

Importing Data
    LineGroupIterator

    SimpleTaggerSentence2TokenSequence()
    // String to Tokens, handles labels

    [Pipes that modify tokens]

    TokenSequence2FeatureVectorSequence()
    // Token objects to FeatureVectors
cc.mallet.pipe, cc.mallet.pipe.iterator

Importing Data
    // Ishmael
    TokenTextCharSuffix("C2=", 2)
    // Ishmael C2=el
    RegexMatches("CAP", Pattern.compile("\\p{Lu}.*"))   // must match entire string
    // Ishmael C2=el CAP
    LexiconMembership("NAME", new File("names"), false)
    // Ishmael C2=el CAP NAME
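
The effect of these three pipes on a single token can be imitated in plain Java. The sketch below is an illustration under stated assumptions, not the MALLET implementation: TokenFeatureSketch is a hypothetical class, the lexicon is passed in as a Set instead of being read from a file, and matching is case-sensitive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of MALLET.
final class TokenFeatureSketch {
    // An uppercase letter followed by anything; used with matches(),
    // so the pattern must cover the entire token.
    private static final Pattern CAPITALIZED = Pattern.compile("\\p{Lu}.*");

    static List<String> features(String token, Set<String> lexicon) {
        List<String> features = new ArrayList<>();
        // Suffix feature: the last two characters of the token.
        if (token.length() >= 2) {
            features.add("C2=" + token.substring(token.length() - 2));
        }
        // Regex feature: fires only on a whole-token match.
        if (CAPITALIZED.matcher(token).matches()) {
            features.add("CAP");
        }
        // Lexicon feature: fires when the token is listed in the lexicon.
        if (lexicon.contains(token)) {
            features.add("NAME");
        }
        return features;
    }

    public static void main(String[] args) {
        Set<String> names = Collections.singleton("Ishmael");
        System.out.println("Ishmael " + features("Ishmael", names)); // Ishmael [C2=el, CAP, NAME]
    }
}
```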

Importing Data
    int[][] conjunctions = new int[3][];
    conjunctions[0] = new int[] { -1 };       // previous position
    conjunctions[1] = new int[] { 1 };        // next position
    conjunctions[2] = new int[] { -2, -1 };   // previous two

    OffsetConjunctions(conjunctions)
    // a@-2_&_red@-1 on@1

    TokenTextCharSuffix("C1=", 1)
    OffsetConjunctions(conjunctions)
    // a@-2_&_red@-1 a@-2_&_C1=d@-1
cc.mallet.pipe.tsf
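
The feature names in the comments above follow a pattern: each feature is tagged with the offset it was copied from, and features from several offsets are joined with "_&_". The sketch below reproduces that naming for the simplest case, where each token has exactly one feature (its own text). OffsetConjunctionSketch and the sample sentence are assumptions made for this example; the real pipe works on all features of each token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of MALLET.
final class OffsetConjunctionSketch {

    // For each position, builds one feature per offset group by joining
    // word@offset parts with "_&_". Groups that reach outside the
    // sequence are skipped for that position.
    static List<List<String>> conjoin(List<String> words, int[][] conjunctions) {
        List<List<String>> result = new ArrayList<>();
        for (int position = 0; position < words.size(); position++) {
            List<String> features = new ArrayList<>();
            for (int[] offsets : conjunctions) {
                StringBuilder name = new StringBuilder();
                boolean inRange = true;
                for (int offset : offsets) {
                    int other = position + offset;
                    if (other < 0 || other >= words.size()) {
                        inRange = false;
                        break;
                    }
                    if (name.length() > 0) {
                        name.append("_&_");
                    }
                    name.append(words.get(other)).append('@').append(offset);
                }
                if (inRange) {
                    features.add(name.toString());
                }
            }
            result.add(features);
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] conjunctions = { { -1 }, { 1 }, { -2, -1 } };
        // Sample sentence invented for this sketch.
        List<String> words = Arrays.asList("a", "red", "dog", "on", "blue");
        // Features added at "dog": red@-1, on@1, a@-2_&_red@-1
        System.out.println(conjoin(words, conjunctions).get(2));
    }
}
```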

[Figure: generating "the dog runs" with labels DET NN VBS, one step at a time]
    P(DET)
    P(the | DET)
    P(NN | DET)
    P(dog | NN)
    P(VBS | NN)
    P(VBS | DET) = 0
    DET   NN    VBS
    the   dog   runs
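
Multiplying the step probabilities gives the joint probability of the words and labels under a first-order hidden Markov model. The general form below is the standard one, stated as an assumption; the final factor $P(\text{runs} \mid \text{VBS})$ does not appear on the slides but follows the same pattern.

```latex
\begin{align}
  P(x_{1:T}, y_{1:T}) &= P(y_1)\, P(x_1 \mid y_1) \prod_{t=2}^{T} P(y_t \mid y_{t-1})\, P(x_t \mid y_t) \\
  P(\text{the dog runs}, \text{DET NN VBS})
    &= P(\text{DET})\, P(\text{the} \mid \text{DET})\, P(\text{NN} \mid \text{DET})\,
       P(\text{dog} \mid \text{NN})\, P(\text{VBS} \mid \text{NN})\, P(\text{runs} \mid \text{VBS})
\end{align}
```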

    crf.addStatesForThreeQuarterLabelsConnectedAsIn(instances);
    crf.addStatesForBiLabelsConnectedAsIn(instances);
    crf.addStatesForHalfLabelsConnectedAsIn(instances);
cc.mallet.fst

Training a transducer
    CRF crf = new CRF(pipe, null);
    crf.addStatesForLabelsConnectedAsIn(trainingInstances);
    CRFTrainerByLabelLikelihood trainer =
        new CRFTrainerByLabelLikelihood(crf);

    // train() needs the training InstanceList
    trainer.train(trainingInstances);
cc.mallet.fst

Evaluating a transducer
    CRFTrainerByLabelLikelihood trainer =
        new CRFTrainerByLabelLikelihood(transducer);

    TransducerEvaluator evaluator =
        new TokenAccuracyEvaluator(testing, "testing");

    trainer.addEvaluator(evaluator);

    // training: the training InstanceList (variable name assumed)
    trainer.train(training);
cc.mallet.fst

Applying a transducer
    Sequence output = transducer.transduce(input);

    // Print each input element with its predicted label, as input/label
    for (int index = 0; index < input.size(); index++) {
        System.out.print(input.get(index) + "/");
        System.out.print(output.get(index) + " ");
    }
cc.mallet.fst

Review
How do you add new features to TokenSequences?
What are three factors that affect the number of parameters in a model?

Topic Modeling

[Figure: a News Article with topics Sports and Negotiation; words: strike, team, player, deadline, union, game]
[Figure: topic words: Series, Yankees, Sox, Red, World, League, game, Boston, team, games, baseball, Mets, Game, series, won, Clemens, Braves, Yankee, teams]

    MarginalProbEstimator evaluator =
        lda.getProbEstimator();

    double logLikelihood =
        evaluator.evaluateLeftToRight(testing, 10, false, null);
cc.mallet.topics

    TopicInferencer inferencer =
        lda.getInferencer();

    double[] topicProbs =
        inferencer.getSampledDistribution(instance, 100, 10, 10);
cc.mallet.topics

David Mimno, Andrew McCallum. Topic models conditioned on arbitrary features using Dirichlet-multinomial regression. UAI 2008.

Dirichlet-multinomial Regression (DMR)
[Figure: per-topic feature weights for authors (Sridhar Mahadevan, Kenji Doya, Mary-Anne Williams, Ashraf M. Abdelbar, Philippe Smets), venues (ICML, ECML, ACL, CVPR, COLING, UAI, Neural Networks, ICRA) and a <default> feature; one topic shown is "web, web pages, web page, world wide web, web sites"]

Dirichlet-multinomial Regression
Arbitrary observed features of documents
Target contains FeatureVector
    DMRTopicModel dmr =
        new DMRTopicModel(numTopics);

    dmr.addInstances(training);
    dmr.estimate();

    dmr.writeParameters(new File("dmr.parameters"));

Topics from European Parliament Proceedings
Topics from Wikipedia
chien, chat, hund, schwein

Polylingual Topics
    InstanceList[] training =
        new InstanceList[] { english, german, arabic, mahican };

    PolylingualTopicModel pltm =
        new PolylingualTopicModel(numTopics);

    pltm.addInstances(training);