
Practical and Effective
Neural Entity Recognition
in spaCy v2.0 and beyond

Matthew Honnibal 💥 Explosion AI


Explosion AI is a digital studio
specialising in Artificial Intelligence
and Natural Language Processing.

spaCy: open-source library for industrial-strength Natural Language Processing

Thinc: spaCy's next-generation Machine Learning library for deep learning with text

Prodigy: a radically efficient data collection and annotation tool, powered by active learning

Coming soon: pre-trained, customisable models for a variety of languages and domains

Matthew Honnibal
CO-FOUNDER

PhD in Computer Science in 2009. 10 years publishing research on state-of-the-art natural language understanding systems. Left academia in 2014 to develop spaCy.

Ines Montani
CO-FOUNDER

Programmer and front-end developer with a degree in media science and linguistics. Has been working on spaCy since its first release. Lead developer of Prodigy.
“I don’t get it. Can you
explain like I’m five?”
Think of us as a boutique kitchen.

free recipes published online → open-source software
catering for select events → consulting
soon: a line of kitchen gadgets → downloadable tools
soon: a line of fancy sauces and spice mixes you can use at home → pre-trained models
spaCy
free, open-source library for Natural Language Processing

helps you build applications that process and “understand” large volumes of text

in use at hundreds of companies


A hopelessly short
introduction to Named
Entity Recognition
What’s NER?

import spacy

nlp = spacy.load('en')
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

[Figure: BILOU transition tags over the example sentence: U-ORG ("Apple"), U-GPE ("U.K."), B-MONEY ... L-MONEY ("$1 billion")]
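For reference, the snippet prints one line per entity; with the 2017-era English model the output matches the spaCy docs' version of this example:

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY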


spaCy's NER performance

SYSTEM                         TYPE     NER F
spaCy en_core_web_sm (2017)    neural   85.67
spaCy en_core_web_lg (2017)    neural   86.42
Strubell et al. (2017)         neural   86.81
Chiu and Nichols (2016)        neural   86.19
Durrett and Klein (2014)       neural   84.04
Ratinov and Roth (2009)        linear   83.45

alpha.spacy.io
spaCy's English models

MODEL                       TYPE     UAS    NER F   POS    WPS     SIZE
en_core_web_sm (2017) v2    neural   91.4   85.5    97.0   8.2k    36MB
en_core_web_lg (2017) v2    neural   91.9   86.4    97.2   8.1k    667MB
en_core_web_sm (2016) v1    linear   86.6   78.5    96.6   25.7k   50MB
en_core_web_lg (2016) v1    linear   90.6   81.4    96.7   18.8k   1GB

alpha.spacy.io
What’s so hard about
Named Entity Recognition?
Entity recognition is not a great thesis topic.
This makes progress slow.

Structured prediction 🤓 interesting!

Knowledge intensive 🤔 potentially cool?

Mix of easy and hard cases 😫 super frustrating...


Transition-based NER

Lample et al. (2016)

Start with an empty stack, all words on the buffer, no entities

Define actions that change the state

Predict the sequence of actions
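
A minimal sketch of this in plain Python (the NERState class and the BILOU-style action names here are illustrative stand-ins, not spaCy's internals):

class NERState:
    # Toy parser state: a buffer of unread words, a stack holding the
    # currently open entity, and a list of finished entities.
    def __init__(self, words):
        self.buffer = list(words)
        self.stack = []
        self.entities = []

    @property
    def is_finished(self):
        return not self.buffer and not self.stack

def apply_action(state, action, label=None):
    # Mutate the state according to one transition.
    if action == "OUT":                # next word is outside any entity
        state.buffer.pop(0)
    elif action == "UNIT":             # next word is a whole entity
        state.entities.append(([state.buffer.pop(0)], label))
    elif action in ("BEGIN", "IN"):    # open or extend an entity
        state.stack.append(state.buffer.pop(0))
    elif action == "LAST":             # close the open entity
        state.stack.append(state.buffer.pop(0))
        state.entities.append((state.stack, label))
        state.stack = []
    return state

state = NERState("Apple is looking".split())
for action, label in [("UNIT", "ORG"), ("OUT", None), ("OUT", None)]:
    state = apply_action(state, action, label)
print(state.entities)   # [(['Apple'], 'ORG')]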


DEEP LEARNING FOR NLP

Embed. Encode.
Attend. Predict.
Think of data shapes,
not application details.

integer → category label
vector → single meaning
sequence of vectors → multiple meanings
matrix → meanings in context
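
In numpy terms, a toy sketch (the width 128 just matches the layer sizes used later):

import numpy as np

word_id = 3502                  # integer: a category label (e.g. a word id)
vector = np.zeros(128)          # vector: a single meaning
vectors = np.zeros((3, 128))    # sequence of vectors: one meaning per token
sentence = np.zeros((3, 128))   # matrix: same shape, but each row should now
                                # encode its token in context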
EMBED

Learn dense embeddings

“You shall know a word by the company it keeps.”

if it barks like a dog...

word2vec, PMI, LSI etc.


NOTATION
|   function concatenation
>>  function composition

EMBED

features = doc2array([NORM, PREFIX, SUFFIX, SHAPE])  # one column of ids per feature

norm   = get_col(0) >> HashEmbed(128, 7500)  # 128-d rows, 7500-row hash table
prefix = get_col(1) >> HashEmbed(128, 7500)
suffix = get_col(2) >> HashEmbed(128, 7500)
shape  = get_col(3) >> HashEmbed(128, 7500)

embed_word = (
    (norm | prefix | suffix | shape)  # concatenate the four embeddings
    >> Maxout(128, pieces=3)          # mix down to one 128-d word vector
)
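The trick behind HashEmbed can be sketched in plain numpy: instead of one row per vocabulary item, each key is hashed with several seeds into a fixed-size table and the matching rows are summed. This is only an illustration of the idea; Thinc's real layer uses MurmurHash and learns the table during training:

import numpy as np

rng = np.random.default_rng(0)
table = rng.normal(0, 1, size=(7500, 128)).astype("float32")  # fixed-size table

def hash_embed(key, seeds=(0, 1, 2, 3)):
    # Hash the key once per seed and sum the table rows it lands on.
    # Distinct keys may collide on one row, but rarely on all four.
    rows = [hash((seed, key)) % table.shape[0] for seed in seeds]
    return table[rows].sum(axis=0)

print(hash_embed("apple").shape)   # (128,)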
ENCODE

Learn to encode context

encode context-independent vectors into a context-sensitive sentence matrix

LSTM, CNN etc.


ENCODE

trigram_cnn = (
    ExtractWindow(nW=1)        # concatenate each token with its neighbours
    >> Maxout(128, pieces=3)   # and mix the 3x128 window back down to 128
)
encode_context = (
    embed_word
    >> Residual(trigram_cnn)   # four residual CNN blocks: each layer adds
    >> Residual(trigram_cnn)   # one word of context per side, so a token
    >> Residual(trigram_cnn)   # ends up sensitive to 4 words on
    >> Residual(trigram_cnn)   # either side
)
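What ExtractWindow(nW=1) computes can be sketched in numpy: each token's vector is concatenated with its immediate neighbours (zero-padded at the sentence edges), and the following Maxout maps the 3x128 window back down to 128. A rough, illustrative version:

import numpy as np

def extract_window(X, nW=1):
    # (n_tokens, d) -> (n_tokens, (2*nW + 1) * d): each row becomes the
    # token's vector concatenated with nW neighbours on each side.
    n, d = X.shape
    padded = np.vstack([np.zeros((nW, d)), X, np.zeros((nW, d))])
    return np.hstack([padded[i:i + n] for i in range(2 * nW + 1)])

X = np.ones((5, 128))
print(extract_window(X).shape)   # (5, 384)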
ATTEND

Learn what to pay attention to

summarize inputs with respect to query

get global problem-specific representation


ATTEND

state2vec = (
    (
        tensor[state.buffer(0)]       # rows of the sentence matrix for
        | tensor[state.buffer(-1)]    # tokens around the front of the buffer...
        | tensor[state.buffer(1)]
        | tensor[state.entities(0)]   # ...and around the current entities
        | tensor[state.entities(-1)]
        | tensor[state.entities(1)]
    )
    >> Maxout(128)                    # mix the concatenation down to 128-d
)
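The "attention" here is mostly indexing: the parser state determines which rows of the sentence matrix matter right now, and those rows are concatenated into one query-specific summary. A rough numpy rendering (positions hard-coded for illustration):

import numpy as np

def state_features(tensor, positions):
    # Pull out the rows for state-relevant tokens and concatenate them;
    # out-of-range positions (e.g. an empty buffer) become zero vectors.
    d = tensor.shape[1]
    rows = [tensor[i] if 0 <= i < len(tensor) else np.zeros(d)
            for i in positions]
    return np.concatenate(rows)

tensor = np.random.rand(10, 128)                            # encoded sentence
print(state_features(tensor, [4, 3, 5, 0, -1, 1]).shape)    # (768,)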
PREDICT

Learn to predict target values

output class IDs, real values, etc.

standard multi-layer perceptron


PREDICT

tensor = trigram_cnn(embed_word(doc))   # embed the words, then encode context
state_weights = state2vec(tensor)       # precompute per-token state features
state = initialize_state(doc)
while not state.is_finished:
    features = get_features(state, state_weights)
    probs = mlp(features)                                       # score actions
    action = actions[(probs * valid_actions(state)).argmax()]   # best valid one
    state = action(state)                                       # apply it
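
That valid_actions(state) mask is what excludes invalid sequences: zeroing the scores of ill-formed actions before the argmax guarantees a well-formed output. A toy version for a BILOU-style MONEY tagger (the action names and helper are illustrative, not spaCy's):

import numpy as np

ACTIONS = ["OUT", "BEGIN-MONEY", "IN-MONEY", "LAST-MONEY", "UNIT-MONEY"]

def valid_actions(prev):
    # Inside an open entity (after BEGIN/IN) only IN or LAST may follow;
    # otherwise IN and LAST are forbidden. 1.0 = valid, 0.0 = invalid.
    inside = prev.startswith(("BEGIN", "IN"))
    return np.array([a.startswith(("IN", "LAST")) == inside for a in ACTIONS],
                    dtype=float)

probs = np.array([0.40, 0.10, 0.20, 0.25, 0.05])
best = ACTIONS[int((probs * valid_actions("BEGIN-MONEY")).argmax())]
print(best)   # LAST-MONEY: OUT scored highest, but it was masked out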
Advantages of the
transition-based approach

Mostly equivalent to sequence tagging

Convenient to share code with parser

Easily exclude invalid sequences

Easily define arbitrary features


Breaking through the
knowledge acquisition
bottleneck
PROBLEM

We need annotations.

We can definitely pre-train embeddings.

We can probably pre-train CNN.

We can pre-train entities, but should fine-tune.

We must train the output from scratch.

We absolutely need evaluation data.


Prodigy (prodi.gy)

Annotation tool combining insights from Machine Learning and UX to help developers train and evaluate models faster.
START THE PRODIGY SERVER

$ prodigy dataset ner_product "Improve PRODUCT on Reddit data"

✨ Created dataset 'ner_product'.

$ prodigy ner.teach ner_product en_core_web_sm ~/data/RC_2010-01.bz2 --loader reddit --label PRODUCT

✨ Starting the web server on port 8080...


TRAIN AND EVALUATE

$ prodigy ner.batch-train ner_product en_core_web_sm --output /tmp/model --eval-split 0.5 --label PRODUCT

Loaded model en_core_web_sm
Using 50% of examples (883) for evaluation
Using 100% of remaining examples (891) for training

Correct     164
Incorrect    46
Baseline    0.005
Accuracy    0.781
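
The exported directory is a regular spaCy model, so (assuming spaCy v2's path-based loading) trying it out looks like this; the example text is made up:

import spacy

nlp = spacy.load('/tmp/model')   # the --output path from ner.batch-train
doc = nlp(u"Just bought the new Weber grill for the backyard")
print([(ent.text, ent.label_) for ent in doc.ents])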
What’s next?

spaCy v2.0 release candidate – almost ready 🎉

Create training data for more languages and specific genres and domains

Add coreference resolution and entity linking

Use self-training to keep models up-to-date


Thanks!
💥 Explosion AI
explosion.ai

📲 Follow us on Twitter
@honnibal
@_inesmontani
@explosion_ai
