Advanced NLP with spaCy: Chapter 3


Processing pipelines


Ines Montani
spaCy core developer
What happens when you call nlp?

doc = nlp("This is a sentence.")

First the tokenizer is applied to turn the text into a Doc object. Then a series of pipeline components is applied to the Doc in order, and the processed Doc is returned.



Built-in pipeline components

Name     Description               Creates
tagger   Part-of-speech tagger     Token.tag
parser   Dependency parser         Token.dep, Token.head, Doc.sents, Doc.noun_chunks
ner      Named entity recognizer   Doc.ents, Token.ent_iob, Token.ent_type
textcat  Text classifier           Doc.cats
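To make the table concrete, here is a minimal sketch of these attributes in action; the example text is invented, and the string variants tag_, dep_ and label_ are used for readability:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple is looking at buying a U.K. startup")

# tagger: part-of-speech tags
print([token.tag_ for token in doc])

# parser: syntactic dependencies and sentence boundaries
print([token.dep_ for token in doc])
print([sent.text for sent in doc.sents])

# ner: named entities
print([(ent.text, ent.label_) for ent in doc.ents])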



Under the hood

Pipeline defined in the model's meta.json in order

Built-in components need binary data (the model weights) to make predictions
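As a quick sketch, the metadata from meta.json (including the pipeline list) is exposed at runtime on nlp.meta, assuming a model like en_core_web_sm is installed:

import spacy

nlp = spacy.load('en_core_web_sm')

# The contents of the model's meta.json are available on the nlp object
print(nlp.meta['pipeline'])  # ['tagger', 'parser', 'ner']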



Pipeline attributes

nlp.pipe_names: list of pipeline component names

print(nlp.pipe_names)

['tagger', 'parser', 'ner']

nlp.pipeline: list of (name, component) tuples

print(nlp.pipeline)

[('tagger', <spacy.pipeline.Tagger>),
 ('parser', <spacy.pipeline.DependencyParser>),
 ('ner', <spacy.pipeline.EntityRecognizer>)]
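Relatedly, a single component can be fetched by name with nlp.get_pipe, continuing with the same nlp object (a minimal sketch):

# Get one pipeline component by name
ner = nlp.get_pipe('ner')
print(ner)  # <spacy.pipeline.EntityRecognizer>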



Let's practice!
Custom pipeline components
Why custom components?

Make a function execute automatically when you call nlp

Add your own metadata to documents and tokens

Update built-in attributes like doc.ents



Anatomy of a component (1)
Function that takes a doc, modifies it and returns it

Can be added using the nlp.add_pipe method

def custom_component(doc):
# Do something to the doc here
return doc

nlp.add_pipe(custom_component)



Anatomy of a component (2)
def custom_component(doc):
# Do something to the doc here
return doc

nlp.add_pipe(custom_component)

Argument  Description            Example
last      If True, add last      nlp.add_pipe(component, last=True)
first     If True, add first     nlp.add_pipe(component, first=True)
before    Add before component   nlp.add_pipe(component, before='ner')
after     Add after component    nlp.add_pipe(component, after='tagger')



Example: a simple component (1)
# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

Pipeline: ['custom_component', 'tagger', 'parser', 'ner']



Example: a simple component (2)
# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Process a text
doc = nlp("Hello world!")

Doc length: 3



Let's practice!
Extension attributes

Setting custom attributes
Add custom metadata to documents, tokens and spans

Accessible via the ._ property


doc._.title = 'My document'
token._.is_color = True
span._.has_color = False

Registered on the global Doc, Token or Span using the set_extension method

# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)
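Side note: registering an extension name twice raises an error; set_extension accepts force=True to overwrite an existing extension, which is handy when re-running code interactively (a sketch):

from spacy.tokens import Token

# Re-register an existing extension instead of raising an error
Token.set_extension('is_color', default=False, force=True)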



Extension attribute types
1. Attribute extensions

2. Property extensions

3. Method extensions



Attribute extensions
Set a default value that can be overwritten

from spacy.tokens import Token

# Set extension on the Token with default value
Token.set_extension('is_color', default=False)

doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True



Property extensions (1)
Define a getter and an optional setter function

Getter only called when you retrieve the attribute value

from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)

True - blue



Property extensions (2)
Span extensions should almost always use a getter

from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)

True - sky is blue
False - The sky



Method extensions
Assign a function that becomes available as an object method

Lets you pass arguments to the extension function

from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)

doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

True - blue
False - cloud



Let's practice!
Scaling and performance

Processing large volumes of text
Use nlp.pipe method

Processes texts as a stream, yields Doc objects

Much faster than calling nlp on each text

BAD:

docs = [nlp(text) for text in LOTS_OF_TEXTS]

GOOD:

docs = list(nlp.pipe(LOTS_OF_TEXTS))
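nlp.pipe also works as a lazy stream, so you can iterate over the Docs without building a list; the batch_size value and the sample texts below are illustrative assumptions:

import spacy

nlp = spacy.load('en_core_web_sm')
LOTS_OF_TEXTS = ['First text.', 'Second text.', 'Third text.']  # hypothetical data

# Yields Docs one by one, processing the texts in batches internally
for doc in nlp.pipe(LOTS_OF_TEXTS, batch_size=50):
    print(len(doc))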



Passing in context (1)
Setting as_tuples=True on nlp.pipe lets you pass in (text, context) tuples

Yields (doc, context) tuples

Useful for associating metadata with the doc

data = [
('This is a text', {'id': 1, 'page_number': 15}),
('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

This is a text 15
And another text 16



Passing in context (2)
from spacy.tokens import Doc

Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)

data = [
('This is a text', {'id': 1, 'page_number': 15}),
('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']



Using only the tokenizer

If you only need a tokenized Doc, don't run the whole pipeline!



Using only the tokenizer (2)
Use nlp.make_doc to turn a text into a Doc object

BAD:

doc = nlp("Hello world!")

GOOD:

doc = nlp.make_doc("Hello world!")
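A quick sketch to confirm that make_doc only tokenizes; the is_tagged and is_parsed flags used here are spaCy v2 attributes:

doc = nlp.make_doc("Hello world!")
print([token.text for token in doc])  # tokens are available
print(doc.is_tagged, doc.is_parsed)   # False False - no components ran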



Disabling pipeline components
Use nlp.disable_pipes to temporarily disable one or more pipes

# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)

Restores them after the with block

Only runs the remaining components
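Side note: outside a with block, nlp.disable_pipes returns an object whose restore method puts the components back (spaCy v2 behavior, sketched here):

# Disable components manually and restore them later
disabled = nlp.disable_pipes('tagger', 'parser')
doc = nlp(text)
print(doc.ents)
disabled.restore()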



Let's practice!