Chapter 3

This chapter provides an overview of processing pipelines in spaCy, covering built-in components like the tagger, parser, and named entity recognizer. It explains how to create custom pipeline components, set extension attributes, and optimize performance when processing large volumes of text. It also covers passing in context with nlp.pipe, using only the tokenizer, and temporarily disabling pipeline components.


Processing pipelines

Advanced NLP with spaCy

Ines Montani
spaCy core developer
What happens when you call nlp?

First, the tokenizer turns the text into a Doc object; then each pipeline component is applied to the Doc in order.

doc = nlp("This is a sentence.")

Built-in pipeline components

Name     Description               Creates
tagger   Part-of-speech tagger     Token.tag
parser   Dependency parser         Token.dep, Token.head, Doc.sents, Doc.noun_chunks
ner      Named entity recognizer   Doc.ents, Token.ent_iob, Token.ent_type
textcat  Text classifier           Doc.cats
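The attributes in the "Creates" column can be inspected directly (a minimal sketch, assuming en_core_web_sm, which ships with a tagger, parser and ner but no textcat):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

print(doc[0].tag_)                                    # set by the tagger
print(doc[0].dep_, doc[0].head.text)                  # set by the parser
print([(ent.text, ent.label_) for ent in doc.ents])   # set by the ner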

Under the hood

Pipeline defined in model's meta.json in order

Built-in components need binary data to make predictions

Pipeline attributes
nlp.pipe_names: list of pipeline component names

print(nlp.pipe_names)

['tagger', 'parser', 'ner']

nlp.pipeline: list of (name, component) tuples

print(nlp.pipeline)

[('tagger', <spacy.pipeline.Tagger>),
('parser', <spacy.pipeline.DependencyParser>),
('ner', <spacy.pipeline.EntityRecognizer>)]

Let's practice!


Custom pipeline components

Why custom components?

Make a function execute automatically when you call nlp

Add your own metadata to documents and tokens

Update built-in attributes like doc.ents

Anatomy of a component (1)
Function that takes a doc, modifies it and returns it

Can be added using the nlp.add_pipe method

def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)

Anatomy of a component (2)
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)

Argument  Description            Example
last      If True, add last      nlp.add_pipe(component, last=True)
first     If True, add first     nlp.add_pipe(component, first=True)
before    Add before component   nlp.add_pipe(component, before='ner')
after     Add after component    nlp.add_pipe(component, after='tagger')
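For example (a minimal sketch, assuming an nlp object with the default ['tagger', 'parser', 'ner'] pipeline; debug_component is a hypothetical name), placing a component relative to an existing one:

# A component that inspects the doc just before the ner runs
def debug_component(doc):
    print('Tokens so far:', len(doc))
    return doc

# Insert it immediately before the entity recognizer
nlp.add_pipe(debug_component, before='ner')
print(nlp.pipe_names)  # ['tagger', 'parser', 'debug_component', 'ner']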

Example: a simple component (1)
# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

Pipeline: ['custom_component', 'tagger', 'parser', 'ner']

Example: a simple component (2)
# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Process a text
doc = nlp("Hello world!")

Doc length: 3

Let's practice!


Extension attributes

Setting custom attributes
Add custom metadata to documents, tokens and spans

Accessible via the ._ property

doc._.title = 'My document'
token._.is_color = True
span._.has_color = False

Registered on the global Doc, Token or Span using the set_extension method

# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)
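Once registered (a minimal sketch, assuming the extensions above and a loaded nlp object), the attributes can be read and written via the ._ property:

doc = nlp("The sky is blue.")

# Write to the Doc-level attribute extension
doc._.title = 'My document'

# Read back the values (the token and span still have their defaults)
print(doc._.title)           # 'My document'
print(doc[3]._.is_color)     # False
print(doc[1:4]._.has_color)  # False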

Extension attribute types
1. Attribute extensions

2. Property extensions

3. Method extensions

Attribute extensions
Set a default value that can be overwritten

from spacy.tokens import Token

# Set extension on the Token with default value
Token.set_extension('is_color', default=False)

doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True
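To confirm the overwrite (a small sketch under the same assumptions):

# Only 'blue' (doc[3]) was overwritten; the rest keep the default
print([(token.text, token._.is_color) for token in doc])
# [('The', False), ('sky', False), ('is', False), ('blue', True), ('.', False)]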

Property extensions (1)
Define a getter and an optional setter function

Getter only called when you retrieve the attribute value

from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)

True - blue

Property extensions (2)
Span extensions should almost always use a getter

from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)

True - sky is blue
False - The sky

Method extensions
Assign a function that becomes available as an object method

Lets you pass arguments to the extension function

from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)

doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

True - blue
False - cloud

Let's practice!


Scaling and performance

Processing large volumes of text
Use nlp.pipe method

Processes texts as a stream, yields Doc objects

Much faster than calling nlp on each text

BAD:

docs = [nlp(text) for text in LOTS_OF_TEXTS]

GOOD:

docs = list(nlp.pipe(LOTS_OF_TEXTS))
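As a fuller sketch (assuming a loaded nlp object and an iterable of strings named LOTS_OF_TEXTS), streaming also lets you consume one Doc at a time instead of keeping them all in memory:

# Stream the texts and process each Doc as it comes in
for doc in nlp.pipe(LOTS_OF_TEXTS):
    print([ent.text for ent in doc.ents])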

Passing in context (1)
Setting as_tuples=True on nlp.pipe lets you pass in (text, context) tuples

Yields (doc, context) tuples

Useful for associating metadata with the doc

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

This is a text 15
And another text 16

Passing in context (2)
from spacy.tokens import Doc

Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']

Using only the tokenizer

If you only need the tokenizer, don't run the whole pipeline!

Using only the tokenizer (2)
Use nlp.make_doc to turn a text into a Doc object

BAD:

doc = nlp("Hello world")

GOOD:

doc = nlp.make_doc("Hello world!")

Disabling pipeline components
Use nlp.disable_pipes to temporarily disable one or more pipes

# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)

Restores them after the with block

Only runs the remaining components
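To see the restoration in action (a minimal sketch, assuming en_core_web_sm and a text variable):

print(nlp.pipe_names)  # ['tagger', 'parser', 'ner']

with nlp.disable_pipes('tagger', 'parser'):
    # Only the remaining components are listed and run here
    print(nlp.pipe_names)  # ['ner']
    doc = nlp(text)

# The disabled components are restored afterwards
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner']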

Let's practice!