7/18/24, 8:50 PM 8.

Using RegEx with spaCy — Introduction to spaCy 3

8. Using RegEx with spaCy

Dr. W.J.B. Mattingly

Smithsonian Data Science Lab and United States Holocaust Memorial Museum

January 2021

8.1. Key Concepts in this Notebook

1. What is RegEx (Regular Expressions)?
2. The Strengths of RegEx
3. The Weaknesses of RegEx
4. How to use RegEx in Python
5. How to use RegEx in spaCy

8.2. What is Regular Expressions (RegEx)?

RegEx (short for Regular Expressions) is a way of achieving complex string matching based on simple or complex
patterns. It can be used to find and retrieve patterns or to replace matching patterns in a string with some
other pattern. It was invented by Stephen Cole Kleene in the 1950s and is still widely used today for numerous tasks,
particularly string matching in texts. RegEx is supported by many search tools and text editors and allows for more
robust searching. Nearly all data scientists, especially those who work with texts, use RegEx at some stage in their
workflow, from searching data, to cleaning data, to implementing machine learning models. It is an essential tool for
any text-based researcher. For these reasons, it merits a few chapters in this textbook.

In spaCy, it can be leveraged in a few different pipes (depending on the task at hand, as we shall see) to identify
entities or to perform pattern matching.
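As a minimal sketch of the two uses described above (finding a pattern and replacing it), the snippet below uses Python's built-in re module, which is introduced later in this chapter. The sample sentence and the YEAR placeholder are made up for illustration and do not come from the original notebook.

```python
import re

# Hypothetical sample sentence, used only for illustration.
text = "The archive opened in 1993 and moved in 2004."

# Find every standalone run of four digits (a rough stand-in for a year).
years = re.findall(r"\b\d{4}\b", text)
print(years)

# Replace each match with a placeholder string.
redacted = re.sub(r"\b\d{4}\b", "YEAR", text)
print(redacted)
```

The same pattern serves both operations: findall retrieves the matches, while sub rewrites them in place.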

8.3. The Strengths of RegEx

There are several strengths to RegEx.

1. Its syntax, though complex, is compact, which allows programmers to write robust rules in very little space.
2. It allows the researcher to capture many kinds of variation in strings.
3. It can perform remarkably quickly when compared to other methods.
4. It is supported in nearly every programming language.
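To illustrate the first two strengths, here is a small sketch in which one short rule covers several spelling and capitalization variants. The sentence is invented for the example.

```python
import re

# Invented sentence containing three variants of the same word.
text = "Color, colour, and COLOUR all appear in the catalogue."

# "u?" makes the letter u optional; IGNORECASE covers capitalization.
variants = re.findall(r"colou?r", text, flags=re.IGNORECASE)
print(variants)
```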

8.4. The Weaknesses of RegEx

Despite these strengths, there are a few weaknesses to RegEx.

1. Its syntax is quite difficult for beginners. (I still find myself looking up how to do certain things.)
2. In order to work well, it requires a domain expert to work alongside the programmer to think of all the ways a pattern
may vary in texts.
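The second weakness can be sketched as follows. The first pattern only knows full month names; the second was extended with abbreviations that a domain expert might point out. Both patterns and the sample sentence are hypothetical, and the abbreviation list is deliberately incomplete.

```python
import re

# A pattern written without domain knowledge: full month names only.
narrow = r"\d{1,2} (?:January|February|August)"

# A pattern extended after noticing abbreviated months in the sources.
# The abbreviations listed here are illustrative, not exhaustive.
broad = r"\d{1,2} (?:January|Jan\.|February|Feb\.|August|Aug\.)"

text = "Letters dated 2 Feb. and 14 August survive."

narrow_hits = re.findall(narrow, text)
broad_hits = re.findall(broad, text)
print(narrow_hits)
print(broad_hits)
```

The narrow pattern silently misses the abbreviated date, which is exactly the kind of gap that is hard to spot without knowing the sources.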

8.5. How to Use RegEx in Python

Python comes prepackaged with a RegEx library. We can import it like so:

import re

Now that we have it imported, we can begin to write out some RegEx rules. Let’s say we want to find an occurrence of a
date in a text. As noted in an earlier notebook, there are a finite number of ways this can be represented. Let’s try to
grab all instances of a day followed by a month first.

https://spacy.pythonhumanities.com/02_05_simple_regex.html 1/5

pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"

text = "This is a date 2 February. Another date would be 14 August."

matches = re.findall(pattern, text)
print (matches)

[('2 February', '2', 'February'), ('14 August', '4', 'August')]

In this bit of code, we see a real-life RegEx formula at work. While this looks quite complex, its syntax is fairly
straightforward. Let’s break it down. The first ( opens a group that runs to the final ). Because this group wraps
everything, it tells RegEx to return the whole match, not just its components.

Next, we state (\d){1,2}. This means that we are looking for any digit (0-9) that occurs either once or twice ({1,2}).

Next, we have a space to indicate the space in the string that we would expect with a date.

Next, we have (January|February|March|April|May|June|July|August|September|October|November|December). This
indicates another component of the pattern (because it is in parentheses). The | indicates the same concept as “or” in
English, so either January, or February, or March, etc.
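One detail in the output above is worth a short sketch: the second match reports '4' rather than '14' for the digit group. When a quantifier such as {1,2} sits outside the parentheses, the group is overwritten on each repetition and keeps only the last digit. The alternative pattern with the quantifier inside the group is shown for comparison and is not part of the original notebook.

```python
import re

text = "14 August"

# Quantifier outside the group: the group is overwritten on each
# repetition, so only the last digit is kept.
last_digit = re.search(r"(\d){1,2} August", text).group(1)

# Quantifier inside the group: the group holds both digits.
both_digits = re.search(r"(\d{1,2}) August", text).group(1)

print(last_digit)
print(both_digits)
```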

When we bring it together, this pattern will match anything that functions as a set of one or two numbers followed by a
month. What happens when we try and do this with a date that is formed the opposite way?

text = "This is a date February 2. Another date would be 14 August."

matches = re.findall(pattern, text)
print (matches)

[('14 August', '4', 'August')]

It fails. But this is no fault of RegEx. Our pattern cannot accommodate that variation. Nevertheless, we can account for it
by adding it as a second alternative. Alternatives are separated with a |, the same “or” symbol we used for the months.

pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"

text = "This is a date February 2. Another date would be 14 August."

matches = re.findall(pattern, text)
print (matches)

[('February 2', '', '', '', '', 'February 2', 'February ', 'February', '2'), ('14 August',
'14 August', '4', ' August', 'August', '', '', '', '')]

There are more concise ways to write the same RegEx formula. I have opted here to be more verbose to make it a bit
easier to read. You can see that we’ve allowed for two main options for our pattern matcher.

Notice, however, that we have a lot of superfluous information for each match. These are the components of each
match. There are several ways we can remove them. One way is to use the function re.finditer rather than
re.findall.

text = "This is a date February 2. Another date would be 14 August."

iter_matches = re.finditer(pattern, text)
print (iter_matches)

<callable_iterator object at 0x00000217A415BC10>

This is an iterator object. We can loop over it, however, to get our results.

text = "This is a date February 2. Another date would be 14 August."

iter_matches = re.finditer(pattern, text)
print (iter_matches)
for hit in iter_matches:
    print (hit)

<callable_iterator object at 0x00000217A4256670>

<re.Match object; span=(15, 25), match='February 2'>
<re.Match object; span=(49, 58), match='14 August'>

Within each of these is some very salient information, such as the start and end location (inside the span) and the text
itself (match). We can use the start and end location to grab the text within the string.


text = "This is a date February 2. Another date would be 14 August."

iter_matches = re.finditer(pattern, text)
for hit in iter_matches:
    start = hit.start()
    end = hit.end()
    print (text[start:end])

February 2
14 August
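The text above mentions that the same formula can be written more concisely and that there are several ways to drop the superfluous components. One possible sketch, which is not the author's version, builds the month list once and uses non-capturing groups, written (?:...), so that findall returns only the whole matches. The variable name months is introduced here for illustration.

```python
import re

# Build the month alternation once instead of typing it twice.
months = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# (?:...) groups without capturing, so findall returns only whole matches.
# Doubled braces are needed because this is an f-string.
pattern = rf"(?:\d{{1,2}} (?:{months})|(?:{months}) \d{{1,2}})"

text = "This is a date February 2. Another date would be 14 August."

matches = re.findall(pattern, text)
print(matches)
```

The trade-off is that the individual components (day and month) are no longer returned separately; if they are needed, capturing groups or named groups would have to be reintroduced.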

8.6. How to Use RegEx in spaCy

Things like dates, times, IP addresses, etc. that have either consistent or fairly consistent structures are excellent
candidates for RegEx. Fortunately, spaCy has easy ways to implement RegEx in three components: the Matcher, PhraseMatcher,
and EntityRuler. One of the major drawbacks of the Matcher and PhraseMatcher is that they do not store their matches
in doc.ents. Because this textbook is about NER and our goal is to store the entities in doc.ents, we will focus on
using RegEx with the EntityRuler. In the next notebook, we will examine other methods.

In the previous notebook, we saw how the code below allowed for us to capture the phone number in the string. I have
modified it a bit here for reasons that will become a bit more clear below.

#Import the requisite library
import spacy

#Sample text
text = "This is a sample number 555-5555."

#Start from a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
    {"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"},
     {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]}
]

#add patterns to ruler
ruler.add_patterns(patterns)

#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

555-5555 PHONE_NUMBER

This method worked well for grabbing the phone number. But what if we wanted to use RegEx as opposed to linguistic
features, such as shape? First, let’s write some RegEx to capture 555-5555.

pattern = r"((\d){3}-(\d){4})"
text = "This is a sample number 555-5555."
matches = re.findall(pattern, text)
print (matches)

[('555-5555', '5', '5')]

Okay. So, now we know that we have a RegEx pattern that works. Let’s try and implement it in the spaCy EntityRuler. We
can do that with the code below. When we execute the code below, we have no output.


#Import the requisite library
import spacy

#Sample text
text = "This is a sample number (555) 555-5555."

#Start from a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
#The raw string (r"...") keeps the backslashes in \d intact
patterns = [
    {
        "label": "PHONE_NUMBER",
        "pattern": [{"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}]
    }
]

#add patterns to ruler
ruler.add_patterns(patterns)

#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

This is for one very important reason. spaCy’s EntityRuler cannot use RegEx to pattern match across tokens: the pattern
is tested against one token at a time. The tokenizer splits the phone number at the dash, so no single token contains
the whole number and the pattern never matches. So, what are we to do in this scenario? Well, we have a few different
options that we will explore in the next notebook. But before we get to that, let’s try and use RegEx to capture the
phone number with no hyphen.

#Import the requisite library
import spacy

#Sample text
text = "This is a sample number 5555555."

#Start from a blank English pipeline
nlp = spacy.blank("en")

#Create the Ruler and Add it
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
#The raw string (r"...") keeps the backslash in \d intact
patterns = [
    {
        "label": "PHONE_NUMBER",
        "pattern": [{"TEXT": {"REGEX": r"((\d){5})"}}]
    }
]

#add patterns to ruler
ruler.add_patterns(patterns)

#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

5555555 PHONE_NUMBER

Notice that without the dash and a few modifications to our RegEx, we were able to capture 5555555 because this is a
single token in the spaCy doc object. Let’s explore how to solve the problem in the next notebook!
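The behavior seen in the last two examples can be imitated in plain Python. The sketch below does not call spaCy; the two token lists are hard-coded stand-ins for tokenizer output (an assumption based on the earlier SHAPE-based pattern, which needed separate entries for 555, the dash, and 5555), and the helper token_hits is hypothetical. It also suggests why a five-digit pattern labels a seven-digit token: the output above is consistent with the pattern being searched for anywhere inside the token rather than matched against the whole token.

```python
import re

def token_hits(tokens, pattern):
    """Return the tokens in which the pattern is found."""
    return [tok for tok in tokens if re.search(pattern, tok)]

# Hard-coded stand-ins for tokenizer output (assumed, not produced by spaCy here).
hyphenated = ["This", "is", "a", "sample", "number", "555", "-", "5555", "."]
unhyphenated = ["This", "is", "a", "sample", "number", "5555555", "."]

# No single token contains the dash-separated number, so nothing is found.
hyphen_result = token_hits(hyphenated, r"\d{3}-\d{4}")
print(hyphen_result)

# The seven-digit token contains five consecutive digits, so it is found.
plain_result = token_hits(unhyphenated, r"\d{5}")
print(plain_result)

# Requiring the whole token to be exactly five digits rejects it.
whole_token_is_five_digits = re.fullmatch(r"\d{5}", "5555555") is not None
print(whole_token_is_five_digits)
```

If the intent is to accept only seven-digit tokens, a stricter pattern such as ^\d{7}$ would be one option.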

8.7. Video
%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/wpyCzodvO3A"
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>
</div>


How to Use spaCy's EntityRuler (Named Entity Recognition for DH 04 | Part 0…


By William Mattingly
© Copyright 2021.
