Spacy Regex
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
In spaCy, RegEx can be leveraged in a few different pipes (depending on the task at hand, as we shall see) to identify
things such as entities or to perform pattern matching.
RegEx has several strengths:
1. Its compact syntax lets programmers write robust rules in very little space.
2. It lets the researcher find all types of variance in strings.
3. It can perform remarkably quickly when compared to other methods.
4. It is supported in virtually every programming language.
It also has a few weaknesses:
1. Its syntax is quite difficult for beginners. (I still find myself looking up how to do certain things.)
2. In order to work well, it requires a domain expert to work alongside the programmer to think of all the ways a
pattern may vary in texts.
import re
Now that we have it imported, we can begin to write out some RegEx rules. Let’s say we want to find an occurrence of a
date in a text. As noted in an earlier notebook, there are a finite number of ways this can be represented. Let’s try to
grab all instances of a day followed by a month first.
https://spacy.pythonhumanities.com/02_05_simple_regex.html 1/5
7/18/24, 8:50 PM 8. Using RegEx with spaCy — Introduction to spaCy 3
pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"
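To see the pattern at work, here is a minimal sketch that runs it with re.findall. The sample text is an assumption made for illustration.

```python
import re

# Day-then-month pattern from above
pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"

# Assumed sample text containing two day-month dates
text = "This is a date 2 February. Another date would be 14 August."

# findall returns one tuple per match, with one item per capture group
matches = re.findall(pattern, text)
print(matches)
# [('2 February', '2', 'February'), ('14 August', '4', 'August')]
```

The second item of each tuple holds only the last digit matched by (\d), which is why 14 shows up there as 4.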
In this bit of code, we see a real-life RegEx formula at work. While this looks quite complex, its syntax is fairly
straightforward. Let’s break it down. The first ( opens a group that runs all the way to the final ). In other words, I’m
asking RegEx to capture the whole match, not just its components.
Next, we state (\d){1,2}. This means that we are looking for any digit (0-9) that occurs either once or twice ({1,2}).
Next, we have a space to indicate the space in the string that we would expect with a date.
When we bring it together, this pattern will match one or two digits followed by a space and the name of a month. What
happens when we try and do this with a date that is formed the opposite way?
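A short sketch of that experiment follows. The sample text, with the month written before the day, is an assumption made for illustration.

```python
import re

# Same day-then-month pattern as before
pattern = r"((\d){1,2} (January|February|March|April|May|June|July|August|September|October|November|December))"

# Assumed sample text with the month written before the day
text = "This is a date February 2."

matches = re.findall(pattern, text)
print(matches)
# []
```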
It fails. But this is no fault of RegEx. Our pattern cannot accommodate that variation. Nevertheless, we can account for it
by adding it as a possible variation. Possible variations are separated with a |, which tells RegEx to match either the
pattern on its left or the pattern on its right.
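The alternation operator | can be seen in isolation in a small sketch. The pattern and sample text here are simplified assumptions (a single month) chosen to keep the example short.

```python
import re

# Two alternatives separated by |: "day May" or "May day"
pattern = r"\d{1,2} May|May \d{1,2}"

# Assumed sample text with one date in each order
text = "We met on 3 May and again on May 17."

matches = re.findall(pattern, text)
print(matches)
# ['3 May', 'May 17']
```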
pattern = r"(((\d){1,2}( (January|February|March|April|May|June|July|August|September|October|November|December)))|(((January|February|March|April|May|June|July|August|September|October|November|December) )(\d){1,2}))"
[('February 2', '', '', '', '', 'February 2', 'February ', 'February', '2'), ('14 August',
'14 August', '4', ' August', 'August', '', '', '', '')]
There are more concise ways to write the same RegEx formula. I have opted here to be more verbose to make it a bit
easier to read. You can see that we’ve allowed for two main options for our pattern matcher.
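One possible concise version is sketched below. It is an assumption about what "more concise" could look like, not the author's own rewrite: the month list is stored once and wrapped in a non-capturing group, (?:...), so findall returns only the full matches. The sample text is also assumed.

```python
import re

# Non-capturing group so the month list adds no extra items to each result
months = r"(?:January|February|March|April|May|June|July|August|September|October|November|December)"

# Either "day month" or "month day"
pattern = r"\d{1,2} " + months + "|" + months + r" \d{1,2}"

# Assumed sample text with one date in each order
text = "This is a date February 2. Another date would be 14 August."

matches = re.findall(pattern, text)
print(matches)
# ['February 2', '14 August']
```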
Notice, however, that we have a lot of superfluous information for each match. These are the components (the capture
groups) of each match. There are several ways we can remove them. One way is to use the function finditer, rather than
findall, in RegEx.
finditer returns an iterator object rather than a list. We can, however, loop over it and get our results.
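A minimal sketch of looping over the iterator is given here. The sample text is an assumption, and the pattern is the two-way date pattern from above, built from a months variable only to keep the line short.

```python
import re

# Same two-way date pattern as above, assembled from one month list
months = "(January|February|March|April|May|June|July|August|September|October|November|December)"
pattern = r"(((\d){1,2}( " + months + r"))|((" + months + r" )(\d){1,2}))"

# Assumed sample text with one date in each order
text = "This is a date February 2. Another date would be 14 August."

# Turn the iterator into a list so it can be inspected more than once
iter_matches = list(re.finditer(pattern, text))
for match in iter_matches:
    print(match)  # each item is a re.Match object showing its span and matched text
```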
Within each of these is some very salient information, such as the start and end location (inside the span) and the text
itself (match). We can use the start and end location to grab the text within the string.
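The slicing step can be sketched as follows, again with an assumed sample text and the same two-way date pattern.

```python
import re

# Same two-way date pattern as above, assembled from one month list
months = "(January|February|March|April|May|June|July|August|September|October|November|December)"
pattern = r"(((\d){1,2}( " + months + r"))|((" + months + r" )(\d){1,2}))"

# Assumed sample text with one date in each order
text = "This is a date February 2. Another date would be 14 August."

dates = []
for match in re.finditer(pattern, text):
    start, end = match.span()      # start and end location of the whole match
    dates.append(text[start:end])  # slice the original string with them
    print(text[start:end])
```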
February 2
14 August
In the previous notebook, we saw how the code below allowed us to capture the phone number in the string. I have
modified it a bit here for reasons that will become clearer below.
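import spacy
#Sample text
text = "This is a sample number 555-5555."
nlp = spacy.blank("en")  #blank English pipeline
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "dddd"}]}])  #shape-based pattern, assumed here
doc = nlp(text)
#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)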
This method worked well for grabbing the phone number. But what if we wanted to use RegEx as opposed to linguistic
features, such as shape? First, let’s write some RegEx to capture 555-5555.
pattern = r"((\d){3}-(\d){4})"
text = "This is a sample number 555-5555."
matches = re.findall(pattern, text)
print (matches)
Okay. So, now we know that we have a RegEx pattern that works. Let’s try and implement it in the spaCy EntityRuler. We
can do that with the code below. When we execute the code below, we have no output.
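import spacy
#Sample text
text = "This is a sample number (555) 555-5555."
nlp = spacy.blank("en")  #blank English pipeline
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"((\d){3}-(\d){4})"}}]}])  #RegEx from above, applied to each token
doc = nlp(text)
#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)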
This is for one very important reason. spaCy’s EntityRuler cannot use RegEx to pattern match across tokens. The
tokenizer splits the phone number at the dash, so no single token contains the whole number, and that throws off the
EntityRuler. So, what are we to do in this scenario? Well, we have a few different options that we will explore in the next
notebook. But before we get to that, let’s try and use RegEx to capture the phone number with no hyphen.
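import spacy
#Sample text
text = "This is a sample number 5555555."
#Build upon a blank English pipeline
nlp = spacy.blank("en")
#Create the EntityRuler and add it to the pipeline
ruler = nlp.add_pipe("entity_ruler")
#Token-level RegEx pattern (assumed here): seven consecutive digits
ruler.add_patterns([{"label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"(\d){7}"}}]}])
#create the doc
doc = nlp(text)
#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)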
5555555 PHONE_NUMBER
Notice that without the dash and a few modifications to our RegEx, we were able to capture 5555555 because this is a
single token in the spaCy doc object. Let’s explore how to solve the problem in the next notebook!
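The token-by-token behavior can be mimicked in plain Python. This is only a simulation under stated assumptions: the token lists are written by hand to resemble how spaCy splits each sentence, the helper tokens_matching is hypothetical, and spaCy's own tokenizer and matcher are not used.

```python
import re

# RegEx for the hyphenated number, as written earlier
hyphen_pattern = re.compile(r"((\d){3}-(\d){4})")
# Assumed RegEx for the number with no hyphen: seven consecutive digits
plain_pattern = re.compile(r"(\d){7}")

# Hand-written token lists (assumption) resembling spaCy's tokenization
hyphen_tokens = ["This", "is", "a", "sample", "number", "555", "-", "5555", "."]
plain_tokens = ["This", "is", "a", "sample", "number", "5555555", "."]

def tokens_matching(regex, tokens):
    """Return the tokens whose own text contains a match for regex."""
    return [tok for tok in tokens if regex.search(tok)]

# No single token contains the hyphenated number, so nothing matches
no_hits = tokens_matching(hyphen_pattern, hyphen_tokens)
print(no_hits)  # []

# Without the hyphen, the number is one token and the RegEx finds it
hits = tokens_matching(plain_pattern, plain_tokens)
print(hits)  # ['5555555']

# Applied to the whole string, the hyphenated pattern still matches
whole = hyphen_pattern.search("This is a sample number 555-5555.").group(0)
print(whole)  # 555-5555
```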
8.7. Video
%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/wpyCzodvO3A"
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>
</div>
By William Mattingly
© Copyright 2021.