CS50’s Introduction to Programming with Python
Lecture 7
David J. Malan

Regular Expressions
Case Sensitivity
Cleaning Up User Input
Extracting User Input
Summing Up

Regular Expressions
Regular expressions, or “regexes”, enable us to examine patterns within text. For example, we might want to validate that an email address is formatted correctly. Regular expressions let us check input against a pattern of this kind.
To begin, type code validate.py in the terminal window. Then, code as follows in the text editor:
Notice how the endswith method checks whether domain ends with .edu . Still, a nefarious user could break our code. For example, a user could type in [email protected] and it would be considered valid.
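The code this passage describes is not reproduced above, so the following is only a minimal sketch of the kind of check being discussed. The function name is_valid_naive and the sample inputs are assumptions made for illustration; only the use of endswith on the domain comes from the text.

```python
def is_valid_naive(email: str) -> bool:
    """Naive check: one "@", a non-empty username, and a domain ending in .edu."""
    email = email.strip()
    # Require exactly one "@" so that split() yields exactly two parts.
    if email.count("@") != 1:
        return False
    username, domain = email.split("@")
    # endswith() only inspects the tail of the domain, nothing before it.
    return bool(username) and domain.endswith(".edu")


print(is_valid_naive("malan@harvard.edu"))  # illustrative well-formed address
print(is_valid_naive("x@.edu"))  # hypothetical input with nothing before ".edu"
```

Notice how the second input passes even though its domain is only .edu , which is the sort of gap that motivates a more expressive tool.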
Indeed, we could keep iterating upon this code ourselves. However, it turns out that Python has an existing library called re that has a number of built-in functions that can validate user input against patterns.
One of the most versatile functions within the re library is search .
The search function follows the signature re.search(pattern, string, flags=0) . Following this signature, we can modify our code as follows:
import re

email = input("What's your email? ").strip()

if re.search("@", email):
    print("Valid")
else:
    print("Invalid")
Notice this does not increase the functionality of our program at all. In fact, it is somewhat of a step back.
We can further our program’s functionality. However, we need to advance our vocabulary around validation. It turns out that in the world of regular expressions there are certain symbols that allow us to identify patterns. At this point, we have only been checking for specific pieces of text like @ . It so happens that many special symbols can be included in a pattern for the purpose of validation. A non-exhaustive list of those symbols is as follows:
. any character except a newline
* 0 or more repetitions
+ 1 or more repetitions
? 0 or 1 repetition
{m} m repetitions
{m,n} m-n repetitions
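To see the symbols listed above in action, here is a small sketch. The sample patterns and strings are made up for illustration, and it uses re.fullmatch (mentioned later in these notes) so that each pattern must account for the entire string.

```python
import re

# (pattern, text) pairs chosen to exercise one symbol each.
examples = [
    (r"a.c", "abc"),         # . stands for any single character
    (r"ab*c", "ac"),         # * allows zero repetitions of b
    (r"ab+c", "ac"),         # + requires at least one b, so this fails
    (r"ab?c", "abc"),        # ? allows zero or one b
    (r"ab{2}c", "abbc"),     # {2} requires exactly two b's
    (r"ab{1,3}c", "abbbbc"), # {1,3} allows one to three b's, so four fail
]

results = [re.fullmatch(pattern, text) is not None for pattern, text in examples]
for (pattern, text), matched in zip(examples, results):
    print(f"{pattern!r} against {text!r}: {matched}")
```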
Notice the depiction of the state machine of our regular expression. On the left, evaluation begins and proceeds from left to right. Once we reach the state q1, input is read time and time again based on the expression handed to it. Then, the state changes to q2. Again, the arrow indicates how the expression will be evaluated time and time again based upon our programming. Then, as depicted by the double circle, the final state of the state machine is reached.
Considering the regular expression we used in our code, .+@.+ , you can visualize it as follows:
Notice how q1 is any character provided by the user, with q2 covering 1 or more repetitions of characters. This is followed by the @ symbol. Then, q3 looks for any character provided by the user, with q4 covering 1 or more repetitions of characters.
The functions in the re library, such as re.search , look for patterns.
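As a rough illustration of what .+@.+ accepts, consider the sketch below. The helper name has_at_with_neighbors and the sample inputs are assumptions for this example, not part of the lecture code.

```python
import re


def has_at_with_neighbors(text: str) -> bool:
    # .+ needs at least one character on each side of the literal @.
    return re.search(r".+@.+", text) is not None


for sample in ["malan@harvard.edu", "a@b", "@", "malan@", "@harvard"]:
    print(sample, has_at_with_neighbors(sample))
```

Notice how anything at all on both sides of the @ is enough, which is why the pattern needs further tightening.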
Continuing our improvement of this code, we could modify it as follows:
import re

email = input("What's your email? ").strip()

if re.search(".+@.+.edu", email):
    print("Valid")
else:
    print("Invalid")
Notice, however, that one could type in malan@harvard?edu and it could be considered valid. Why is this the case? You might recognize
that in the language of validation, a . means any character!
We can modify our code as follows:
import re

email = input("What's your email? ").strip()

if re.search(r".+@.+\.edu", email):
    print("Valid")
else:
    print("Invalid")
Notice how we utilize the “escape character”, or \ , as a way of regarding the . as a literal period instead of as a special symbol in our validation expression. Testing your code, you will notice that [email protected] is regarded as valid, whereas malan@harvard?edu is invalid.
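The difference between the unescaped and escaped period can be checked directly. This is a small sketch using the two patterns from the text; the variable names are illustrative.

```python
import re

unescaped = r".+@.+.edu"  # the . before edu matches any character
escaped = r".+@.+\.edu"   # \. matches only a literal period

text = "malan@harvard?edu"
unescaped_matches = re.search(unescaped, text) is not None
escaped_matches = re.search(escaped, text) is not None
print(unescaped_matches, escaped_matches)
```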
Now that we’re using escape characters, it’s a good time to introduce “raw strings”. In Python, raw strings are strings that don’t process escape sequences; instead, each character is taken at face value. Consider \n , for example. We’ve seen in an earlier lecture how, in a regular string, these two characters become one: a special newline character. In a raw string, however, \n is treated not as the special newline character, but as a single \ and a single n . Placing an r in front of a string tells the Python interpreter to treat the string as a raw string, similar to how placing an f in front of a string tells the Python interpreter to treat the string as a format string:
import re

email = input("What's your email? ").strip()

if re.search(r".+@.+\.edu", email):
    print("Valid")
else:
    print("Invalid")
Now we’ve ensured the Python interpreter won’t treat \. as an escape sequence of its own. Instead, the pattern is simply a \ followed by a . , which, in regular expression terms, means matching a literal “.”.
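A quick way to convince yourself of the difference between regular and raw strings is to compare their lengths. The variable names below are illustrative.

```python
regular = "\n"  # one character: a newline
raw = r"\n"     # two characters: a backslash and the letter n

print(len(regular), len(raw))

# In a regular string, a literal backslash must itself be escaped,
# so these two spellings denote the same two-character pattern text.
same_pattern = r"\." == "\\."
print(same_pattern)
```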
You can still imagine how our users could create problems for us! For example, you could type in a sentence such as My email address is [email protected]. and this whole sentence would be considered valid. We can be even more precise in our coding.
It just so happens we have more special symbols at our disposal in validation:
^ matches the start of the string
$ matches the end of the string
import re

email = input("What's your email? ").strip()

if re.search(r"^.+@.+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
Notice this has the effect of requiring the pattern to match from the very start to the very end of the string being validated. Typing in a sentence such as My email address is [email protected]. now is regarded as invalid.
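The effect of ^ and $ can be seen by running both versions of the pattern against a full sentence. The sentence below uses an illustrative address chosen for this sketch.

```python
import re

sentence = "My email address is malan@harvard.edu."  # illustrative input

# Without anchors, the pattern may match somewhere inside the sentence.
unanchored = re.search(r".+@.+\.edu", sentence) is not None
# With anchors, the string must end in .edu, but this one ends in a period.
anchored = re.search(r"^.+@.+\.edu$", sentence) is not None
print(unanchored, anchored)
```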
We propose we can do even better! Even though we are now looking for the username at the start of the string, the @ symbol, and the
domain name at the end, we could type in as many @ symbols as we wish! malan@@@harvard.edu is considered valid!
We can add to our vocabulary as follows:
[] set of characters
[^] complementing the set
import re

email = input("What's your email? ").strip()

if re.search(r"^[^@]+@[^@]+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
Notice that ^ at the start of the expression means to match at the start of the string. All the way at the end of our expression, $ means to match at the end of the string. [^@]+ means one or more characters, each of which is anything except an @ . Then, we have a literal @ . [^@]+\.edu means one or more characters other than @ , followed by a literal .edu . Typing in malan@@@harvard.edu is now regarded as invalid.
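Here is a small sketch that wraps the pattern above in a function so it can be tried on a few inputs. The function name is_valid is an assumption for illustration.

```python
import re


def is_valid(email: str) -> bool:
    # [^@]+ matches one or more characters that are not @.
    return re.search(r"^[^@]+@[^@]+\.edu$", email) is not None


print(is_valid("malan@harvard.edu"))    # exactly one @
print(is_valid("malan@@@harvard.edu"))  # extra @ symbols are rejected
```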
We can still improve this regular expression further. It turns out there are certain requirements for what an email address can be! Currently, our validation expression is far too accommodating. We might only want to allow letters, digits, and underscores. We can modify our code as follows:
import re

email = input("What's your email? ").strip()

if re.search(r"^[a-zA-Z0-9_]+@[a-zA-Z0-9_]+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
Notice that [a-zA-Z0-9_] tells the validation that each character must be between a and z , between A and Z , between 0 and 9 , or an _ symbol. Testing the input, you’ll find that many potential user mistakes are now caught.
Thankfully, common patterns have been built into regular expressions by hard-working programmers. In this case, you can modify your code as follows:
import re

email = input("What's your email? ").strip()

if re.search(r"^\w+@\w+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
\d decimal digit
\D not a decimal digit
\s whitespace character
\S not a whitespace character
\w word character (letters, as well as numbers and the underscore)
\W not a word character
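The shorthand classes above can be tried out in isolation. The sample strings are made up for this sketch, and re.findall is used here only as a convenient way to list every match; for plain ASCII input like this, \w behaves like [a-zA-Z0-9_].

```python
import re

word_only = re.search(r"^\w+$", "malan_50") is not None  # letters, digits, _
has_space = re.search(r"\s", "David Malan") is not None  # finds the space
digits = re.findall(r"\d", "CS50 2024")                  # every decimal digit
non_word = re.search(r"\W", "malan@harvard") is not None # @ is not a word character

print(word_only, has_space, digits, non_word)
```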
Now, we know that not every email address ends in .edu . We could modify our code as follows:
import re

email = input("What's your email? ").strip()

if re.search(r"^\w+@\w+\.(com|edu|gov|net|org)$", email):
    print("Valid")
else:
    print("Invalid")
A|B either A or B
(...) a group
(?:...) non-capturing version
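The following sketch shows alternation inside a group, and how a capturing group differs from its non-capturing version. The sample addresses and variable names are illustrative.

```python
import re

pattern = r"^\w+@\w+\.(com|edu|gov|net|org)$"

match = re.search(pattern, "malan@harvard.edu")
tld = match.group(1) if match else None  # text captured by (...)
print(tld)

# (?:...) groups the alternatives without capturing what they matched.
non_capturing = re.search(r"^\w+@\w+\.(?:com|edu|gov|net|org)$", "malan@harvard.edu")
print(non_capturing.groups())

# An ending that is not among the alternatives does not match.
rejected = re.search(pattern, "malan@harvard.xyz")
print(rejected)
```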
Case Sensitivity
To illustrate how you might address issues around case sensitivity, where there is a difference between EDU and edu and the like, let’s
rewind our code to the following:
import re

email = input("What's your email? ").strip()

if re.search(r"^\w+@\w+\.edu$", email):
    print("Valid")
else:
    print("Invalid")
re.IGNORECASE
re.MULTILINE
re.DOTALL
import re
Notice how we added a third parameter re.IGNORECASE . Running this program with [email protected] , the input is now considered
valid.
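A minimal sketch of the change described above, assuming the pattern from the previous example is kept as is and only the flags argument is added. The function name is_valid_edu and the sample addresses are illustrative.

```python
import re


def is_valid_edu(email: str) -> bool:
    # re.IGNORECASE lets "edu" in the pattern also match "EDU", "Edu", etc.
    return re.search(r"^\w+@\w+\.edu$", email, re.IGNORECASE) is not None


print(is_valid_edu("MALAN@HARVARD.EDU"))
print(is_valid_edu("malan@harvard.edu"))
```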
Consider the following email address [email protected] . Using our code above, this would be considered invalid. Why might that
be?
Since there is an additional . , the program considers this invalid.
It turns out that, looking at our vocabulary from before, we can group ideas together.
A|B either A or B
(...) a group
(?:...) non-capturing version
import re
Notice how the (\w+\.)? indicates that this new expression can be there once or not at all. Hence, both [email protected] and [email protected] are considered valid.
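One way the optional group might be placed in the pattern is sketched below. Only the (\w+\.)? fragment comes from the text; the rest of the pattern, the function name, and the sample addresses are assumptions for illustration.

```python
import re


def is_valid_edu(email: str) -> bool:
    # (\w+\.)? allows one optional subdomain, such as "cs50.", before the domain.
    pattern = r"^\w+@(\w+\.)?\w+\.edu$"
    return re.search(pattern, email, re.IGNORECASE) is not None


print(is_valid_edu("malan@harvard.edu"))       # no subdomain
print(is_valid_edu("malan@cs50.harvard.edu"))  # one subdomain (hypothetical address)
print(is_valid_edu("malan@a.b.harvard.edu"))   # two subdomains are not allowed
```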
Interestingly enough, the edits we have done so far to our code do not fully encompass all the checking that could be done to ensure a
valid email address. Indeed, here is the full expression that one would have to type to ensure that a valid email is inputted:
^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]
There are other functions within the re library you might find useful. re.match and re.fullmatch are ones you might find
exceedingly useful.
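To illustrate how these functions differ from re.search , here is a brief sketch. The sample text and variable names are illustrative.

```python
import re

text = "malan@harvard.edu is my address"  # illustrative input
pattern = r"\w+@\w+\.edu"

anywhere = re.search(pattern, text)   # the match may start anywhere
at_start = re.match(pattern, text)    # the match must start at the beginning
whole = re.fullmatch(pattern, text)   # the match must cover the entire string

print(anywhere.group(0))
print(at_start is not None, whole is not None)
```

Notice how re.fullmatch behaves as if the pattern were wrapped in ^ and $ , which can save typing when the whole input must match.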
You can learn more in Python’s documentation of re.
Cleaning Up User Input
Notice that we have created, essentially, a “hello world” program. Running this program and typing in David , it works well! However, typing in Malan, David , notice how the program does not function as intended. How could we modify our program to clean up this input?
Modify your code as follows.
Notice how last, first = name.split(", ") is run if there is a , in the name. Then, the name is standardized as first and last.
Running our code, typing in Malan, David , you can see how this program does clean up at least one scenario where a user types in
something unexpected.
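The approach described above might look roughly like the sketch below. The function name clean_name is an assumption; only the name.split(", ") line and the reordering into first and last come from the text.

```python
def clean_name(name: str) -> str:
    name = name.strip()
    if "," in name:
        # Assumes the separator is exactly ", " (a comma followed by one space).
        last, first = name.split(", ")
        name = f"{first} {last}"
    return name


print(clean_name("David"))
print(clean_name("Malan, David"))
```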
You might notice that typing in Malan,David with no space causes the program to raise an error. Since we now know some regular expression syntax, let’s apply that to our code:
import re
Notice that re.search returns a match object, from which the parts of the user’s input captured by the pattern can be extracted, or None if nothing matches. Running this program and typing in David Malan , notice how the if condition is not run and the name is returned unchanged. If you run the program by typing Malan, David , the name is also returned properly.
It just so happens that we can request specific groups back using matches.group . We can modify our code as follows:
import re
import re
Notice how group(2) and group(1) are concatenated together with a space. The first group is that which is left of the comma. The
second group is that which is right of the comma.
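A sketch of how the groups might be used is shown below. The pattern ^(.+), (.+)$ and the function name clean_name are assumptions, since the lecture code is not reproduced above; the use of group(1) and group(2) follows the text.

```python
import re


def clean_name(name: str) -> str:
    name = name.strip()
    matches = re.search(r"^(.+), (.+)$", name)
    if matches:
        # group(1) is the text left of the comma, group(2) the text right of it.
        name = matches.group(2) + " " + matches.group(1)
    return name


print(clean_name("Malan, David"))
print(clean_name("David Malan"))
```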
Recognize that typing in Malan,David with no space is still not handled properly by our code. Therefore, we can make the following modification:
import re
Notice the addition of the * in our validation statement. This code will now accept and properly process Malan,David . Further, it will properly handle Malan,     David with many spaces in front of David .
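Assuming the * is placed after the space that follows the comma, the modified pattern could look like the sketch below. The exact pattern and the function name are assumptions for illustration.

```python
import re


def clean_name(name: str) -> str:
    name = name.strip()
    # " *" means zero or more spaces after the comma.
    matches = re.search(r"^(.+), *(.+)$", name)
    if matches:
        name = matches.group(2) + " " + matches.group(1)
    return name


print(clean_name("Malan,David"))
print(clean_name("Malan,     David"))
```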
It is very common to utilize re.search as we have in the previous examples, where matches is on a line of code after. However, we can
combine these statements:
import re
Notice how we combine two lines of our code. The walrus := operator assigns a value from right to left and allows us to ask a boolean
question at the same time. Turn your head sideways and you’ll see why this is called a walrus operator.
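Combining the assignment and the condition might look like the following sketch, which assumes the same hypothetical pattern as before and requires Python 3.8 or later for the := operator.

```python
import re


def clean_name(name: str) -> str:
    name = name.strip()
    # := stores the result of re.search in matches and tests it in one step.
    if matches := re.search(r"^(.+), *(.+)$", name):
        name = matches.group(2) + " " + matches.group(1)
    return name


print(clean_name("Malan, David"))
```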
You can learn more in Python’s documentation of re.
Extracting User Input
Notice that if we type in https://twitter.com/davidjmalan , it shows exactly what the user typed. However, how would we be able to extract just the username and ignore the rest of the URL?
You can imagine how we would simply be able to get rid of the beginning of the standard Twitter URL. We can attempt this as follows:
Notice how the replace method allows us to find one item and replace it with another. In this case, we are finding part of the URL and replacing it with nothing. Typing in the full URL https://twitter.com/davidjmalan , the program effectively outputs the username. However, what are some shortcomings of this current program?
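The replace-based attempt could be sketched as follows. The function name extract_username is an assumption for illustration.

```python
def extract_username(url: str) -> str:
    # str.replace removes the given text wherever it occurs in the string.
    return url.strip().replace("https://twitter.com/", "")


print(extract_username("https://twitter.com/davidjmalan"))
print(extract_username("twitter.com/davidjmalan"))  # left unchanged
```

Notice how the second input comes back untouched, because the text being replaced never appears in it.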
What if the user simply typed twitter.com instead of including the https:// and the like? You can imagine many scenarios where the user may input or neglect to input parts of the URL that would create strange output by this program. To improve this program, we can code as follows:
url = input("URL: ").strip()

username = url.removeprefix("https://twitter.com/")
print(f"Username: {username}")
Notice how we utilize the removeprefix method. This method will remove the beginning of a string.
Regular expressions simply allow us to succinctly express the patterns and goals.
Within the re library, there is a function called sub . This function allows us to substitute a pattern with something else.
The signature of the sub function is as follows:
re.sub(pattern, repl, string, count=0, flags=0)
Notice how pattern refers to the regular expression we are looking for. Then, there is a repl string that we can replace the pattern with. Finally, there is the string that we want to do the substitution on.
Implementing this method in our code, we can modify our program as follows:
import re
Notice how executing this program and inputting https://twitter.com/davidjmalan produces the correct outcome. However, there are some problems still present in our code.
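A first use of re.sub might look like the sketch below, which assumes the pattern is simply the URL prefix typed as is. The function name is illustrative.

```python
import re


def extract_username(url: str) -> str:
    # Replace the prefix with nothing; the . here is still unescaped
    # and the pattern is not yet anchored to the start of the string.
    return re.sub("https://twitter.com/", "", url.strip())


print(extract_username("https://twitter.com/davidjmalan"))
print(extract_username("My URL is https://twitter.com/davidjmalan"))
```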
The protocol, subdomain, and the possibility that the user inputted any part of the URL after the username are all reasons that this code is
still not ideal. We can further address these shortcomings as follows:
import re
Notice how the ^ caret was added to the pattern so that it matches only at the start of the URL. Notice also how the . could be interpreted as any character. Therefore, we escape it using a \ to make it \. . For the purpose of tolerating both http and https , we add a ? after the s , yielding https? , which makes the s optional. Further, to accommodate www we add (www\.)? to our pattern. Finally, just in case the user decides to leave out the protocol altogether, the http:// or https:// is made optional using (https?://)? .
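Putting the pieces described above together, the pattern might be assembled as in this sketch. The exact ordering of the fragments and the function name are assumptions.

```python
import re


def extract_username(url: str) -> str:
    # Optional protocol, optional www., then the escaped domain and a slash.
    pattern = r"^(https?://)?(www\.)?twitter\.com/"
    return re.sub(pattern, "", url.strip())


for sample in [
    "https://twitter.com/davidjmalan",
    "http://www.twitter.com/davidjmalan",
    "twitter.com/davidjmalan",
]:
    print(extract_username(sample))
```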
Still, we are blindly expecting that what the user inputted is a URL that, indeed, has a username.
Using our knowledge of re.search , we can further improve our code.
import re
Notice how we are searching for the regular expression above in the string provided by the user. In particular, we are capturing that which appears at the end of the URL using the (.+)$ regular expression. Therefore, if the user inputs a URL without a username, no output will be presented.
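Capturing the end of the URL might look like the sketch below. The full pattern and the function name are assumptions; only the (.+)$ fragment comes from the text.

```python
import re
from typing import Optional


def extract_username(url: str) -> Optional[str]:
    matches = re.search(r"^(https?://)?(www\.)?twitter\.com/(.+)$", url.strip())
    # The username is the third group because the two optional
    # groups before it are also capturing groups.
    return matches.group(3) if matches else None


print(extract_username("https://twitter.com/davidjmalan"))
print(extract_username("https://twitter.com/"))  # no username, so None
```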
Even further tightening up our program, we can utilize our := operator as follows:
import re
Notice that the ?: indicates that what is in that spot in our regular expression does not have to be captured.
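The effect of ?: on group numbering can be seen by comparing two versions of a pattern. Both patterns below are assumptions built from the fragments discussed in the text.

```python
import re

url = "https://www.twitter.com/davidjmalan"

capturing = re.search(r"^(https?://)?(www\.)?twitter\.com/(.+)$", url)
non_capturing = re.search(r"^(?:https?://)?(?:www\.)?twitter\.com/(.+)$", url)

print(capturing.groups())      # protocol, www., and username are all captured
print(non_capturing.groups())  # only the username is captured
```

Notice how, with the non-capturing version, the username becomes group(1) rather than group(3) .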
Still, we can be more explicit to ensure that the username inputted is correct. Using Twitter’s documentation, we can add the following to our regular expression:
import re

url = input("URL: ").strip()

if matches := re.search(r"^https?://(?:www\.)?twitter\.com/([a-z0-9_]+)", url, re.IGNORECASE):
    print("Username:", matches.group(1))
Notice that the [a-z0-9_]+ means that only a-z , 0-9 , and _ are accepted as part of the username. The + indicates that we are expecting one or more characters.
You can learn more in Python’s documentation of re.
Summing Up
Now, you’ve learned a whole new language of regular expressions that can be utilized to validate, clean up, and extract user input.
Regular Expressions
Case Sensitivity
Cleaning Up User Input
Extracting User Input