
HW_regex

September 18, 2021

1 Instructions HW - Regular Expression - 10 Points


• You have to submit two files for this part of the HW: (1) ipynb (Colab notebook) and (2) pdf file.
• You have to use only regular expressions for this HW. Please do not use spaCy for the tasks in this notebook.

[2]: import os
import re
import json
import pandas as pd
import numpy as np
from pathlib import Path
import tarfile

import warnings
warnings.filterwarnings("ignore")

[3]: from google.colab import drive


drive.mount('/content/drive')

Mounted at /content/drive

2 Task 1: Download data and combine data from multiple files into a single dataframe - 2 Points

In this task you have to download the movie reviews from the following link:
https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz

Instructions:
• The data has movie reviews from four different reviewers: (1) Dennis+Schwartz, (2) James+Berardinelli, (3) Scott+Renshaw and (4) Steve+Rhodes.
• You have to extract the reviews of the four reviewers into a single dataframe.
• The final dataframe should have two columns: (1) Movie Review and (2) Reviewer Name.

[43]: folder = Path('/content/drive/MyDrive/')

movie_rev = folder / 'movie_rege1'
!mkdir {str(movie_rev)}

[44]: basepath1 = str(movie_rev)

[45]: url = 'https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz'
!wget {url} -P {basepath1}

--2021-09-18 07:22:45--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)… 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 8853204 (8.4M) [application/x-gzip]
Saving to: ‘/content/drive/MyDrive/movie_rege1/scale_whole_review.tar.gz’

scale_whole_review. 100%[===================>]   8.44M  10.7MB/s    in 0.8s

2021-09-18 07:22:47 (10.7 MB/s) - ‘/content/drive/MyDrive/movie_rege1/scale_whole_review.tar.gz’ saved [8853204/8853204]

[108]: !tar -xvf '/content/drive/MyDrive/movie_rege1/scale_whole_review.tar.gz' -C '/content/drive/MyDrive/movie_rege1'
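
For reference, the shell call above can also be done in pure Python with the tarfile module that is already imported in the setup cell; a minimal sketch using the same paths:

[ ]: # pure-Python alternative to the !tar shell call above
with tarfile.open(basepath1 + '/scale_whole_review.tar.gz', 'r:gz') as tf:
    tf.extractall(path=basepath1)  # unpacks scale_whole_review/ under basepath1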

[105]: import os

label_names = []
movies = []
main_path = "/content/drive/MyDrive/movie_rege1/scale_whole_review"
for path in os.scandir(main_path):
    # each reviewer's plain-text reviews live under <reviewer>/txt.parag
    sub_path = main_path + f"/{path.name}/txt.parag"
    if os.path.isdir(sub_path):
        for each_path in os.listdir(sub_path):
            full_path = os.path.join(sub_path, each_path)
            # read the review file
            movie = open(full_path, encoding="utf8", errors="ignore").read()
            # add the review text and the reviewer (folder) name to the lists
            movies.append(movie)
            label_names.append(path.name)
print("****DONE READING*************")

****DONE READING*************

[106]: # check lengths
len(label_names), len(movies)

[106]: (5006, 5006)

[109]: # create a dataframe for the results
df = pd.DataFrame()

# add the review texts and reviewer names to the df
df['Movie Review'] = movies
df['Reviewer Name'] = label_names

[110]: # preview the results
df.head(10)

[110]:                                         Movie Review    Reviewer Name
0 DENNIS SCHWARTZ "Movie Reviews and Poetry"\nUN… Dennis+Schwartz
1 A brilliant, witty mock documentary of Jean Se… Dennis+Schwartz
2 NOSTALGHIA (director: Andrei Tarkovsky; cast: … Dennis+Schwartz
3 PAYBACK (director: Brian Helgeland; cast:(Port… Dennis+Schwartz
4 WAKING NED DEVINE (director: Kirk Jones (III);… Dennis+Schwartz
5 HAPPINESS (director: Todd Solondz; cast: Dylan… Dennis+Schwartz
6 LEON MORIN, PRIEST (director: Jean-Pierre Melv… Dennis+Schwartz
7 LES BICHES (THE DOES)(director: Claude Chabrol… Dennis+Schwartz
8 FUNNY GAMES ( director: Michael Haneke; cast: … Dennis+Schwartz
9 MEN WITH GUNS (director: John Sayles; cast: Fe… Dennis+Schwartz
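
For reference, the same collection step can be written more compactly with pathlib globbing, which avoids the manual path joins; a sketch assuming the same <reviewer>/txt.parag/<file> layout:

[ ]: # equivalent collection using pathlib glob (same layout assumed)
rows = []
for txt_file in Path(main_path).glob('*/txt.parag/*'):
    reviewer = txt_file.parts[-3]  # the reviewer folder name
    rows.append((txt_file.read_text(encoding='utf8', errors='ignore'), reviewer))
df_alt = pd.DataFrame(rows, columns=['Movie Review', 'Reviewer Name'])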

3 Task 2: We will perform the following tasks - 8 Points

• Download data (using wget)
• Clean data using regular expressions

[4]: folder = Path('/content/drive/MyDrive/')

movie_regex = folder / 'movie_rege'
!mkdir {str(movie_regex)}

[6]: basepath = str(movie_regex)

3.1 2.1 Download the data and create dataframe

Download the data set from the following URL: "http://www.trackmyhashtag.com/data/COVID-19.zip"

[7]: # Now we will use wget to get the data
url = 'http://www.trackmyhashtag.com/data/COVID-19.zip'
!wget {url} -P {basepath}

--2021-09-18 06:20:55--  http://www.trackmyhashtag.com/data/COVID-19.zip
Resolving www.trackmyhashtag.com (www.trackmyhashtag.com)… 138.197.74.186
Connecting to www.trackmyhashtag.com (www.trackmyhashtag.com)|138.197.74.186|:80… connected.
HTTP request sent, awaiting response… 301 Moved Permanently
Location: https://www.trackmyhashtag.com/data/COVID-19.zip [following]
--2021-09-18 06:20:55--  https://www.trackmyhashtag.com/data/COVID-19.zip
Connecting to www.trackmyhashtag.com (www.trackmyhashtag.com)|138.197.74.186|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 13772919 (13M) [application/zip]
Saving to: ‘/content/drive/MyDrive/movie_rege/COVID-19.zip’

COVID-19.zip        100%[===================>]  13.13M  14.4MB/s    in 0.9s

2021-09-18 06:20:56 (14.4 MB/s) - ‘/content/drive/MyDrive/movie_rege/COVID-19.zip’ saved [13772919/13772919]

[9]: !unzip /content/drive/MyDrive/movie_rege/COVID-19.zip

Archive: /content/drive/MyDrive/movie_rege/COVID-19.zip
inflating: COVID.csv
inflating: COVID-images.csv
inflating: COVID-videos.csv

[38]: # read the file
data2 = pd.read_csv("COVID.csv")

[39]: # extract only the tweet content
df = pd.DataFrame(data2['Tweet Content'])
df.columns = ["text"]

# preview
df.head()

[39]: text
0 Also the entire Swiss Football League is on ho…
1 World Health Org Official: Trump’s press confe…
2 I mean, Liberals are cheer-leading this #Coron…
3 Under repeated questioning, Pompeo refuses to …
4 #coronavirus comments now from @larry_kudlow h…

3.2 2.2 Change the hashtags to lower case

e.g. replace #Coronavirus with #coronavirus
Create a new column clean_text. The clean_text column should have the modified hashtags.

[40]: hashtag_pattern = r"#\w+"

def convert_hashtag_tolower(text):
    # lowercase each hashtag in place; a replacement function keeps every
    # match independent (substituting findall()[0] back in would overwrite
    # all hashtags in the tweet with the first one found)
    return re.sub(hashtag_pattern, lambda m: m.group(0).lower(), text)

# apply the function to convert
df["clean_text"] = df['text'].apply(convert_hashtag_tolower)

# preview
df.head()

[40]:                                                 text                                           clean_text
0  Also the entire Swiss Football League is on ho…  Also the entire Swiss Football League is on ho…
1  World Health Org Official: Trump’s press confe…  World Health Org Official: Trump’s press confe…
2  I mean, Liberals are cheer-leading this #Coron…  I mean, Liberals are cheer-leading this #coron…
3  Under repeated questioning, Pompeo refuses to …  Under repeated questioning, Pompeo refuses to …
4  #coronavirus comments now from @larry_kudlow h…  #coronavirus comments now from @larry_kudlow h…
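
The replacement-function form of re.sub is what makes the per-match lowercasing work; a quick standalone check with a made-up tweet:

[ ]: # each hashtag is lowercased independently of the others
sample = "Stay safe! #Coronavirus #StayHome"
print(re.sub(r"#\w+", lambda m: m.group(0).lower(), sample))
# Stay safe! #coronavirus #stayhome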

3.3 2.3 Remove RT from text in column clean_text


[41]: # remove the retweet marker "RT"
def remove_RT(text):
    # match RT as a whole token; the pattern [RT]+ would strip every
    # capital R and T anywhere (e.g. turning "Trump" into "rump")
    return re.sub(r'\bRT\b\s*', '', text)

df['clean_text'] = df['text'].apply(remove_RT)

# preview
df.tail()

[41]:                                                     text                                       clean_text
60155  El #coronavirus entérico felino es un virus in…  El #coronavirus entérico felino es un virus in…
60156  RT @timhquotes: It's my party, you're invited!…  @timhquotes: It's my party, you're invited!\n…
60157  It's my party, you're invited!\n\nPS, this is …  It's my party, you're invited!\n\nPS, this is …
60158  Amy’s a survivor! #bariclab #pnnl #movingon #c…  Amy’s a survivor! #bariclab #pnnl #movingon #c…
60159  A review of asymptomatic and sub-clinical Midd…  A review of asymptomatic and sub-clinical Midd…
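
A quick comparison on a made-up tweet shows why the word-boundary pattern matters here:

[ ]: # the character class strips every capital R and T anywhere in the text
sample = "RT @user: Trump talks"
print(re.sub(r'[RT]+', '', sample))      # ' @user: rump talks'
print(re.sub(r'\bRT\b\s*', '', sample))  # '@user: Trump talks'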

3.4 2.4 Remove URLs and links from text in column clean_text

[42]: # function to remove urls
def remove_urls(text):
    # match from http/https up to the next whitespace so query strings,
    # hyphens and other URL characters are removed as well
    return re.sub(r'https?://\S+', '', text)

df['clean_text'] = df['text'].apply(remove_urls)
# preview
df.tail(10)

[42]:                                                     text                                       clean_text
60150  “There may be specific interactions that are c…  “There may be specific interactions that are c…
60151  El #coronavirus entérico felino (FECV) es un v…  El #coronavirus entérico felino (FECV) es un v…
60152  Mediante microscopía electrónica, investigador…  Mediante microscopía electrónica, investigador…
60153  El #coronavirus entérico felino es un virus in…  El #coronavirus entérico felino es un virus in…
60154  RT @timhquotes: It's my party, you're invited!…  RT @timhquotes: It's my party, you're invited!…
60155  El #coronavirus entérico felino es un virus in…  El #coronavirus entérico felino es un virus in…
60156  RT @timhquotes: It's my party, you're invited!…  RT @timhquotes: It's my party, you're invited!…
60157  It's my party, you're invited!\n\nPS, this is …  It's my party, you're invited!\n\nPS, this is …
60158  Amy’s a survivor! #bariclab #pnnl #movingon #c…  Amy’s a survivor! #bariclab #pnnl #movingon #c…
60159  A review of asymptomatic and sub-clinical Midd…  A review of asymptomatic and sub-clinical Midd…
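
The difference from the narrower character class is easiest to see on a URL with a query string (hypothetical example):

[ ]: # [A-Za-z0-9./]+ stops at '?' and leaves the query string behind
sample = "read this https://example.com/page?id=9&ref=tw now"
print(re.sub(r'https?://[A-Za-z0-9./]+', '', sample))  # 'read this ?id=9&ref=tw now'
print(re.sub(r'https?://\S+', '', sample))             # 'read this  now'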

3.5 2.5 Removing Punctuations from text in column clean_text.


Hint: Use the following function
text.translate(str.maketrans('', '', string.punctuation))

[28]: # function to remove punctuation
import string

def remove_punct(text):
    # drop every character listed in string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))

# apply the function
df['clean_text'] = df['text'].apply(remove_punct)
# preview
df.head()

[28]:                                                 text                                           clean_text
0  Also the entire Swiss Football League is on ho…  Also the entire Swiss Football League is on ho…
1  World Health Org Official: Trump’s press confe…  World Health Org Official Trump’s press confer…
2  I mean, Liberals are cheer-leading this #Coron…  I mean Liberals are cheerleading this Coronavi…
3  Under repeated questioning, Pompeo refuses to …  Under repeated questioning Pompeo refuses to s…
4  #coronavirus comments now from @larry_kudlow h…  coronavirus comments now from larrykudlow here…
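
Since the HW asks for regular expressions only, the translate() hint can also be expressed with re.sub and a character class built from string.punctuation; a minimal sketch:

[ ]: # regex-only equivalent of the translate() hint
punct_class = "[" + re.escape(string.punctuation) + "]"
def remove_punct_re(text):
    return re.sub(punct_class, "", text)

print(remove_punct_re("I mean, Liberals are cheer-leading this #Coronavirus!"))
# I mean Liberals are cheerleading this Coronavirus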

3.6 2.6 Extract number of Hashtags in a new column num_hashtags

[31]: # extract hashtags
def extract_hashtags(text):
    hash_tag_pattern = r"#\w+"
    hashtag_list = re.findall(hash_tag_pattern, text)
    # drop the leading '#' from each hashtag
    hashtag_list = [word[1:] for word in hashtag_list]
    return hashtag_list

# apply the function
df['num_hashtags'] = df['text'].apply(extract_hashtags)
# preview
df.tail()

[31]:                                                     text  …                                     num_hashtags
60155  El #coronavirus entérico felino es un virus in…  …  [coronavirus, enfermedades, gatos, veterinaria]
60156  RT @timhquotes: It's my party, you're invited!…  …  [Q, DevilSticks, TimAndEricDotCom, Matthew, Ch…
60157  It's my party, you're invited!\n\nPS, this is …  …  [Q, DevilSticks, TimAndEricDotCom, Matthew, Ch…
60158  Amy’s a survivor! #bariclab #pnnl #movingon #c…  …  [bariclab, pnnl, movingon, coronavirus, bsl3, …
60159  A review of asymptomatic and sub-clinical Midd…  …  [Coronavirus]

[5 rows x 3 columns]
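
A capture group inside the pattern makes re.findall return just the group, so the word[1:] slice becomes unnecessary; an equivalent one-liner:

[ ]: # findall returns only the captured group, i.e. the tag without '#'
print(re.findall(r"#(\w+)", "Amy's a survivor! #bariclab #pnnl"))
# ['bariclab', 'pnnl']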

3.7 2.7 Extract number of user mentions in a new column num_mentions

[32]: # extract all user mentions from the text
def extract_user_mentions(text):
    mention_pattern = r"@\w+"
    user_mentions = re.findall(mention_pattern, text)
    # drop the leading '@' symbol from each mention
    user_mentions = [word[1:] for word in user_mentions]
    return user_mentions

# apply the function
df['num_mentions'] = df['text'].apply(extract_user_mentions)

# preview
df.head()

[32]:                                                 text  …    num_mentions
0  Also the entire Swiss Football League is on ho…  …              []
1  World Health Org Official: Trump’s press confe…  …              []
2  I mean, Liberals are cheer-leading this #Coron…  …              []
3  Under repeated questioning, Pompeo refuses to …  …              []
4  #coronavirus comments now from @larry_kudlow h…  …  [larry_kudlow]

[5 rows x 4 columns]
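
One caveat worth noting: @\w+ also matches the domain part of e-mail addresses. A lookbehind that forbids a word character before the @ avoids this (Twitter handles are also limited to 15 characters); a quick check on a made-up string:

[ ]: # '@\w+' picks up 'gmail' from the e-mail address; the lookbehind does not
sample = "mail me at jane@gmail.com, cc @larry_kudlow"
print(re.findall(r"@(\w+)", sample))              # ['gmail', 'larry_kudlow']
print(re.findall(r"(?<!\w)@(\w{1,15})", sample))  # ['larry_kudlow']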

3.8 2.8 Count number of mentions and hashtags

[33]: # count the mentions
df['total_mentions'] = df['num_mentions'].apply(len)

[36]: # count the total number of hashtags
df['total_hashtags'] = df['num_hashtags'].apply(len)

[37]: # preview the data
df.head()

[37]:                                                 text  …  total_hashtags
0  Also the entire Swiss Football League is on ho…  …               1
1  World Health Org Official: Trump’s press confe…  …               1
2  I mean, Liberals are cheer-leading this #Coron…  …               2
3  Under repeated questioning, Pompeo refuses to …  …               1
4  #coronavirus comments now from @larry_kudlow h…  …               1

[5 rows x 6 columns]
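
As a sanity check, the stored counts should agree with a direct per-row regex count over the raw tweets (assuming no missing values in the text column):

[ ]: # the totals should match a direct regex count over the raw text
assert (df['text'].str.count(r'#\w+') == df['total_hashtags']).all()
assert (df['text'].str.count(r'@\w+') == df['total_mentions']).all()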

[ ]:
