DAV Example

This document outlines a lab exercise for a Data Analysis and Visualization course at Namal University, focusing on analyzing a speech by Theresa May. It details tasks to read a text file, extract words, filter out stopwords, and plot a bar chart of the most frequent words. The document includes Python code snippets for each task and expected outputs.

Uploaded by

Muhammad Shahbaz

Data Analysis and Visualization Lab (CS-352)

Namal University Mianwali


Faculty of Computer Science

Lab 04 – Example Dataset


This exercise uses a .txt file (speech.txt) containing a famous speech given by Theresa May. The aim is to plot a
bar chart of the most frequent words in the speech.

The resulting graph will look similar to the one given below:
[Figure: bar chart of the most frequent words in the speech]

Task 1 Read the file speech.txt using Python and extract each word in the file. (Remember: this list will, and
should, contain duplicates.)
Solution
# Import libraries
import numpy as np
import pandas as pd
import string
import matplotlib.pyplot as plt
%matplotlib inline

# Read the speech.txt file and create a word list
# file_name = "speech.txt"
word_list = []

with open('speech.txt', 'r', encoding='utf-8') as file:
    text = file.read()

words = text.lower().split()

# Remove punctuation characters and numbers from the word list
for word in words:
    # Remove leading/trailing punctuation characters
    word = word.strip(string.punctuation).replace('–', '')

    # Add the cleaned word to the word list
    if word and not word.isdigit():
        word_list.append(word)

print(word_list[:20])  # print the first 20 words

OUTPUT

['thank', 'you', 'today', 'i', 'want', 'to', 'talk', 'about', 'the', 'united', 'kingdom', 'our', 'place', 'in', 'the',
'world', 'and', 'our', 'membership', 'of']
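
The cleaning step can be checked without the speech file by wrapping it in a function and running it on a short string. The following is a minimal sketch of that idea; the helper name `extract_words` and the sample sentence are assumptions made for illustration and are not part of the lab.

```python
import string

def extract_words(text):
    """Lower-case the text, split on whitespace, and strip punctuation.

    Hypothetical helper (not in the lab handout) mirroring the Task 1 loop.
    """
    cleaned = []
    for word in text.lower().split():
        # Strip leading/trailing ASCII punctuation, then drop en dashes
        word = word.strip(string.punctuation).replace('–', '')
        # Skip empty strings and tokens made only of digits
        if word and not word.isdigit():
            cleaned.append(word)
    return cleaned

# Made-up sample sentence, for illustration only
sample = "Thank you. Today, I want to talk – in 2016 – about the world, the WORLD!"
print(extract_words(sample))
```

Note that `strip` only removes punctuation at the ends of a token, so a word such as "don't" keeps its apostrophe, and repeated words stay in the list as the task requires.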
Task 2 Read stopwords.csv using pandas and extract all the stopwords in the file that belong to the English
language.
Solution
# Read the 'stopwords.csv' file as a pandas DataFrame and print its head()
data = pd.read_csv('stopwords.csv')
data.head()

OUTPUT
# Selecting stopwords that belong to English language
english_stopwords = set(data[data['Language'] == 'English']['Words'].tolist())
print(english_stopwords)

OUTPUT

{'was', 're', 'himself', 'after', 'ourselves', 'whom', 'out', 'against', 'has', 'aren', 't', 'we', 'have', 'own',
'it', 'as', 'do', "should've", "aren't", "hadn't", 'where', 'll', 'its', 'their', 'again', 'below', 's', 've', 'the',
'too', 'wouldn', 'more', 'y', 'can', 'are', 'itself', 'needn', 'she', 'having', 'now', 'or', 'at', 'nor', 'about',
"don't", 'very', 'were', 'mustn', "mustn't", 'his', "you're", "wasn't", 'once', 'which', 'doing', 'don',
"mightn't", 'but', "shouldn't", 'ain', 'most', 'then', "isn't", 'with', 'both', 'them', 'weren', 'be', 'had',
'will', 'hasn', 'isn', 'that', "couldn't", 'there', 'only', 'into', 'here', 'does', 'doesn', 'herself', 'me', 'by',
'is', "shan't", 'if', "it's", 'from', 'while', 'theirs', 'wasn', 'of', 'because', 'just', 'm', 'than', 'in', 'haven',
'yourself', 'him', "she's", 'my', 'our', 'did', 'off', 'each', "won't", 'shouldn', 'above', 'other',
'yourselves', 'how', 'they', 'hers', 'ours', 'hadn', 'am', 'her', 'shan', 'such', 'couldn', "haven't", 'those',
'when', 'been', 'for', 'same', "you've", 'between', "didn't", 'why', 'and', 'ma', 'on', 'should', 'under',
"you'll", 'over', 'some', 'further', "needn't", "doesn't", 'myself', 'until', "weren't", 'before', 'any', 'he',
'o', 'didn', 'being', 'not', 'few', 'to', 'up', 'won', 'an', 'down', 'd', "that'll", 'through', "wouldn't", 'you',
'during', 'your', 'what', 'mightn', 'i', 'all', 'yours', 'no', 'so', 'a', 'these', 'themselves', "you'd", 'who',
"hasn't", 'this'}
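
For comparison, the same selection can be sketched with the standard library's `csv.DictReader` instead of pandas. This is only an illustration, not the lab's method: the rows below are made up, the helper name `stopwords_for` is hypothetical, and only the column names `Language` and `Words` are taken from the solution above.

```python
import csv
import io

# Made-up sample rows; only the column names come from the lab solution
sample_csv = """Language,Words
English,the
English,and
French,le
English,the
German,und
"""

def stopwords_for(csv_file, language):
    """Return the set of stopwords listed for one language (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    # A set comprehension removes duplicate entries automatically
    return {row['Words'] for row in reader if row['Language'] == language}

english_stopwords = stopwords_for(io.StringIO(sample_csv), 'English')
print(sorted(english_stopwords))  # sorted only to make the printout stable
```

Storing the result as a set, as the lab solution also does, removes duplicates and makes the membership test in the next task fast.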

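Task 3 Remove all the words extracted from the speech file that are present in the stopwords.
Solution
# Keep only the words that are not English stopwords
word_list = [word for word in word_list if word not in english_stopwords]
print(word_list[:20])  # print the first 20 remaining words

OUTPUT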

['thank', 'today', 'want', 'talk', 'united', 'kingdom', 'place', 'world', 'membership', 'european',
'union', 'start', 'want', 'make', 'clear', 'see', 'rally', 'attack', 'even', 'criticism']
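
The filtering step can be tried on a small example to confirm that it preserves word order and keeps duplicates of the remaining words. The inputs below are made up for illustration and stand in for the real `word_list` and `english_stopwords`.

```python
# Tiny made-up inputs standing in for word_list and english_stopwords
word_list = ['thank', 'you', 'today', 'i', 'want', 'to', 'talk', 'i', 'want']
english_stopwords = {'you', 'i', 'to'}

# Keep every word that is not a stopword; order and duplicates are preserved
filtered = [word for word in word_list if word not in english_stopwords]

print(filtered)
print(len(word_list) - len(filtered), "stopword occurrences removed")
```

Because `english_stopwords` is a set, each `not in` check takes roughly constant time on average, which matters when the word list is long.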
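Task 4 Sort the pruned list of words from the file in order of the highest frequency of occurrence.
Solution
# Count how often each word occurs. The original solution does not show how
# frequency_dict is built; collections.Counter is assumed here.
from collections import Counter
frequency_dict = Counter(word_list)

# Sort the dictionary by highest frequency of occurrence
sorted_word_freq = {k: v for k, v in sorted(frequency_dict.items(), key=lambda item: item[1], reverse=True)}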
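
The counting and sorting can also be done in one step with `collections.Counter` and its `most_common()` method. The sketch below is an alternative offered for illustration, not the lab's own solution, and the word list in it is made up.

```python
from collections import Counter

# Made-up word list standing in for the pruned word_list
word_list = ['want', 'europe', 'british', 'want', 'europe', 'want', 'people']

# Count occurrences of each word
frequency_dict = Counter(word_list)

# most_common() returns (word, count) pairs, highest count first
sorted_pairs = frequency_dict.most_common()

# A regular dict keeps insertion order, so this stays sorted by frequency
sorted_word_freq = dict(sorted_pairs)

print(sorted_pairs)
```

Words with equal counts keep the order in which they were first encountered, which matches the behaviour of a stable `sorted()` call with `reverse=True`.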

Task 5 Plot a bar chart that represents the frequency of the top 15 words of the speech.
Solution
# Separate the sorted dictionary into two lists for easy plotting
keys = [key for key in sorted_word_freq]
values = [sorted_word_freq[key] for key in sorted_word_freq]

# Plot the bar chart with the words on the x-axis and the frequency on the y-axis
plt.figure(figsize=(10, 5))
plt.bar(keys[:15], values[:15])
plt.title("Most Frequent Words in Theresa May's Speech")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.xticks(rotation=90)
plt.show()

OUTPUT
[Figure: bar chart of the 15 most frequent words, with "Words" on the x-axis and "Frequency" on the y-axis]
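
The data-preparation steps of the lab can be combined into a single helper that returns the two lists needed for the bar chart. This is a minimal sketch under stated assumptions: the function name `top_words`, the sample text, and the sample stopword set are all hypothetical, and the plotting call itself is left out.

```python
import string
from collections import Counter

def top_words(text, stopwords, n=15):
    """Return (words, counts) for the n most frequent non-stopwords.

    Hypothetical helper combining Tasks 1 to 4; not part of the lab handout.
    """
    cleaned = []
    for word in text.lower().split():
        # Same cleaning as Task 1: strip punctuation, drop en dashes
        word = word.strip(string.punctuation).replace('–', '')
        # Skip empty tokens, pure numbers, and stopwords (Task 3)
        if word and not word.isdigit() and word not in stopwords:
            cleaned.append(word)
    # Count and sort by frequency, keeping only the top n (Task 4)
    pairs = Counter(cleaned).most_common(n)
    words = [word for word, _ in pairs]
    counts = [count for _, count in pairs]
    return words, counts

# Made-up sample text and stopword set, for illustration only
sample_text = "We want a deal. We want a fair deal, and we want it for Britain."
sample_stopwords = {'we', 'a', 'and', 'it', 'for'}

keys, values = top_words(sample_text, sample_stopwords, n=2)
print(keys, values)
```

The returned `keys` and `values` lists can be passed directly to `plt.bar(keys, values)` in place of the slices used in the solution.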
THE END
