Trends Merged

The document outlines a lab exercise focused on spell correction using a dataset from Project Gutenberg. It details the process of reading a text file, tokenizing words, calculating word probabilities, and generating possible corrections for misspelled words. Additionally, it includes a separate analysis of Google search trends with a dataset that tracks popular queries over the years.


LAB 4
Spell Correction

Dataset link
https://www.kaggle.com/datasets/bittlingmayer/spelling?select=big.txt

In [ ]: # read the document, which is "The Project Gutenberg EBook of The Adventures of Sherlock Holmes"
with open('spell_corrector/big.txt', 'r') as file:
    document = file.read()

# convert words in document to lowercase and tokenize

import re

# pattern to extract words (raw string so \w is not treated as a string escape)
pattern = r"\w+"
words = re.findall(pattern, document.lower())

# print first 10 words


print(words[:10])

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'adventures', 'of', 'sherlock', 'holmes']
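As a side note, the \w+ pattern splits on every non-word character, so contractions and hyphenated words come apart. A minimal sketch on a made-up sample sentence:

import re

# apostrophes and hyphens are not word characters, so they split tokens
sample = "Dr. Watson didn't answer -- he was well-known for that."
print(re.findall(r"\w+", sample.lower()))
# ['dr', 'watson', 'didn', 't', 'answer', 'he', 'was', 'well', 'known', 'for', 'that']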

In [11]: print("Total words in document: ", len(words))

# set of unique words available in document


uniqueWords = set(words)
print("Total unique words in document: ", len(uniqueWords))

Total words in document: 1115585


Total unique words in document: 32198

In [ ]: # dictionary to store word counts


wordCounts = {}

for word in words:
    if word in wordCounts:
        wordCounts[word] += 1
    else:
        wordCounts[word] = 1

# sort the dictionary by value


sortedWordCounts = sorted(wordCounts.items(), key=lambda x: x[1], reverse=True)

# print top 8 words


print(sortedWordCounts[:8])

[('the', 79809), ('of', 40024), ('and', 38312), ('to', 28765), ('in', 22023),
('a', 21124), ('that', 12512), ('he', 12401)]
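The counting loop above can also be written with the standard library's collections.Counter; a minimal equivalent sketch (the name counts is used here to avoid clobbering wordCounts):

from collections import Counter

# Counter builds the same word -> frequency mapping in one pass,
# and most_common() replaces the manual sort by value
counts = Counter(words)
print(counts.most_common(8))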

In [13]: letters = "abcdefghijklmnopqrstuvwxyz"

# generate all words within edit distance 1 (insertions, deletions, substitutions)

def wordsWithinEdits1(word):
    splits = []
    for i in range(len(word) + 1):
        splits.append((word[:i], word[i:]))

    inserts = []
    deletes = []
    substitutes = []

    for left, right in splits:
        # Delete a character
        if right:
            deletes.append(left + right[1:])

        # Substitute a character (assuming substitution cost is 1)
        if right:
            for c in letters:
                substitutes.append(left + c + right[1:])

        # Insert a character
        for c in letters:
            inserts.append(left + c + right)

    return set(deletes + substitutes + inserts)
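Distance-1 edits miss words that are two edits away. A minimal sketch of extending this to edit distance 2 by applying wordsWithinEdits1 to each of its own results (this helper is an addition, not part of the original lab):

def wordsWithinEdits2(word):
    # apply the distance-1 generator to every distance-1 candidate
    return set(e2 for e1 in wordsWithinEdits1(word)
                  for e2 in wordsWithinEdits1(e1))

The candidate set grows quickly with word length, so in practice the results are filtered against uniqueWords before ranking.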

In [19]: # dictionary to store probabilities of each word


probabilities = {}

for word in wordCounts:
    probabilities[word] = wordCounts[word] / len(words)

# print the probability of some words


print("Probability of 'the': ", probabilities['the'])
print("Probability of 'of': ", probabilities['of'])
print("Probability of 'and': ", probabilities['and'])
print("Probability of 'in': ", probabilities['in'])

Probability of 'the': 0.07154004401278254


Probability of 'of': 0.03587714069299964
Probability of 'and': 0.03434251984384874
Probability of 'in': 0.01974121200984237
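One caveat: a word that never occurs in big.txt gets no probability at all, so looking it up would raise a KeyError. A minimal sketch of add-one (Laplace) smoothing, offered here as an assumption layered on top of the lab's simple relative frequencies:

vocab_size = len(uniqueWords)
total = len(words)

def smoothed_probability(word):
    # add-one smoothing: unseen words get a small non-zero probability
    return (wordCounts.get(word, 0) + 1) / (total + vocab_size)

print(smoothed_probability('the'))    # close to the unsmoothed estimate
print(smoothed_probability('zzzqx'))  # small but non-zero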

In [26]: # input word to be corrected

test_word = "thi"

# if word is already spelled correctly
if test_word in uniqueWords:
    print("Word already spelled correctly")
else:
    # get all guesses within edit distance 1
    guesses = wordsWithinEdits1(test_word)
    print("All guesses are: ", guesses)

    # filter guesses that are available in document
    bestGuesses = []
    for guess in guesses:
        if guess in uniqueWords:
            bestGuesses.append(guess)

    print("\n\nGuesses available in document: ", bestGuesses)

    # sort guesses by probability
    probs = {}
    for guess in bestGuesses:
        probs[guess] = probabilities[guess]

    probs = sorted(probs.items(), key=lambda x: x[1], reverse=True)
    print("\n\nProbabilities: ", probs[:5])

    # print corrected word with highest probability
    print("\n\nCorrected word is:", probs[0])

All guesses are: {'tgi', 'thij', 'gthi', 'phi', 'thji', 'thb', 'tqhi', 'uhi', 'tfi',
'thv', 'hi', 'tghi', 'thxi', 'khi', 'thoi', 'tha', 'hthi', 'thh', 'thk', 'tohi',
'ithi', 'othi', 'tuhi', 'tqi', 'zhi', 'thx', 'wthi', 'tfhi', 'thri', 'thyi',
'jhi', 'tdhi', 'tsi', 'txi', 'thia', 'tehi', 'tli', 'fhi', 'tahi', 'sthi', 'tzi',
'thiw', 'thqi', 'qthi', 'thmi', 'yhi', 'thj', 'tkhi', 'thiq', 'tho', 'thq', 'kthi',
'dhi', 'shi', 'thbi', 'tlhi', 'thfi', 'thpi', 'this', 'zthi', 'lthi', 'tdi',
'thci', 'tki', 'tvi', 'ghi', 'bhi', 'ahi', 'thy', 'tti', 'twhi', 'thii', 'thio',
'vhi', 'thc', 'mhi', 'tyhi', 'nhi', 'thz', 'pthi', 'thgi', 'thui', 'tihi', 'th',
'thi', 'ethi', 'cthi', 'thim', 'thf', 'thvi', 'tbi', 'thn', 'thki', 'ehi', 'thih',
'tphi', 'dthi', 'thsi', 'thai', 'tthi', 'tnhi', 'thie', 'tri', 'tci', 'thzi',
'tni', 'tai', 'thd', 'ohi', 'thix', 'vthi', 'ythi', 'thit', 'tht', 'thw', 'thin',
'thiy', 'tui', 'thp', 'tji', 'xthi', 'nthi', 'chi', 'xhi', 'thni', 'thip', 'thib',
'mthi', 'thil', 'thiv', 'thg', 'tei', 'thdi', 'thid', 'tzhi', 'tvhi', 'jthi',
'athi', 'thwi', 'thig', 'rhi', 'thiu', 'tjhi', 'tii', 'lhi', 'bthi', 'tmhi', 'uthi',
'thir', 'the', 'thhi', 'thif', 'ti', 'thu', 'tyi', 'twi', 'hhi', 'thic', 'toi',
'fthi', 'tchi', 'tbhi', 'thei', 'tshi', 'thiz', 'thti', 'thr', 'trhi', 'thik',
'ihi', 'thli', 'tpi', 'qhi', 'rthi', 'tmi', 'txhi', 'ths', 'thm', 'whi', 'thl'}

Guesses available in document: ['hi', 'this', 'thy', 'th', 'thin', 'chi', 'the',
'ti', 'toi']

Probabilities: [('the', 0.07154004401278254), ('this', 0.0036420353446846273),
('thin', 0.00014880085336393012), ('thy', 4.2130362097016364e-05),
('th', 8.963906829152418e-06)]

Corrected word is: ('the', 0.07154004401278254)
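Putting the pieces together, the whole pipeline can be wrapped in a reusable function in the style of Norvig's classic spell corrector. A minimal sketch built from the cells above (the function name correct is an assumption, not part of the original lab):

def correct(word):
    # a known word is already spelled correctly
    if word in uniqueWords:
        return word
    # otherwise rank in-vocabulary candidates within edit distance 1
    candidates = [g for g in wordsWithinEdits1(word) if g in uniqueWords]
    if not candidates:
        return word  # no candidate found; return the word unchanged
    return max(candidates, key=lambda g: probabilities[g])

print(correct("thi"))  # -> 'the'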


Statistics related to words typed in Google search engine

Dataset link:
https://www.kaggle.com/datasets/dhruvildave/google-trends-dataset

In [ ]: import pandas as pd

# read the data


df = pd.read_csv('trends/trends.csv')
df.head()

Out[ ]: location year category rank query

0 Global 2001 Consumer Brands 1 Nokia

1 Global 2001 Consumer Brands 2 Sony

2 Global 2001 Consumer Brands 3 BMW

3 Global 2001 Consumer Brands 4 Palm

4 Global 2001 Consumer Brands 5 Adobe

In [ ]: # check the data types


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26955 entries, 0 to 26954
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 location 26955 non-null object
1 year 26955 non-null int64
2 category 26955 non-null object
3 rank 26955 non-null int64
4 query 26955 non-null object
dtypes: int64(2), object(3)
memory usage: 1.0+ MB

In [ ]: # check for missing values


df.isna().sum()

Out[ ]: location 0
year 0
category 0
rank 0
query 0
dtype: int64

In [ ]: # get unique locations


df.location.unique()


Out[ ]: array(['Global', 'France', 'Germany', 'United Kingdom', 'Australia',


'Canada', 'Italy', 'Netherlands', 'Spain', 'United States',
'Argentina', 'Austria', 'Belgium', 'Brazil', 'Chile', 'China',
'Colombia', 'Czechia', 'Denmark', 'Finland', 'Hong Kong', 'India',
'Malaysia', 'Mexico', 'New Zealand', 'Philippines', 'Poland',
'Russia', 'Singapore', 'South Africa', 'South Korea', 'Sweden',
'Switzerland', 'Taiwan', 'Thailand', 'United Arab Emirates',
'Costa Rica', 'Croatia', 'Dominican Republic', 'Ecuador',
'El Salvador', 'Guatemala', 'Honduras', 'Japan', 'Kenya',
'Nigeria', 'Panama', 'Peru', 'Egypt', 'Hungary', 'Ireland',
'Israel', 'Norway', 'Portugal', 'Romania', 'Saudi Arabia',
'Serbia', 'Slovakia', 'Turkey', 'Ukraine', 'Ghana', 'Indonesia',
'Senegal', 'Uganda', 'Vietnam', 'Bangladesh', 'Bulgaria',
'Estonia', 'Latvia', 'Lithuania', 'Pakistan', 'Puerto Rico',
'Slovenia', 'Uruguay', 'Venezuela', 'Greece', 'Belarus',
'Kazakhstan', 'Sri Lanka', 'Zimbabwe', 'Myanmar (Burma)', 'Kuwait',
'Sudan'], dtype=object)

In [ ]: # get year range


df.year.unique()

Out[ ]: array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020], dtype=int64)

In [ ]: import matplotlib.pyplot as plt

# plot the total searches per year using a line plot


yearly_search_counts = df.groupby(['year'])['query'].count()
plt.figure(figsize=(8, 5))
yearly_search_counts.plot(kind='line', marker='o')
plt.title('Total Searches per Year')
plt.xlabel('Year')
plt.ylabel('Number of Searches')
plt.grid()
plt.show()


In [ ]: # get top 5 searched consumer brands per year


top_brands = df[df['category'] == 'Consumer Brands'].groupby('year')['query'].apply(lambda x: x.head(5))
print("\nTop 5 Consumer Brands per year:")
print(top_brands)

Top 5 Consumer Brands per year:


year
2001 0 Nokia
1 Sony
2 BMW
3 Palm
4 Adobe
2002 65 Ferrari
66 Sony
67 Nokia
68 Disney
69 IKEA
2003 175 Ferrari
176 Sony
177 BMW
178 Disney
179 Ryanair
2004 340 eBay
341 Walmart
342 MapQuest
343 Amazon
344 Home Depot
Name: query, dtype: object

In [ ]: # get top 5 searched movies per year


top_movies = df[df['category'] == 'Movies'].groupby('year')['query'].apply(lambda x: x.head(5))
print("\nTop Movies per year:")
print(top_movies)


Top Movies per year:


year
2001 10 Harry Potter
11 Lord of the Rings
12 Final Fantasy
13 Tomb Raider
14 Shrek
2002 80 Spiderman
81 Harry Potter
82 Star Wars
83 Jackass
84 Scooby Doo
2011 2425 Destino final 5
2426 Rio
2427 Perras
2428 Cowboys Vs Aliens
2429 Transformers 3
2012 3265 Batman Asciende
3266 Hotel Transylvania
3267 MIB 3
3268 La Era del Hielo 4
3269 Los Vengadores
2013 5180 Man of Steel
5181 Iron Man 3
5182 World War Z
5183 Django Unchained
5184 Despicable Me 2
2014 8355 Frozen
8356 Interstellar
8357 Divergent
8358 Godzilla
8359 Gone Girl
2015 10390 Jurassic World
10391 Furious 7
10392 American Sniper
10393 Fifty Shade of Grey
10394 Minions
2016 13190 Deadpool
13191 Suicide Squad
13192 The Revenant
13193 Captain America Civil War
13194 Batman v Superman
2017 16250 IT
16251 Wonder Woman
16252 Beauty and the Beast
16253 Logan
16254 Justice League
2018 18840 Black Panther
18841 Deadpool 2
18842 Venom
18843 Avengers: Infinity War
18844 Bohemian Rhapsody
2019 21395 Avengers: Endgame
21396 Joker
21397 Captain Marvel
21398 Toy Story 4
21399 Aquaman
2020 24015 Parasite
24016 1917
24017 Black Panther
24018 365 Dni
24019 Contagion
Name: query, dtype: object

In [36]: # get search trends related to natural disasters


event_query = df[df['query'].str.contains('Earthquake|Wildfire|Hurricane|Storm', case=False)]
event_trend = event_query.groupby('year')['query'].count()

plt.figure(figsize=(10, 5))
event_trend.plot(kind='line', marker='o', color='red')
plt.title('Searches Related to Disasters Over the Years')
plt.xlabel('Year')
plt.ylabel('Number of Searches')
plt.grid()
plt.show()

In [ ]: # get search trends related to Trump


event_query = df[df['query'].str.contains('Trump', case=False)]
event_trend = event_query.groupby('year')['query'].count()

plt.figure(figsize=(10, 5))
event_trend.plot(kind='line', marker='o', color='red')
plt.title('Searches Related to Trump Over the Years')
plt.xlabel('Year')
plt.ylabel('Number of Searches')
plt.grid()
plt.show()


In [38]: # get top 10 searched categories in India


top_10_india = df[df['location'] == 'India'].groupby('category')['query'].count().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
top_10_india.plot(kind='bar' , color="red")
plt.title('Top Categories in India')
plt.xlabel('Category')
plt.ylabel('Number of Searches')
plt.xticks(rotation=45)
plt.grid()
plt.show()

In [39]: # get top 10 searched queries in India for the year 2020
year = 2020
location = 'India'

top_10_queries_india = df[(df['year'] == year) & (df['location'] == location)]['query'].value_counts().head(10)
print("Top 10 search queries in", year, "and", location + ":")
print(top_10_queries_india)

# get top 10 searched queries in United States for the year 2020
year = 2020
location = 'United States'
top_10_queries_us = df[(df['year'] == year) & (df['location'] == location)]['query'].value_counts().head(10)
print("\nTop 10 search queries in", year, "and", location + ":")
print(top_10_queries_us)

Top 10 search queries in 2020 and India:


query
Indian Premier League 3
Coronavirus 2
La Liga 1
Joe Biden 1
Arnab Goswami 1
Kanika Kapoor 1
Kim Jong-un 1
Amitabh Bachchan 1
UEFA Champions League 1
English Premier League 1
Name: count, dtype: int64

Top 10 search queries in 2020 and United States:


query
Election results 2
Joe Biden 2
How to style curtain bangs 2
WAP 2
Kamala Harris 2
Ryan Newman 2
Coronavirus 2
Kobe Bryant 2
Where to buy Xbox Series X 1
Where is my refunds 1
Name: count, dtype: int64

In [ ]: # get top words searched in the query column for the year 2020

year = 2020
top_words = df[df['year'] == year]['query'].str.split().explode().value_counts().head(10)
print("Top words searched in the query column for the year", year, ":")

print(top_words)

Top words searched in the query column for the year 2020 :
query
to 108
Coronavirus 94
How 89
de 82
2020 58
Kobe 48
Joe 48
Bryant 46
en 46
el 45
Name: count, dtype: int64
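The top words are dominated by function words ('to', 'How', 'de', 'en', 'el'), since the queries span several languages. A minimal sketch of filtering a small hand-picked stopword list before counting (the stopword set here is an illustrative assumption, not part of the original lab):

# small illustrative stopword list covering the English and Spanish
# function words visible in the output above
stopwords = {'to', 'how', 'de', 'en', 'el', 'the', 'of', 'is', 'a', 'la'}

words_2020 = df[df['year'] == 2020]['query'].str.split().explode()
content_words = words_2020[~words_2020.str.lower().isin(stopwords)]
print(content_words.value_counts().head(10))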
