Trends Merged

The document outlines a lab exercise focused on spell correction using a dataset from Project Gutenberg. It details the process of reading a text file, tokenizing words, calculating word probabilities, and generating possible corrections for misspelled words. Additionally, it includes a separate analysis of Google search trends with a dataset that tracks popular queries over the years.


LAB 4
Spell Correction

Dataset link
https://www.kaggle.com/datasets/bittlingmayer/spelling?select=big.txt

In [ ]: # read the document, which is "The Project Gutenberg EBook of The Adventures of Sherlock Holmes"
with open('spell_corrector/big.txt', 'r') as file:
    document = file.read()

# convert words in document to lowercase and tokenize

import re

# pattern to extract words (raw string so \w is not treated as a string escape)
pattern = r"\w+"
words = re.findall(pattern, document.lower())

# print first 10 words


print(words[:10])

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'adventures', 'of', 'sherlock', 'holmes']
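As a side note, the \w+ pattern splits on every non-word character, so contractions and hyphenated words come apart. A minimal sketch on a made-up sample sentence:

import re

# apostrophes and hyphens are not word characters, so they split tokens
sample = "Dr. Watson didn't answer -- he was well-known for that."
print(re.findall(r"\w+", sample.lower()))
# ['dr', 'watson', 'didn', 't', 'answer', 'he', 'was', 'well', 'known', 'for', 'that']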

In [11]: print("Total words in document: ", len(words))

# set of unique words available in document


uniqueWords = set(words)
print("Total unique words in document: ", len(uniqueWords))

Total words in document: 1115585


Total unique words in document: 32198

In [ ]: # dictionary to store word counts


wordCounts = {}

for word in words:
    if word in wordCounts:
        wordCounts[word] += 1
    else:
        wordCounts[word] = 1

# sort the dictionary by value


sortedWordCounts = sorted(wordCounts.items(), key=lambda x: x[1], reverse=True)

# print top 8 words


print(sortedWordCounts[:8])

[('the', 79809), ('of', 40024), ('and', 38312), ('to', 28765), ('in', 22023),
('a', 21124), ('that', 12512), ('he', 12401)]
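The counting loop above can also be written with the standard library's collections.Counter; a minimal equivalent sketch (the name counts is used here to avoid clobbering wordCounts):

from collections import Counter

# Counter builds the same word -> frequency mapping in one pass,
# and most_common() replaces the manual sort by value
counts = Counter(words)
print(counts.most_common(8))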

In [13]: letters = "abcdefghijklmnopqrstuvwxyz"

# generate all words within edit distance 1 (insertions, deletions, substitutions)

def wordsWithinEdits1(word):
    splits = []
    for i in range(len(word) + 1):
        splits.append((word[:i], word[i:]))

    inserts = []
    deletes = []
    substitutes = []

    for left, right in splits:
        # Delete a character
        if right:
            deletes.append(left + right[1:])

        # Substitute a character (assuming substitution cost is 1)
        if right:
            for c in letters:
                substitutes.append(left + c + right[1:])

        # Insert a character
        for c in letters:
            inserts.append(left + c + right)

    return set(deletes + substitutes + inserts)
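Distance-1 edits miss words that are two edits away. A minimal sketch of extending this to edit distance 2 by applying wordsWithinEdits1 to each of its own results (this helper is an addition, not part of the original lab):

def wordsWithinEdits2(word):
    # apply the distance-1 generator to every distance-1 candidate
    return set(e2 for e1 in wordsWithinEdits1(word)
                  for e2 in wordsWithinEdits1(e1))

The candidate set grows quickly with word length, so in practice the results are filtered against uniqueWords before ranking.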

In [19]: # dictionary to store probabilities of each word


probabilities = {}

for word in wordCounts:
    probabilities[word] = wordCounts[word] / len(words)

# print the probability of some words


print("Probability of 'the': ", probabilities['the'])
print("Probability of 'of': ", probabilities['of'])
print("Probability of 'and': ", probabilities['and'])
print("Probability of 'in': ", probabilities['in'])

Probability of 'the': 0.07154004401278254


Probability of 'of': 0.03587714069299964
Probability of 'and': 0.03434251984384874
Probability of 'in': 0.01974121200984237
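One caveat: a word that never occurs in big.txt gets no probability at all, so looking it up would raise a KeyError. A minimal sketch of add-one (Laplace) smoothing, offered here as an assumption layered on top of the lab's simple relative frequencies:

vocab_size = len(uniqueWords)
total = len(words)

def smoothed_probability(word):
    # add-one smoothing: unseen words get a small non-zero probability
    return (wordCounts.get(word, 0) + 1) / (total + vocab_size)

print(smoothed_probability('the'))    # close to the unsmoothed estimate
print(smoothed_probability('zzzqx'))  # small but non-zero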

In [26]: # input word to be corrected

test_word = "thi"

# if word is already spelled correctly
if test_word in uniqueWords:
    print("Word already spelled correctly")
else:
    # get all guesses within edit distance 1
    guesses = wordsWithinEdits1(test_word)
    print("All guesses are: ", guesses)

    # filter guesses that are available in document
    bestGuesses = []
    for guess in guesses:
        if guess in uniqueWords:
            bestGuesses.append(guess)

    print("\n\nGuesses available in document: ", bestGuesses)

    # sort guesses by probability
    probs = {}
    for guess in bestGuesses:
        probs[guess] = probabilities[guess]

    probs = sorted(probs.items(), key=lambda x: x[1], reverse=True)
    print("\n\nProbabilities: ", probs[:5])

    # print corrected word with highest probability
    print("\n\nCorrected word is:", probs[0])

All guesses are: {'tgi', 'thij', 'gthi', 'phi', 'thji', 'thb', 'tqhi', 'uhi', 'tfi',
'thv', 'hi', 'tghi', 'thxi', 'khi', 'thoi', 'tha', 'hthi', 'thh', 'thk', 'tohi',
'ithi', 'othi', 'tuhi', 'tqi', 'zhi', 'thx', 'wthi', 'tfhi', 'thri', 'thyi',
'jhi', 'tdhi', 'tsi', 'txi', 'thia', 'tehi', 'tli', 'fhi', 'tahi', 'sthi', 'tzi',
'thiw', 'thqi', 'qthi', 'thmi', 'yhi', 'thj', 'tkhi', 'thiq', 'tho', 'thq', 'kthi',
'dhi', 'shi', 'thbi', 'tlhi', 'thfi', 'thpi', 'this', 'zthi', 'lthi', 'tdi',
'thci', 'tki', 'tvi', 'ghi', 'bhi', 'ahi', 'thy', 'tti', 'twhi', 'thii', 'thio',
'vhi', 'thc', 'mhi', 'tyhi', 'nhi', 'thz', 'pthi', 'thgi', 'thui', 'tihi', 'th',
'thi', 'ethi', 'cthi', 'thim', 'thf', 'thvi', 'tbi', 'thn', 'thki', 'ehi', 'thih',
'tphi', 'dthi', 'thsi', 'thai', 'tthi', 'tnhi', 'thie', 'tri', 'tci', 'thzi',
'tni', 'tai', 'thd', 'ohi', 'thix', 'vthi', 'ythi', 'thit', 'tht', 'thw', 'thin',
'thiy', 'tui', 'thp', 'tji', 'xthi', 'nthi', 'chi', 'xhi', 'thni', 'thip', 'thib',
'mthi', 'thil', 'thiv', 'thg', 'tei', 'thdi', 'thid', 'tzhi', 'tvhi', 'jthi',
'athi', 'thwi', 'thig', 'rhi', 'thiu', 'tjhi', 'tii', 'lhi', 'bthi', 'tmhi', 'uthi',
'thir', 'the', 'thhi', 'thif', 'ti', 'thu', 'tyi', 'twi', 'hhi', 'thic', 'toi',
'fthi', 'tchi', 'tbhi', 'thei', 'tshi', 'thiz', 'thti', 'thr', 'trhi', 'thik',
'ihi', 'thli', 'tpi', 'qhi', 'rthi', 'tmi', 'txhi', 'ths', 'thm', 'whi', 'thl'}

Guesses available in document: ['hi', 'this', 'thy', 'th', 'thin', 'chi', 'the',
'ti', 'toi']

Probabilities: [('the', 0.07154004401278254), ('this', 0.0036420353446846273),
('thin', 0.00014880085336393012), ('thy', 4.2130362097016364e-05),
('th', 8.963906829152418e-06)]

Corrected word is: ('the', 0.07154004401278254)
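Putting the pieces together, the whole pipeline can be wrapped in a reusable function in the style of Norvig's classic spell corrector. A minimal sketch built from the cells above (the function name correct is an assumption, not part of the original lab):

def correct(word):
    # a known word is already spelled correctly
    if word in uniqueWords:
        return word
    # otherwise rank in-vocabulary candidates within edit distance 1
    candidates = [g for g in wordsWithinEdits1(word) if g in uniqueWords]
    if not candidates:
        return word  # no candidate found; return the word unchanged
    return max(candidates, key=lambda g: probabilities[g])

print(correct("thi"))  # -> 'the'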


Statistics related to words typed in Google search engine

Dataset link:
https://www.kaggle.com/datasets/dhruvildave/google-trends-dataset

In [ ]: import pandas as pd

# read the data


df = pd.read_csv('trends/trends.csv')
df.head()

Out[ ]: location year category rank query

0 Global 2001 Consumer Brands 1 Nokia

1 Global 2001 Consumer Brands 2 Sony

2 Global 2001 Consumer Brands 3 BMW

3 Global 2001 Consumer Brands 4 Palm

4 Global 2001 Consumer Brands 5 Adobe

In [ ]: # check the data types


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26955 entries, 0 to 26954
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 location 26955 non-null object
1 year 26955 non-null int64
2 category 26955 non-null object
3 rank 26955 non-null int64
4 query 26955 non-null object
dtypes: int64(2), object(3)
memory usage: 1.0+ MB

In [ ]: # check for missing values


df.isna().sum()

Out[ ]: location 0
year 0
category 0
rank 0
query 0
dtype: int64

In [ ]: # get unique locations


df.location.unique()


Out[ ]: array(['Global', 'France', 'Germany', 'United Kingdom', 'Australia',


'Canada', 'Italy', 'Netherlands', 'Spain', 'United States',
'Argentina', 'Austria', 'Belgium', 'Brazil', 'Chile', 'China',
'Colombia', 'Czechia', 'Denmark', 'Finland', 'Hong Kong', 'India',
'Malaysia', 'Mexico', 'New Zealand', 'Philippines', 'Poland',
'Russia', 'Singapore', 'South Africa', 'South Korea', 'Sweden',
'Switzerland', 'Taiwan', 'Thailand', 'United Arab Emirates',
'Costa Rica', 'Croatia', 'Dominican Republic', 'Ecuador',
'El Salvador', 'Guatemala', 'Honduras', 'Japan', 'Kenya',
'Nigeria', 'Panama', 'Peru', 'Egypt', 'Hungary', 'Ireland',
'Israel', 'Norway', 'Portugal', 'Romania', 'Saudi Arabia',
'Serbia', 'Slovakia', 'Turkey', 'Ukraine', 'Ghana', 'Indonesia',
'Senegal', 'Uganda', 'Vietnam', 'Bangladesh', 'Bulgaria',
'Estonia', 'Latvia', 'Lithuania', 'Pakistan', 'Puerto Rico',
'Slovenia', 'Uruguay', 'Venezuela', 'Greece', 'Belarus',
'Kazakhstan', 'Sri Lanka', 'Zimbabwe', 'Myanmar (Burma)', 'Kuwait',
'Sudan'], dtype=object)

In [ ]: # get year range


df.year.unique()

Out[ ]: array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020], dtype=int64)

In [ ]: import matplotlib.pyplot as plt

# plot the total searches per year using a line plot


yearly_search_counts = df.groupby(['year'])['query'].count()
plt.figure(figsize=(8, 5))
yearly_search_counts.plot(kind='line', marker='o')
plt.title('Total Searches per Year')
plt.xlabel('Year')
plt.ylabel('Number of Searches')
plt.grid()
plt.show()


In [ ]: # get top 5 searched consumer brands per year


top_brands = df[df['category'] == 'Consumer Brands'].groupby('year')['query'].apply(lambda x: x.head(5))
print("\nTop 5 Consumer Brands per year:")
print(top_brands)

Top 5 Consumer Brands per year:


year
2001 0 Nokia
1 Sony
2 BMW
3 Palm
4 Adobe
2002 65 Ferrari
66 Sony
67 Nokia
68 Disney
69 IKEA
2003 175 Ferrari
176 Sony
177 BMW
178 Disney
179 Ryanair
2004 340 eBay
341 Walmart
342 MapQuest
343 Amazon
344 Home Depot
Name: query, dtype: object

In [ ]: # get top 5 searched movies per year


top_movies = df[df['category'] == 'Movies'].groupby('year')['query'].apply(lambda x: x.head(5))
print("\nTop Movies per year:")
print(top_movies)


Top Movies per year:


year
2001 10 Harry Potter
11 Lord of the Rings
12 Final Fantasy
13 Tomb Raider
14 Shrek
2002 80 Spiderman
81 Harry Potter
82 Star Wars
83 Jackass
84 Scooby Doo
2011 2425 Destino final 5
2426 Rio
2427 Perras
2428 Cowboys Vs Aliens
2429 Transformers 3
2012 3265 Batman Asciende
3266 Hotel Transylvania
3267 MIB 3
3268 La Era del Hielo 4
3269 Los Vengadores
2013 5180 Man of Steel
5181 Iron Man 3
5182 World War Z
5183 Django Unchained
5184 Despicable Me 2
2014 8355 Frozen
8356 Interstellar
8357 Divergent
8358 Godzilla
8359 Gone Girl
2015 10390 Jurassic World
10391 Furious 7
10392 American Sniper
10393 Fifty Shade of Grey
10394 Minions
2016 13190 Deadpool
13191 Suicide Squad
13192 The Revenant
13193 Captain America Civil War
13194 Batman v Superman
2017 16250 IT
16251 Wonder Woman
16252 Beauty and the Beast
16253 Logan
16254 Justice League
2018 18840 Black Panther
18841 Deadpool 2
18842 Venom
18843 Avengers: Infinity War
18844 Bohemian Rhapsody
2019 21395 Avengers: Endgame
21396 Joker
21397 Captain Marvel
21398 Toy Story 4
21399 Aquaman
2020 24015 Parasite
24016 1917
24017 Black Panther
24018 365 Dni
24019 Contagion
Name: query, dtype: object

In [36]: # get search trends related to natural disasters


event_query = df[df['query'].str.contains('Earthquake|Wildfire|Hurricane|Storm', case=False)]
event_trend = event_query.groupby('year')['query'].count()

plt.figure(figsize=(10, 5))
event_trend.plot(kind='line', marker='o', color='red')
plt.title('Searches Related to Disasters Over the Years')
plt.xlabel('Year')
plt.ylabel('Number of Searches')
plt.grid()
plt.show()

In [ ]: # get search trends related to Trump


event_query = df[df['query'].str.contains('Trump', case=False)]
event_trend = event_query.groupby('year')['query'].count()

plt.figure(figsize=(10, 5))
event_trend.plot(kind='line', marker='o', color='red')
plt.title('Searches Related to Trump Over the Years')
plt.xlabel('Year')
plt.ylabel('Number of Searches')
plt.grid()
plt.show()


In [38]: # get top 10 searched categories in India


top_10_india = df[df['location'] == 'India'].groupby('category')['query'].count().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
top_10_india.plot(kind='bar' , color="red")
plt.title('Top Categories in India')
plt.xlabel('Category')
plt.ylabel('Number of Searches')
plt.xticks(rotation=45)
plt.grid()
plt.show()

In [39]: # get top 10 searched queries in India for the year 2020
year = 2020
location = 'India'

top_10_queries_india = df[(df['year'] == year) & (df['location'] == location)]['query'].value_counts().head(10)
print("Top 10 search queries in", year, "and", location + ":")
print(top_10_queries_india)

# get top 10 searched queries in United States for the year 2020
year = 2020
location = 'United States'
top_10_queries_us = df[(df['year'] == year) & (df['location'] == location)]['query'].value_counts().head(10)
print("\nTop 10 search queries in", year, "and", location + ":")
print(top_10_queries_us)

Top 10 search queries in 2020 and India:


query
Indian Premier League 3
Coronavirus 2
La Liga 1
Joe Biden 1
Arnab Goswami 1
Kanika Kapoor 1
Kim Jong-un 1
Amitabh Bachchan 1
UEFA Champions League 1
English Premier League 1
Name: count, dtype: int64

Top 10 search queries in 2020 and United States:


query
Election results 2
Joe Biden 2
How to style curtain bangs 2
WAP 2
Kamala Harris 2
Ryan Newman 2
Coronavirus 2
Kobe Bryant 2
Where to buy Xbox Series X 1
Where is my refunds 1
Name: count, dtype: int64

In [ ]: # get top words searched in the query column for the year 2020

year = 2020
top_words = df[df['year'] == year]['query'].str.split().explode().value_counts().head(10)
print("Top words searched in the query column for the year", year, ":")

print(top_words)

Top words searched in the query column for the year 2020 :
query
to 108
Coronavirus 94
How 89
de 82
2020 58
Kobe 48
Joe 48
Bryant 46
en 46
el 45
Name: count, dtype: int64
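The top words are dominated by function words ('to', 'How', 'de', 'en', 'el'), since the queries span several languages. A minimal sketch of filtering a small hand-picked stopword list before counting (the stopword set here is an illustrative assumption, not part of the original lab):

# small illustrative stopword list covering the English and Spanish
# function words visible in the output above
stopwords = {'to', 'how', 'de', 'en', 'el', 'the', 'of', 'is', 'a', 'la'}

words_2020 = df[df['year'] == 2020]['query'].str.split().explode()
content_words = words_2020[~words_2020.str.lower().isin(stopwords)]
print(content_words.value_counts().head(10))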
