
DEPARTMENT OF COMPUTER ENGINEERING

A Laboratory Manual for


Natural Language Processing Lab (CSDL7013)
ACADEMIC YEAR: 2024-25

Course Name: Natural Language Processing Lab

Course Code: CSDL7013

Name: __________________________________________________________
Semester: VII (Seventh) Roll No.: ___________________

Div.: ____________________________ Exam. Seat No.: _____________

Email ID: _________________________ Mobile No.: _________________


DEPARTMENT OF COMPUTER ENGINEERING

VISION AND MISSION


Institution's Vision: To be a world class institute and a front runner in educational and socioeconomic development of the nation by providing high quality technical education to students from all sections of society.

Institution's Mission: To provide superior learning experiences in a caring and conducive environment so as to empower students to be successful in life & contribute positively to society.

Quality Policy: We, at SHREE L. R. TIWARI COLLEGE OF ENGINEERING, shall dedicate and strive hard to continuously achieve academic excellence in the field of Engineering and to produce the most competent Engineers through objective & innovative teaching methods, consistent updating of facilities, welfare & quality improvement of the faculty & a system of continual process improvement.

Computer Engineering Department's


Vision: To be a department of high repute focused on quality education, training and skill development in the field of computer engineering to prepare professionals and entrepreneurs of high calibre with human values to serve our nation and globe.

Mission:
M1: To develop technical, analytical and theoretical competencies, managerial skills and practical exposure.
M2: Overall development of students, faculty and staff by providing an encouraging environment and infrastructure for learning, skill development and research.
M3: To strengthen versatility, adaptability and the pursuit of excellence amongst students, with the highest ethical values as their core strength.

Program Educational Objectives:
PEO-1: Be employed in industry, government, or entrepreneurial endeavours and demonstrate professional advancement through significant technical achievements and expanded leadership responsibility by exhibiting an ethical attitude and good communication skills.
PEO-2: Demonstrate the ability to work effectively as a team member and/or leader in an ever-changing professional environment.
PEO-3: Pursue higher studies, engage in professional development, research and entrepreneurship, and adapt to emerging technologies.

_______________
Student’s Signature
DEPARTMENT OF COMPUTER ENGINEERING

Certificate

This is to certify that Mr. /Ms.________________________________________

Class ________________ Roll No. __________ Exam Seat No. ___________ of

Seventh Semester of Degree in Computer Engineering has completed the

required number of Practicals / Term Work / Sessionals in the subject Natural

Language Processing Lab from the Department of Computer Engineering

during the academic year of 2024-2025 as prescribed in the curriculum.

Lecturer in-Charge Head of the Department Principal


Date:

Seal of
Institution
INSTRUCTION FOR STUDENTS

Students shall read the points given below for understanding the theoretical concepts and
practical applications.
1) Listen carefully to the lecture given by the teacher about the importance of the subject, curriculum philosophy, learning structure, skills to be developed, information about equipment, instruments, procedure, method of continuous assessment, tentative plan of work in the laboratory and the total amount of work to be done in a semester.
2) Student shall undergo study visit of the laboratory for types of equipment, instruments,
software to be used, before performing experiments.
3) Read the write up of each experiment to be performed, a day in advance.
4) Organize the work in the group and make a record of all observations.
5) Understand the purpose of experiment and its practical implications.
6) Write the answers of the questions allotted by the teacher during practical hours if
possible or afterwards, but immediately.
7) Student should not hesitate to ask any difficulty faced during conduct of
practical/exercise.
8) The student shall study all the questions given in the laboratory manual and practice to
write the answers to these questions.
9) Student shall develop maintenance skills as expected by the industries.
10) Student should develop the habit of pocket discussion/group discussion related to the
experiments/exercises so that exchanges of knowledge/skills could take place.
11) Student shall attempt to develop related hands-on-skills and gain confidence.
12) Student shall focus on development of skills rather than theoretical or codified
knowledge.
13) Student shall visit the nearby workshops, workstation, industries, laboratories, technical
exhibitions, trade fair etc. even not included in the Lab manual. In short, students should
have exposure to the area of work right in the student hood.
14) Student shall insist for the completion of recommended laboratory work, industrial
visits, answers to the given questions, etc.
15) Student shall develop the habit of evolving more ideas, innovations, skills etc. beyond those included in the scope of the manual.
16) Student shall refer technical magazines, proceedings of the seminars, refer websites
related to the scope of the subjects and update his knowledge and skills.
17) Student should develop the habit of not depending totally on teachers but of developing self-learning techniques.
18) Student should develop the habit of interacting with the teacher without hesitation with respect to the academics involved.
19) Student should develop habit to submit the practicals, exercise continuously and
progressively on the scheduled dates and should get the assessment done.
20) Student should be well prepared while submitting the write up of the exercise. This will
develop the continuity of the studies and he/she will not be over loaded at the end of the
term.
GUIDELINES FOR TEACHERS

Teachers shall discuss the following points with students before start of practicals of the subject.
1) Learning Overview: To develop better understanding of importance of the subject. To
know related skills to be developed such as Intellectual skills and Motor skills.
2) Learning Structure: In this, topic and sub topics are organized in systematic way so that
ultimate purpose of learning the subject is achieved. This is arranged in the form of fact,
concept, principle, procedure, application and problem.
3) Know your Laboratory Work: To understand the layout of the laboratory, specifications of equipment/instruments/materials, procedure, working in groups, planning time etc. Also to know the total amount of work to be done in the laboratory.
4) Teacher shall ensure that the required equipment is in working condition before the start of the experiment, and also keep the operating instruction manual available.
5) Explain prior concepts to the students before starting each experiment.
6) Involve students actively at the time of conduct of each experiment.
7) While taking reading/observation each student shall be given a chance to perform or
observe the experiment.
8) If the experimental set up has variations in the specifications of the equipment, the
teachers are advised to make the necessary changes, wherever needed.
9) Teacher shall assess the performance of students continuously as per norms prescribed
by university of Mumbai and guidelines provided by IQAC.
10) Teacher should ensure that the respective skills and competencies are developed in the students after the completion of the practical exercise.
11) Teacher is expected to communicate the skills and competencies that are to be developed in the students.
12) Teacher may provide additional knowledge and skills to the students even though not
covered in the manual but are expected from the students by the industries.
13) Teachers shall ensure that industrial visits if recommended in the manual are covered.
14) Teacher may suggest the students to refer additional related literature of the Technical
papers/Reference books/Seminar proceedings, etc.
15) During assessment teacher is expected to ask questions to the students to tap their
achievements regarding related knowledge and skills so that students can prepare while
submitting record of the practicals. Focus should be given on development of enlisted
skills rather than theoretical /codified knowledge.
16) Teacher should enlist the skills to be developed in the students that are expected by the
industry.
17) Teacher should organize Group discussions /brain storming sessions / Seminars to
facilitate the exchange of knowledge amongst the students.
18) Teacher should ensure that revised assessment norms are followed simultaneously and
progressively.
19) Teacher should give more focus on hands on skills and should actually share the same.
20) Teacher shall also refer to the circulars related to practical supervision and assessment for additional guidelines.
DEPARTMENT OF COMPUTER ENGINEERING
Student’s Progress Assessments
Student Name: __________________________________ Roll No.: ______________________
Class/Semester: BE CS/SEM-VII Academic Year: 2024-25
Course Name: Natural Language Processing Lab Course Code: CSDL7013
Assessment Parameters for Practicals/Assignments
Columns: Exp. No. | Title of Experiment | Criteria for Grading: PE, KT, DR, DN, PL (each out of 3) | Total (out of 15) | Average (out of 5) | Covered COs

1. To study and implement pre-processing of texts.
2. To study and implement pre-processing of documents.
3. Perform morphological analysis and word generation for any given text.
4. To generate N-grams from sentences for English and any Indian Language.
5. Perform POS tagging for English and Hindi using a tagger.
6. Perform Chunking of text in English Language.
7. Perform Named Entity Recognition for English Language.
8. Perform top-down and bottom-up parsing using CFG for English Language.
9. To implement a text similarity recognizer using NLP techniques.
10. To study and implement the concept of WORDNET.
11. Mini-Project
Average Marks

Criteria for Grading – Preparedness and Efforts(PE), Knowledge of tools(KT), Debugging and results(DR),
Documentation(DN), Punctuality & Lab Ethics(PL)
Columns: Assignments | Criteria for Grading: TS, OM, NT, IS (each out of 3) | Total (out of 12) | Average (out of 5) | Covered COs

Assignment No. 1
Assignment No. 2
Assignment No. 3
Assignment No. 4
Assignment No. 5
Average

Criteria for Grading –Timely submission(TS), Originality of the material(OM), Neatness(NT), Innovative solution(IS)
Grades – Meet Expectations (3 Marks), Moderate Expectations (2 Marks), Below Expectations (1 Mark)

_______________ _________________ _______________


Student’s Signature Subject In-charge Head of Department
DEPARTMENT OF COMPUTER ENGINEERING
RECORD OF PROGRESSIVE ASSESSMENTS

Student Name: __________________________________Roll No.: ________ (BE CS SEM.-VII)


Course Name : Natural Language Processing Laboratory Course Code: CSDL7013
Assessment of Experiments (A)
Columns: Sr. No. | Name of Experiment | Page No. | Date of Performance | Date of Submission | Assessment (out of 15) | Teacher's Signature and Remark | CO Covered

1. To study and implement pre-processing of texts.
2. To study and implement pre-processing of documents.
3. Perform morphological analysis and word generation for any given text.
4. To generate N-grams from sentences for English and any Indian Language.
5. Perform POS tagging for English and Hindi using a tagger.
6. Perform Chunking of text in English Language.
7. Perform Named Entity Recognition for English Language.
8. Perform top-down and bottom-up parsing using CFG for English Language.
9. To implement a text similarity recognizer using NLP techniques.
10. To study and implement the concept of WORDNET.
11. Mini-Project
Average Marks (Out of 10)

Assessment of Assignments (B)


DEPARTMENT OF COMPUTER ENGINEERING

Columns: Sr. No. | Assignment | Page No. | Date of Display | Date of Completion | Assessment (Out of 12) | Teacher's Signature and Remark | CO Covered
1 Assignment No.1
2 Assignment No.2
3 Assignment No. 3
4 Assignment No. 4
5 Assignment No. 5
Average Marks (Out of 12)
Converted Marks (Out of 5) (B)
Assessment of Mini-Project (C)

Columns: Sr. No. | Mini-Project | Page No. | Date of Display | Date of Completion | Assessment (Out of 18) | Teacher's Signature and Remark | CO Covered

1 Mini_Project
Average Marks (Out of 18)
Converted Marks (Out of 5) (C)

Assessments of Attendance (D)


Columns: Natural Language Processing Theory Attendance — TH (out of) | TH attended | TH %; Natural Language Processing Laboratory Attendance — PR (out of) | PR attended | PR %; Avg. Attendance % (TH+PR) | Attendance Marks (Out of 5) (D)

Total Term Work Marks: A+B+C+D = _________ (Out of 25)

_______________ _________________ _______________


Student Signature Subject In-charge Head of the Department
DEPARTMENT OF COMPUTER ENGINEERING
Programme Outcome (PO & PSOs)
Programme Outcomes are the skills and knowledge which the students have at the time of graduation. This will indicate
what student can do from subject-wise knowledge acquired during the programme.
Each PO is listed below with its short title and its description as defined by the NBA.

PO-1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO-2 Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO-3 Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations.
PO-4 Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
PO-5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations.
PO-6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
PO-7 Environment and sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
PO-8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
PO-9 Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO-10 Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as, being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO-11 Project management and finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
PO-12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.

Program Specific Outcomes (PSOs) defined by the programme. Baseline: Rational Unified Process (RUP)
PSO-1 Computing solution to solve real life problems: The graduate must be able to develop, deploy, test and maintain the software or computing hardware solutions to solve real life problems using state of the art technologies, standards, tools and programming paradigms.
PSO-2 Computer Engineering knowledge and skills: The graduate should be able to adapt Computer Engineering knowledge and skills to create career paths in industries or business organizations or institutes of repute.
DEPARTMENT OF COMPUTER ENGINEERING

Course Objectives and Outcomes


Academic Year: 2024-2025 Class: BE Course Code: CSDL7013
Program: Computer Engineering Div: - Course Name: NLP
Department: Computer Engineering Sem.: VII Faculty: Mrs. Neelam Phadnis
Course Objectives:
Sr. No. Statement
1 To understand the key concepts of NLP.
2 To learn various phases of NLP.
3 To design and implement various language models and POS tagging techniques.
4 To understand various NLP Algorithms.
5 To learn NLP applications such as Information Extraction, Sentiment Analysis, Question Answering, Machine Translation etc.
Course Outcomes:
CO's No. Abbre. Statement
CSDL7013.1 CO1 Apply various text processing techniques.
CSDL7013.2 CO2 Design language model for word level analysis.
CSDL7013.3 CO3 Model linguistic phenomena with formal grammar.
CSDL7013.4 CO4 Design, implement and analyze NLP algorithms.
CSDL7013.5 CO5 To apply NLP techniques to design real world NLP applications such as machine translation, sentiment analysis, text summarization, information extraction, Question Answering system etc.
CSDL7013.6 CO6 Implement proper experimental methodology for training and evaluating empirical NLP systems.
Course Prerequisite:
Sr. No. Pre-requisite
1 Java/Python
Teaching and Examination Scheme:
Teaching Scheme (Hrs): Theory 3 | Practical 2 | Tutorial –
Credits Assigned: Theory 3 | TW/Practical 1 | Tutorial – | Total 4
Examination Scheme (Theory): Internal Assessment – Test 1: 20, Test 2: 20, Avg.: 20 | End Sem. Exam: 80 | Exam Duration: 3 Hrs
Term Work: 25 | Pract & Oral: – | Total: 125
Term Work (Total 25 Marks) = Experiments: 15 marks + Assignments: 05 marks + Attendance (TH+PR): 05 marks.
DEPARTMENT OF COMPUTER ENGINEERING

Course Exit Form

Student Name: __________________________________ Roll No.: ______________________


Class/Semester: _______________________________ Academic Year: ________________
Course Name: ___________________________________ Course Code: __________________

Judge your ability with regard to the following points by putting a (√), on the scale of 1 (lowest) to 5 (highest),
based on the knowledge and skills you attained from this course.

Columns: Sr. No. | Your ability to | 1 (Lowest) | 2 | 3 | 4 | 5 (Highest)

1 Apply various text processing techniques.
2 Design language model for word level analysis.

3 Model linguistic phenomena with formal grammar.

4 Design, implement and analyze NLP algorithms.


To apply NLP techniques to design real world NLP
applications such as machine translation, sentiment
5 analysis, text summarization, information
extraction, Question Answering system etc.
Implement proper experimental methodology for
6 training and evaluating empirical NLP systems.

______________ _______________
Student’s Signature Date
Experiment No. 1
Aim: To study and implement preprocessing of text (Tokenization, Filtration, Script Validation).
Take any paragraph in English as well as any other natural language (Hindi/Marathi), perform
the following preprocessing steps, and attach the original text and output.
1. Tokenization
2. To lowercase
3. Remove numbers
4. Replace numbers by corresponding number words
5. Remove punctuation
6. Remove whitespaces
Theory:
Text preprocessing is a fundamental step in Natural Language Processing (NLP), transforming
raw text into a format suitable for machine learning or language models. It ensures that the data is
clean, structured, and ready for analysis. Three essential steps in text preprocessing are
tokenization, filtration, and script validation.
1. Tokenization: Tokenization involves breaking down text into smaller units called tokens,
which could be words, sentences, or subwords. For example, in English, a sentence like “I
love programming” can be tokenized into individual words: ["I", "love", "programming"].
Similarly, in Hindi or Marathi, "मुझे पढ़ना पसंद है " (Hindi for "I like reading") would be
tokenized into ["मुझे", "पढ़ना", "पसंद", "है "]. Tokenization is crucial as it helps machines
understand the boundaries of words and sentences. There are different approaches to
tokenization such as word-level, sentence-level, or even subword tokenization (Byte Pair
Encoding, WordPiece) for languages where word boundaries are unclear.
2. Filtration: Filtration is the process of removing unnecessary elements such as stop words,
punctuation marks, and other irrelevant characters from the text. Stop words are commonly
used words that usually don’t carry significant meaning (e.g., "the," "is," "and"). By
filtering these out, we can focus on the meaningful parts of the text. For example, from the
sentence “The cat is sitting on the mat,” filtration would remove "The," "is," and "on" to
retain: ["cat", "sitting", "mat"]. This process enhances computational efficiency without
losing the context. Similarly, for non-English languages, stop words in Hindi or Marathi
like "है", "के", "में" (meaning "is", "in") can be filtered out.

3. Script Validation: Script validation ensures that the text adheres to the script norms of the
language being processed. For multilingual environments, it is essential to ensure that non-
English texts are written in the correct script (e.g., Devanagari for Hindi or Marathi). This
step also involves verifying the language model's compatibility with the input script.
Inconsistent scripts (e.g., mixing Latin characters with Devanagari) may lead to errors in
downstream NLP tasks.
Example:
English text: Original: "I love to read and write code in Python!" Tokenized: ["I", "love", "to",
"read", "and", "write", "code", "in", "Python"] Filtered: ["love", "read", "write", "code", "Python"]

Hindi text: Original: "मुझे पढ़ना और कोड लिखना पसंद है !" Tokenized: ["मुझे", "पढ़ना", "और",
"कोड", "लिखना", "पसंद", "है"] Filtered: ["पढ़ना", "कोड", "लिखना", "पसंद"]

Program 1 (Tokenization):

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

text = ("Assessment consists of two class tests of 20 marks each. The first class test is to be "
        "conducted when approx. 40% syllabus is completed and second class test when additional 40% "
        "syllabus is completed. Duration of each test shall be one hour.")

print(sent_tokenize(text))
print(word_tokenize(text))

Output 1 :

Program 2 (To lowercase):

# Converting text to lowercase only needs Python's built-in str.lower()
def text_lowercase(text):
    # Convert every character of the input string to lowercase
    return text.lower()

input_str = "Hey, Did You Know That The Summer Break Is Coming? Amazing Right!!"
print(text_lowercase(input_str))

Output 2:

Program 3 (Remove numbers):

import nltk
import string
import re

def remove_numbers(text):
    result = re.sub(r'\d+', '', text)
    return result

input_str = "There are 3 balls in this bag, and 12 in the other one."
print(remove_numbers(input_str))
Output 3:
Program 4 (Replace numbers by corresponding number words):

import nltk
import inflect

p = inflect.engine()

# convert number into words
def convert_number(text):
    # split string into list of words
    temp_str = text.split()
    # initialise empty list
    new_string = []
    for word in temp_str:
        # if word is a digit, convert the digit
        # to words and append to the new_string list
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)
        # otherwise append the word as it is
        else:
            new_string.append(word)
    # join the words of new_string to form a string
    temp_str = ' '.join(new_string)
    return temp_str

input_str = 'There are 3 balls in this bag, and 12 in the other one.'
print(convert_number(input_str))

Output 4:
Program 5 (Remove punctuation):

import nltk
import string  # needed for string.punctuation

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
print(remove_punctuation(input_str))
Output 5:

Program 6 (Remove whitespaces):

import nltk

def remove_whitespace(text):
    return " ".join(text.split())

input_str = "   we don't need the given questions  "
print(remove_whitespace(input_str))

Output 6:

Conclusion: Hence we have implemented preprocessing of text (tokenization, filtration, script validation) on a paragraph in English as well as in another natural language (Hindi/Marathi), performed the listed preprocessing steps, and attached the original text along with the output.
Experiment No. 2
Aim: To study and implement preprocessing of a document (Stop Word Removal, Stemming).
1.) Take any document in English and perform the following preprocessing steps:
1. Stop word removal
2. Stemming using the Porter, Lancaster and Snowball stemmers
3. Lemmatization
2.) Take any document in a natural language (Hindi/Marathi) and perform:
1. Stop word removal
2. Stemming

Attach both the original document and the preprocessed document.

Theory:
1. Natural Language Processing (NLP): NLP is a field of artificial intelligence focused on the
interaction between computers and humans through natural language. It involves the application
of computational techniques to analyze and synthesize natural language.

2. Preprocessing in NLP: Preprocessing is a crucial step in NLP that transforms raw text into a
form that can be analyzed by machine learning models. Common preprocessing steps include stop
word removal, stemming, and lemmatization.

3. Stop Word Removal: Stop words are common words like 'the', 'is', 'in', and 'at' which often do
not contribute much to the meaning of a sentence and are removed during preprocessing.

4. Stemming: Stemming is the process of reducing words to their root form. Common algorithms
for stemming include:

 Porter Stemmer: A widely used stemming algorithm that works by applying a series of
rules to strip suffixes from words.
 Lancaster Stemmer: An aggressive stemming algorithm known for its simplicity.
 Snowball Stemmer: Also known as the Porter2 stemmer, it’s an improvement over the
original Porter stemmer.

5. Lemmatization: Lemmatization reduces words to their base form (lemma) based on dictionary
definitions, unlike stemming which trims words mechanically. It ensures that words are
transformed into meaningful root forms (e.g., 'better' to 'good').
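
To make the difference between the three stemmers and the lemmatizer concrete before processing a whole document, a small comparison on a few standalone words may help; the word list below is chosen purely for illustration.

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer

words = ["studies", "better", "running", "corpora"]
porter, lancaster, snowball = PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

for w in words:
    # The lemmatizer needs a POS hint ('a' = adjective) to map "better" to "good"
    print(w, "->", porter.stem(w), lancaster.stem(w), snowball.stem(w),
          lemmatizer.lemmatize(w, pos='a' if w == "better" else 'n'))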

Program 1 :

!pip install nltk PyPDF2
from google.colab import files
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import PyPDF2
import io
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer

def extract_text_from_pdf(pdf_file):
    text = ""
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        text += page.extract_text() + " "
    return text

def remove_stopwords(text):
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    return filtered_tokens

def perform_stemming(tokens):
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    snowball = SnowballStemmer("english")
    porter_stems = [porter.stem(word) for word in tokens]
    lancaster_stems = [lancaster.stem(word) for word in tokens]
    snowball_stems = [snowball.stem(word) for word in tokens]
    return porter_stems, lancaster_stems, snowball_stems

def perform_lemmatization(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(word) for word in tokens]
    return lemmas

uploaded = files.upload()
for filename in uploaded.keys():
    with io.BytesIO(uploaded[filename]) as pdf_file:
        pdf_text = extract_text_from_pdf(pdf_file)
        filtered_words = remove_stopwords(pdf_text)
        print("Filtered Tokens (Stop Words Removed):", filtered_words)
        porter_stems, lancaster_stems, snowball_stems = perform_stemming(filtered_words)
        print("\nPorter Stems:", porter_stems)
        print("Lancaster Stems:", lancaster_stems)
        print("Snowball Stems:", snowball_stems)
        lemmas = perform_lemmatization(filtered_words)
        print("\nLemmatized Words:", lemmas)

Output 1 :
The Benefits of Nature.pdf(application/pdf) - 362548 bytes, last modified: 9/27/2024 - 100% done
Saving The Benefits of Nature.pdf to The Benefits of Nature.pdf

Filtered Tokens (Stop Words Removed): ['Benefits', 'Nature', 'Nature', 'essential', 'human', 'well', '-
being', '.', 'Spending', 'time', 'outdoo', 'rs', 'significantly', 'improve', 'mental', 'health', '.', 'Research',
'shows', 'nature', 'reduces', 'stress', ',', 'anxiety', ',', 'depression', '.', 'Additionally', ',', 'nature',
'encourages', 'physical', 'activity', ',', 'promoting', 'healthier', 'lifestyle', '.', 'Activities', 'like', 'hiking',
',', 'biking', ',', 'simply', 'walking', 'park', 'boost', 'cardiovascular', 'health', 'improve', 'mood', '.',
'Moreover', ',', 'exposure', 'natural', 'environments', 'enhances', 'creativity', 'cognitive', 'function', '.',
'Thus', ',', 'incorporating', 'nature', 'daily', 'life', 'beneficial', 'physical', 'mental', 'health', '.',
'Conclusion', 'Embracing', 'nature', 'crucial', 'balanced', 'life', '.', 'Whether', "'s", 'stroll', 'park',
'weekend', 'hike', ',', 'make', 'time', 'great', 'outdoors', '!']

Porter Stems: ['benefit', 'natur', 'natur', 'essenti', 'human', 'well', '-be', '.', 'spend', 'time', 'outdoo', 'rs',
'significantli', 'improv', 'mental', 'health', '.', 'research', 'show', 'natur', 'reduc', 'stress', ',', 'anxieti', ',',
'depress', '.', 'addit', ',', 'natur', 'encourag', 'physic', 'activ', ',', 'promot', 'healthier', 'lifestyl', '.', 'activ',
'like', 'hike', ',', 'bike', ',', 'simpli', 'walk', 'park', 'boost', 'cardiovascular', 'health', 'improv', 'mood', '.',
'moreov', ',', 'exposur', 'natur', 'environ', 'enhanc', 'creativ', 'cognit', 'function', '.', 'thu', ',', 'incorpor',
'natur', 'daili', 'life', 'benefici', 'physic', 'mental', 'health', '.', 'conclus', 'embrac', 'natur', 'crucial',
'balanc', 'life', '.', 'whether', "'s", 'stroll', 'park', 'weekend', 'hike', ',', 'make', 'time', 'great', 'outdoor',
'!']

Lancaster Stems: ['benefit', 'nat', 'nat', 'ess', 'hum', 'wel', '-being', '.', 'spend', 'tim', 'outdoo', 'rs', 'sign',
'improv', 'ment', 'heal', '.', 'research', 'show', 'nat', 'reduc', 'stress', ',', 'anxy', ',', 'depress', '.', 'addit', ',',
'nat', 'enco', 'phys', 'act', ',', 'promot', 'healthy', 'lifestyl', '.', 'act', 'lik', 'hik', ',', 'bik', ',', 'simply', 'walk',
'park', 'boost', 'cardiovascul', 'heal', 'improv', 'mood', '.', 'moreov', ',', 'expos', 'nat', 'environ', 'enh',
'cre', 'cognit', 'funct', '.', 'thu', ',', 'incorp', 'nat', 'dai', 'lif', 'benef', 'phys', 'ment', 'heal', '.', 'conclud',
'embrac', 'nat', 'cruc', 'bal', 'lif', '.', 'wheth', "'s", 'stroll', 'park', 'weekend', 'hik', ',', 'mak', 'tim', 'gre',
'outdo', '!']

Snowball Stems: ['benefit', 'natur', 'natur', 'essenti', 'human', 'well', '-be', '.', 'spend', 'time', 'outdoo',
'rs', 'signific', 'improv', 'mental', 'health', '.', 'research', 'show', 'natur', 'reduc', 'stress', ',', 'anxieti', ',',
'depress', '.', 'addit', ',', 'natur', 'encourag', 'physic', 'activ', ',', 'promot', 'healthier', 'lifestyl', '.', 'activ',
'like', 'hike', ',', 'bike', ',', 'simpli', 'walk', 'park', 'boost', 'cardiovascular', 'health', 'improv', 'mood', '.',
'moreov', ',', 'exposur', 'natur', 'environ', 'enhanc', 'creativ', 'cognit', 'function', '.', 'thus', ',', 'incorpor',
'natur', 'daili', 'life', 'benefici', 'physic', 'mental', 'health', '.', 'conclus', 'embrac', 'natur', 'crucial',
'balanc', 'life', '.', 'whether', "'s", 'stroll', 'park', 'weekend', 'hike', ',', 'make', 'time', 'great', 'outdoor',
'!']

Lemmatized Words: ['Benefits', 'Nature', 'Nature', 'essential', 'human', 'well', '-being', '.', 'Spending',
'time', 'outdoo', 'r', 'significantly', 'improve', 'mental', 'health', '.', 'Research', 'show', 'nature',
'reduces', 'stress', ',', 'anxiety', ',', 'depression', '.', 'Additionally', ',', 'nature', 'encourages', 'physical',
'activity', ',', 'promoting', 'healthier', 'lifestyle', '.', 'Activities', 'like', 'hiking', ',', 'biking', ',', 'simply',
'walking', 'park', 'boost', 'cardiovascular', 'health', 'improve', 'mood', '.', 'Moreover', ',', 'exposure',
'natural', 'environment', 'enhances', 'creativity', 'cognitive', 'function', '.', 'Thus', ',', 'incorporating',
'nature', 'daily', 'life', 'beneficial', 'physical', 'mental', 'health', '.', 'Conclusion', 'Embracing', 'nature',
'crucial', 'balanced', 'life', '.', 'Whether', "'s", 'stroll', 'park', 'weekend', 'hike', ',', 'make', 'time', 'great',
'outdoors', '!']

Program 2:

!pip install nltk PyPDF2
from google.colab import files
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import PyPDF2
import io
nltk.download('punkt')

hindi_stopwords = {'एक', 'से', 'है', 'का', 'के', 'लिए', 'यह', 'और', 'में', 'को', 'तो', 'की', 'पर'}

def extract_text_from_pdf(pdf_file):
    text = ""
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        text += page.extract_text() + " "
    return text

def hindi_stemmer(word):
    # Common Hindi suffixes (inflectional endings and vowel matras)
    suffixes = ['ता', 'ने', 'ना', 'ो', 'े', 'ी', 'ा', 'ें', 'ों', 'ीं']
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def process_hindi_text(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in hindi_stopwords]
    stemmed_tokens = [hindi_stemmer(word) for word in filtered_tokens]
    return tokens, filtered_tokens, stemmed_tokens

uploaded = files.upload()
for filename in uploaded.keys():
    with io.BytesIO(uploaded[filename]) as pdf_file:
        pdf_text = extract_text_from_pdf(pdf_file)
        original_tokens, filtered_tokens, stemmed_tokens = process_hindi_text(pdf_text)
        print("\nOriginal Tokens (Hindi):", original_tokens)
        print("\nFiltered Tokens (Stop Words Removed):", filtered_tokens)
        print("\nStemmed Tokens:", stemmed_tokens)

Output 2:

Original Tokens (Hindi): ['Ůक', 'ो', 'लोत', 'क', 'ोे ', 'िाभ', 'Ůक', 'ो', 'लोत', 'मानव', 'कʞोाण', 'क', 'ोे ',
'लोिए', 'अȑोावʴक', 'है ।', 'बाहरी', 'वातावरण', 'मŐ', 'समय', 'लोबताने ', 'से ', 'मालनसक', 'ˢोा˖', 'मŐ',
'महȕपूणŊ', 'सु धार', 'हो', 'सकता', 'है ।', 'शोध', 'से ', 'पता', 'चिता', 'है ', 'लोक', 'Ůक', 'ो', 'लोत', 'मŐ', 'समय',
'लोबताने ', 'से ', 'तनाव', ',', 'लोचता', ',', 'और', 'अवसाद', 'मŐ', 'कमी', 'आती', 'है ।', 'इसक', 'ोे ', 'अिावा', ',',
'Ůक', 'ो', 'लोत', 'शारीįरक', 'लिलतलवधयो', 'को', 'बढ़ावा', 'दे ती', 'है ', ',', 'जो', 'एक', 'ˢ̾थ', 'जीवनशैिी', 'को',
'ŮोोȖोालोहत', 'करती', 'है ।', 'जैसे', 'लोक', 'टŌोे लोक', 'ो', 'ि', ',', 'सालइकि', 'चिाना', ',', 'या', 'पाकŊ',
'मŐ', 'टहिना', 'şदय', 'ˢोा˖', 'को', 'बेहतर', 'बनाते ', 'हœ', 'और', 'मनोİ̾लथत', 'मŐ', 'सु धार', 'करते ', 'हœ।',
'साथ', 'ही', ',', 'Ůोाक', 'ो', 'लोतक', 'वातावरण', 'क', 'ोे ', 'सपकŊ', 'मŐ', 'आने ', 'से', 'रचनाȏकता', 'और',
'सǒोानाȏक', 'Ɨमता', 'मŐ', 'वİȠ', 'होती', 'है ।', 'इस', 'Ůकार', ',', 'दै लोनक', 'जीवन', 'मŐ', 'Ůक', 'ो', 'लोत',
'को', 'शालोमि', 'करना', 'शारीįरक', 'और', 'मालनसक', 'ˢोा˖', 'दोनो', 'क', 'ोे ', 'लोिए', 'िाभकारी', 'है ।',
'लोनʺषŊ', 'सतुलोित', 'जीवन', 'क', 'ोे ', 'लोिए', 'Ůक', 'ो', 'लोत', 'को', 'अपनाना', 'महȕपूणŊ', 'है ।', 'चाहे ',
'वह', 'पाकŊ', 'मŐ', 'एक', 'सैर', 'हो', 'या', 'सɑोाहात', 'की', 'टŌोे लोक', 'ो', 'ि', ',', 'बाहरी', 'वातावरण', 'क',
'ोे ', 'साथ', 'समय', 'लोबताने ', 'क', 'ोे ', 'लोिए', 'समय', 'अवʴ', 'लोनकािŐ', '!']
Filtered Tokens (Stop Words Removed): ['Ůक', 'ो', 'लोत', 'क', 'ोे ', 'िाभ', 'Ůक', 'ो', 'लोत', 'मानव',
'कʞोाण', 'क', 'ोे ', 'लोिए', 'अȑोावʴक', 'है ।', 'बाहरी', 'वातावरण', 'मŐ', 'समय', 'लोबताने ', 'मालनसक', 'ˢोा˖',
'मŐ', 'महȕपूणŊ', 'सुधार', 'हो', 'सकता', 'है ।', 'शोध', 'पता', 'चिता', 'लोक', 'Ůक', 'ो', 'लोत', 'मŐ', 'समय',
'लोबताने ', 'तनाव', ',', 'लोचता', ',', 'अवसाद', 'मŐ', 'कमी', 'आती', 'है ।', 'इसक', 'ोे ', 'अिावा', ',', 'Ůक', 'ो',
'लोत', 'शारीįरक', 'लिलतलवधयो', 'बढ़ावा', 'दे ती', ',', 'जो', 'ˢ̾थ', 'जीवनशैिी', 'ŮोोȖोालोहत', 'करती', 'है ।',
'जैसे', 'लोक', 'टŌोे लोक', 'ो', 'ि', ',', 'सालइकि', 'चिाना', ',', 'या', 'पाकŊ', 'मŐ', 'टहिना', 'şदय', 'ˢोा˖',
'बेहतर', 'बनाते ', 'हœ', 'मनोİ̾लथत', 'मŐ', 'सुधार', 'करते ', 'हœ।', 'साथ', 'ही', ',', 'Ůोाक', 'ो', 'लोतक',
'वातावरण', 'क', 'ोे ', 'सपकŊ', 'मŐ', 'आने ', 'रचनाȏकता', 'सǒोानाȏक', 'Ɨमता', 'मŐ', 'वİȠ', 'होती', 'है ।',
'इस', 'Ůकार', ',', 'दै लोनक', 'जीवन', 'मŐ', 'Ůक', 'ो', 'लोत', 'शालोमि', 'करना', 'शारीįरक', 'मालनसक', 'ˢोा˖',
'दोनो', 'क', 'ोे ', 'लोिए', 'िाभकारी', 'है ।', 'लोनʺषŊ', 'सतुलोित', 'जीवन', 'क', 'ोे ', 'लोिए', 'Ůक', 'ो', 'लोत',
'अपनाना', 'महȕपूणŊ', 'है ।', 'चाहे ', 'वह', 'पाकŊ', 'मŐ', 'सैर', 'हो', 'या', 'सɑोाहात', 'टŌोे लोक', 'ो', 'ि', ',',
'बाहरी', 'वातावरण', 'क', 'ोे ', 'साथ', 'समय', 'लोबताने ', 'क', 'ोे ', 'लोिए', 'समय', 'अवʴ', 'लोनकािŐ', '!']

Stemmed Tokens: ['Ůक', 'ो', 'लोत', 'क', '', 'िाभ', 'Ůक', 'ो', 'लोत', 'मानव', 'कʞोाण', 'क', '', 'लोिए',
'अȑोावʴक', 'है ।', 'बाहर', 'वातावरण', 'मŐ', 'समय', 'लोबता', 'मालनसक', 'ˢोा˖', 'मŐ', 'महȕपूणŊ', 'सुधार',
'ह', 'सक', 'है ।', 'शोध', 'प', 'चि', 'लोक', 'Ůक', 'ो', 'लोत', 'मŐ', 'समय', 'लोबता', 'तनाव', ',', 'लोच', ',',
'अवसाद', 'मŐ', 'कम', 'आत', 'है ।', 'इसक', '', 'अिाव', ',', 'Ůक', 'ो', 'लोत', 'शारीįरक', 'लिलतलवधय', 'बढ़ाव',
'दे त', ',', 'ज', 'ˢ̾थ', 'जीवनशै ि', 'ŮोोȖोालोहत', 'करत', 'है ।', 'जैस', 'लोक', 'टŌोे लोक', 'ो', 'ि', ',', 'सालइकि',
'चिा', ',', 'य', 'पाकŊ', 'मŐ', 'टहि', 'şदय', 'ˢोा˖', 'बेहतर', 'बनात', 'हœ', 'मनोİ̾लथत', 'मŐ', 'सु धार', 'करत',
'हœ।', 'साथ', 'ह', ',', 'Ůोाक', 'ो', 'लोतक', 'वातावरण', 'क', '', 'सपकŊ', 'मŐ', 'आ', 'रचनाȏक', 'सǒोानाȏक',
'Ɨम', 'मŐ', 'वİȠ', 'होत', 'है ।', 'इस', 'Ůकार', ',', 'दै लोनक', 'जीवन', 'मŐ', 'Ůक', 'ो', 'लोत', 'शालोमि', 'कर',
'शारीįरक', 'मालनसक', 'ˢोा˖', 'दोन', 'क', '', 'लोिए', 'िाभकार', 'है ।', 'लोनʺषŊ', 'सतुलोित', 'जीवन', 'क', '',
'लोिए', 'Ůक', 'ो', 'लोत', 'अपना', 'महȕपूणŊ', 'है ।', 'चाह', 'वह', 'पाकŊ', 'मŐ', 'सैर', 'ह', 'य', 'सɑोाहात',
'टŌोे लोक', 'ो', 'ि', ',', 'बाहर', 'वातावरण', 'क', '', 'साथ', 'समय', 'लोबता', 'क', '', 'लोिए', 'समय', 'अवʴ',
'लोनकािŐ', '!']

Conclusion:
In this experiment, we successfully implemented document preprocessing techniques, including
stop word removal and stemming, for both English and Hindi texts. These methods help in
reducing the dimensionality of textual data and preparing it for further natural language processing
tasks. The results demonstrate the effectiveness of preprocessing in simplifying text while
retaining its essential meaning.
Experiment No. 3
Aim: Perform morphological analysis using various stemmers for English as well as any Indian
language.

Theory :

Morphological Analysis: Morphological analysis is the study of the structure and form of words
in a language. It involves examining the internal structure of words and how they can be modified
to convey different meanings. The smallest units of meaning within words are called morphemes.
Morphological analysis can be divided into two main categories:

1. Derivational Morphology: This involves creating new words by adding prefixes and suffixes (e.g., happy → happiness).

2. Inflectional Morphology: This modifies a word to express different grammatical categories (e.g., run → runs, running).

Stemming: Stemming, a subfield of morphological analysis, focuses on reducing words to their


base or root forms (stems). This process is essential in various natural language processing (NLP)
applications, such as information retrieval, text mining, and machine learning, as it helps in
normalizing different word forms and improving search accuracy.

Here are three widely used stemmers:

Porter Stemmer : The Porter Stemmer is one of the most well-known stemming algorithms. It
was developed by Martin Porter in 1980 and is specifically designed for the English language. The
algorithm applies a series of rules (a set of heuristics) to reduce inflected words to their base forms.

Algorithm Process: The Porter Stemmer consists of several steps that involve:

Step 1: Removing common suffixes such as -ed, -ing, -ly, etc.

Step 2: Reducing words based on specific rules and conditions.

Step 3: Removing any remaining suffixes while considering the length and structure of the word.

Snowball Stemmer : The Snowball Stemmer is an improved version of the Porter Stemmer,
developed by Martin Porter as well. It offers a more extensive and flexible approach to stemming,
supporting multiple languages. The Snowball framework includes algorithms for not only English
but also languages like French, Spanish, German, and many others.

Algorithm Process: The Snowball Stemmer employs similar principles as the Porter Stemmer but
includes additional rules and heuristics to cater to various languages. It is more efficient and
produces better results by reducing the ambiguity associated with stemming.

Indic Stemmer : The Indic Stemmer is specifically designed for Indian languages, including
Hindi, Marathi, Bengali, and others. These languages exhibit complex morphological structures
and rich inflections, making traditional stemming algorithms like Porter and Snowball less
effective.
Algorithm Process: The Indic Stemmer applies a set of rules tailored to the unique characteristics of Indian languages. It handles:

Inflectional Forms: Recognizing and reducing various forms of verbs, nouns, and adjectives.
Compound Words: Decomposing compound words into their constituent morphemes.
A simple rule-based sketch of this suffix-stripping idea is shown below.
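
This is a minimal, illustrative sketch only; the suffix list is a small assumed sample, not an exhaustive or standard resource, and simple_hindi_stem is a hypothetical helper name.

# Illustrative rule-based Hindi suffix stripper (assumed, simplified suffix list)
hindi_suffixes = ['ियों', 'ियाँ', 'ाएं', 'ों', 'ें', 'ता', 'ती', 'ते', 'ना', 'ी', 'े', 'ा']

def simple_hindi_stem(word):
    # Strip the longest matching suffix, keeping at least two characters of the stem
    for suffix in sorted(hindi_suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

for w in ['लड़कियों', 'किताबें', 'दौड़ता', 'खेलना']:
    print(w, '->', simple_hindi_stem(w))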

Code :

# Install required packages
!pip install nltk
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
!pip install indic-nlp-library
!git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
import sys
from indicnlp import common
from indicnlp import loader
from indicnlp.tokenize import indic_tokenize

# Set up the resource path to the downloaded resources
INDIC_NLP_RESOURCES = "./indic_nlp_resources"
common.set_resources_path(INDIC_NLP_RESOURCES)
loader.load()

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

english_words = ["running", "jumps", "easily", "faster"]
print("Stemming for English Words:")
for word in english_words:
    print(f"Original Word: {word}")
    print(f"Porter Stemmer: {porter.stem(word)}")
    print(f"Lancaster Stemmer: {lancaster.stem(word)}")
    print(f"Snowball Stemmer: {snowball.stem(word)}")
    print()

hindi_sentence = "राम दौड़ रहा है"  # "Ram is running"
hindi_tokens = indic_tokenize.trivial_tokenize(hindi_sentence, 'hi')
print("\nStemming for Hindi Words:")
for word in hindi_tokens:
    # Naive stemming: drop the last character of longer words
    stemmed_word = word[:-1] if len(word) > 3 else word
    print(f"Original Word: {word}, Stemmed Word: {stemmed_word}")

marathi_sentence = "राम धावत आहे"  # "Ram is running" in Marathi
marathi_tokens = indic_tokenize.trivial_tokenize(marathi_sentence, 'mr')
print("\nStemming for Marathi Words:")
for word in marathi_tokens:
    stemmed_word = word[:-1] if len(word) > 3 else word
    print(f"Original Word: {word}, Stemmed Word: {stemmed_word}")

Output :
Conclusion : Hence we have implemented morphological analysis using various stemmers for
English as well as any Indian language.
Experiment No. 4

Aim : To generate N grams from sentences for English and any Indian language.

Theory :
N-grams are contiguous sequences of n items from a given sample of text or speech. They are
widely used in NLP tasks such as text analysis, language modeling, and information retrieval. An
N-gram of size 1 is referred to as a "unigram", size 2 as a "bigram", and size 3 as a "trigram". For
example:
For the sentence: "I love NLP"
Unigrams: ["I", "love", "NLP"]
Bigrams: ["I love", "love NLP"]
Trigrams: ["I love NLP"]

N-grams capture the local context within a text, enabling a better understanding of word
dependencies and patterns. They are crucial for machine learning models that rely on text features.
For this experiment, we will use English and Hindi as the target languages.
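
Since N-gram counts are what language models and retrieval systems actually consume, a brief sketch of counting bigram frequencies with collections.Counter may be useful; the sample sentence is purely illustrative.

import nltk
from nltk import ngrams
from collections import Counter
nltk.download('punkt')

text = "I love NLP and I love machine learning"
tokens = nltk.word_tokenize(text)

# Count how often each bigram occurs in the text
bigram_counts = Counter(ngrams(tokens, 2))
for bigram, count in bigram_counts.most_common(3):
    print(bigram, count)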

Program :
The following code uses Python along with the nltk library for generating N-grams from sentences
in English and Hindi:
import nltk
from nltk import ngrams
from collections import Counter
nltk.download('punkt')

def generate_ngrams(sentence, n):
    tokens = nltk.word_tokenize(sentence)
    n_grams = list(ngrams(tokens, n))
    return n_grams

english_sentence = "I am learning Natural Language Processing."
n = 2
english_bigrams = generate_ngrams(english_sentence, n)
print(f"English {n}-grams: {english_bigrams}")

hindi_sentence = "मैं प्राकृतिक भाषा संसाधन सीख रहा हूँ।"  # "I am learning Natural Language Processing."
hindi_bigrams = generate_ngrams(hindi_sentence, n)
print(f"Hindi {n}-grams: {hindi_bigrams}")

Output :

Conclusion : The experiment successfully generates N-grams for sentences in English and Hindi,
demonstrating the use of NLP techniques to extract meaningful word sequences.
Experiment No. 5

Aim: Perform POS tagging for English and Hindi using a tagger.

Theory:
Part-of-Speech (POS) Tagging is the process of assigning a part of speech label to each word in a
given text based on its context and role in the sentence. POS tagging is an essential step in
various Natural Language Processing (NLP) applications because it helps in understanding the
syntactic and grammatical structure of a sentence.
What is a Part of Speech?
A part of speech is a category to which a word is assigned based on its syntactic function.
Examples of parts of speech include:
1. Noun (NN)
2. Verb (VB)
3. Adjective (JJ)
4. Adverb (RB)
5. Pronoun (PRP)
6. Preposition (IN)
7. Conjunction (CC)
8. Determiner (DT)
How does POS Tagging work?
POS tagging typically relies on the context of the word in a sentence and a probabilistic model or
set of rules to decide its part of speech. Two popular approaches are:
Rule-Based POS Tagging:
Uses predefined grammar rules to determine the POS tag.
E.g., if a word ends in "-ing," it is likely to be a verb (e.g., "running").
Statistical POS Tagging:
Uses machine learning models like Hidden Markov Models (HMM) or neural networks.
These models are trained on large tagged corpora and learn the probability distribution of word
sequences.
Hybrid POS Tagging:
Combines rule-based and statistical approaches for higher accuracy.
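
As a small illustration of the rule-based approach described above, NLTK's RegexpTagger can assign tags from simple suffix patterns; the pattern list below is a toy example with NN as the catch-all default, not a complete rule set.

from nltk.tag import RegexpTagger

# Toy suffix-based tagging rules; the last pattern is a catch-all default
patterns = [
    (r'.*ing$', 'VBG'),      # gerunds, e.g. "running"
    (r'.*ed$', 'VBD'),       # past tense, e.g. "jumped"
    (r'.*ly$', 'RB'),        # adverbs, e.g. "quickly"
    (r'^(the|a|an)$', 'DT'), # determiners
    (r'.*', 'NN'),           # default: noun
]
rule_tagger = RegexpTagger(patterns)
print(rule_tagger.tag(['the', 'dog', 'quickly', 'jumped']))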
Example of POS Tagging :
"The quick brown fox jumps over the lazy dog."
In this sentence:
"The" is tagged as a Determiner (DT).
"quick" and "brown" are both tagged as Adjectives (JJ).
"fox" is tagged as a Noun (NN).
"jumps" is tagged as a Verb (VBZ) in the 3rd person singular form.
"over" is tagged as a Preposition (IN).
"the" is again tagged as a Determiner (DT).
"lazy" is tagged as an Adjective (JJ).
"dog" is tagged as a Noun (NN).

Code:

!pip install nltk
!pip install stanza
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import stanza
stanza.download('hi')  # Download Hindi model

from nltk import pos_tag, word_tokenize

nlp_hi = stanza.Pipeline('hi')

def english_pos_tagging(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    return tagged

def hindi_pos_tagging(text):
    doc = nlp_hi(text)
    tagged = [(word.text, word.upos) for sentence in doc.sentences for word in sentence.words]
    return tagged

english_text = "The quick brown fox jumps over the lazy dog."
hindi_text = "गिलास में पानी है।"

english_tags = english_pos_tagging(english_text)
hindi_tags = hindi_pos_tagging(hindi_text)
print("English POS Tags:", english_tags)
print("Hindi POS Tags:", hindi_tags)

Output:

Conclusion:
In this experiment, we successfully performed Part-of-Speech (POS) tagging for both English
and Hindi using two different NLP libraries.
Experiment No. 6
Aim: Perform chunking of text in english language

Theory: Chunking is defined as the process of natural language processing used to identify parts
of speech and short phrases present in a given sentence. Recalling our good old English grammar
classes back in school, note that there are eight parts of speech namely the noun, verb, adjective,
adverb, preposition, conjunction, pronoun, and interjection. Also, in the above definition of
chunking, short phrases refer to the phrases formed by including any of these parts of speech.

For example, chunking can be done to identify and thus group noun phrases or nouns alone,
adjectives or adjective phrases, and so on. Consider the sentence below:

“I had burgers and pastries for breakfast.”

In this case, if we wish to group or chunk noun phrases, we will get “burgers”, “pastries” and “breakfast”, which are the nouns or noun groups of the sentence. A short sketch of such noun-phrase chunking is shown below.
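
The sketch uses NLTK's RegexpParser; the chunk grammar (an optional determiner, any adjectives, then a noun) is a minimal assumption rather than the only possible rule.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag, RegexpParser

sentence = "I had burgers and pastries for breakfast."
tagged = pos_tag(word_tokenize(sentence))

# Chunk grammar: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
chunker = RegexpParser(grammar)
tree = chunker.parse(tagged)

# Print only the noun-phrase chunks found by the grammar
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(" ".join(word for word, tag in subtree.leaves()))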

Where is chunking used?

Chunking is used to get the required phrases from a given sentence. However, POS tagging can
be used only to spot the parts of speech that every word of the sentence belongs to.

When we have loads of descriptions or modifications around a particular word or the phrase of
our interest, we use chunking to grab the required phrase alone, ignoring the rest around it.
Hence, chunking paves a way to group the required phrases and exclude all the modifiers around
them which are not necessary for our analysis. Interestingly, this process of chunking in NLP is
extended to various other applications; for instance, to group fruits of a specific category, say,
fruits rich in proteins as a group, fruits rich in vitamins as another group, and so on. Besides,
chunking can also be used to group similar cars, say, cars supporting auto-gear into one group
and the others which support manual gear into another chunk and so on.

Types of Chunking

There are, broadly, two types of chunking:

Chunking up: moving towards larger, more general chunks of information (e.g., grouping words into phrases).

Chunking down: moving towards smaller, more specific pieces of information within a chunk.
Program:

Dependencies:

pip install nltk

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

Code:

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Sample text
text = "Chunking is a process that helps people understand and remember information better."

# Step 1: Tokenize the text
tokens = word_tokenize(text)

# Step 2: Perform Part-of-Speech Tagging
tagged = pos_tag(tokens)

# Step 3: Chunking using Named Entity Recognition
chunked = ne_chunk(tagged)

# Step 4: Print chunked output in a more understandable format
for subtree in chunked:
    # Check if subtree is a named entity
    if hasattr(subtree, 'label'):
        entity_name = " ".join([token for token, pos in subtree.leaves()])
        entity_type = subtree.label()
        print(f"Entity: {entity_name}, Type: {entity_type}")
    else:
        # If it's not a named entity, print the token and POS tag
        print(f"Token: {subtree[0]}, POS Tag: {subtree[1]}")

Output:

Conclusion: Hence we have implemented chunking of text in English language.


Experiment No. 7

Aim: Perform Named Entity Recognition for English language.

Theory:

Named Entity Recognition:

Named Entity Recognition (NER) is a technique in natural language processing (NLP) that focuses on identifying and classifying entities. The purpose of NER is to automatically extract structured information from unstructured text, enabling machines to understand and categorize entities in a meaningful manner for applications like text summarization, question answering, and knowledge graph construction. This section explores the fundamentals, methods and implementation of the NER model.

What is Named Entity Recognition (NER)?

Named entity recognition (NER) is also referred to as entity identification, entity chunking, and entity extraction. NER is the component of information extraction that aims to identify and categorize named entities within unstructured text. NER involves the identification of key information in the text and its classification into a set of predefined categories. An entity is the thing that is consistently talked about or referred to in the text, such as person names, organizations, locations, time expressions, quantities, percentages and other predefined categories.

NER systems find applications across various domains, including question answering, information retrieval and machine translation. NER plays an important role in enhancing the precision of other NLP tasks like part-of-speech tagging and parsing. At its core, NER is just a two-step process; below are the two steps that are involved:
● Detecting the entities from the text
● Classifying them into different categories

Ambiguity in NER

● For a person, the category definition is intuitively quite clear, but for computers, there
is some ambiguity in classification. Let’s look at some ambiguous examples:
○ England (Organization) won the 2019 world cup vs The 2019 world cup
happened in England (Location).
○ Washington (Location) is the capital of the US vs The first president of the
US was Washington (Person).

How Named Entity Recognition (NER) works?

The working of Named Entity Recognition is discussed below:


● The NER system analyses the entire input text to identify and locate the named entities.
● The system then identifies sentence boundaries by considering capitalization rules, treating a word that starts with a capital letter as a possible beginning of a new sentence. Knowing sentence boundaries aids in contextualizing entities within the text, allowing the model to understand relationships and meanings.
● NER can be trained to classify entire documents into different types, such as invoices, receipts, or passports. Document classification enhances the versatility of NER, allowing it to adapt its entity recognition based on the specific characteristics and context of different document types.
● NER employs machine learning algorithms, including supervised learning, to analyze labeled datasets. These datasets contain examples of annotated entities, guiding the model in recognizing similar entities in new, unseen data.
● Through multiple training iterations, the model refines its understanding of contextual features, syntactic structures, and entity patterns, continuously improving its accuracy over time.
● The model’s ability to adapt to new data allows it to handle variations in language, context, and entity types, making it more robust and effective.

Named Entity Recognition (NER) Methods:

Lexicon Based Method :


The lexicon-based NER method uses a dictionary with a list of words or terms. The process involves checking if any of these words are present in a given text. However, this approach isn’t commonly used because it requires constant updating and careful maintenance of the dictionary to stay accurate and effective.

Rule Based Method :

The rule-based NER method uses a set of predefined rules that guides the extraction of information. These rules are based on patterns and context. Pattern-based rules focus on the structure and form of words, looking at their morphological patterns. On the other hand, context-based rules consider the surrounding words or the context in which a word appears within the text document. This combination of pattern-based and context-based rules enhances the precision of information extraction in Named Entity Recognition (NER). A small pattern-based sketch is shown below.
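
This sketch is purely illustrative, not a production NER system: two hand-written regular expressions, one for dates and one for capitalized word sequences, stand in for pattern rules. Naive rules like these will also flag words such as month names, which is part of why rule-based systems need careful tuning.

import re

text = "Elon Musk founded SpaceX in Hawthorne on 14 March 2002."

# Pattern rule 1: a simple date pattern (day month year)
date_pattern = r'\b\d{1,2}\s+(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\b'
# Pattern rule 2: sequences of capitalized words (a rough proxy for names/organizations)
name_pattern = r'\b(?:[A-Z][a-zA-Z]+)(?:\s+[A-Z][a-zA-Z]+)*\b'

for match in re.finditer(date_pattern, text):
    print("DATE:", match.group())
for match in re.finditer(name_pattern, text):
    print("CANDIDATE ENTITY:", match.group())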

Machine Learning-Based Method:

Multi-Class Classification with Machine Learning Algorithms


One approach is to train the model for multi-class classification using different machine learning
algorithms, but this requires a lot of labelled data. In addition to labelling, the model also requires a
deep understanding of context to deal with the ambiguity of sentences, which makes this a
challenging task for a simple machine learning algorithm.

Conditional Random Field (CRF)


Conditional Random Fields (CRFs) are implemented in several NLP taggers, including NLTK's
CRFTagger. A CRF is a probabilistic model that can be used to model sequential data such as
words, and it can capture the context of the sentence. In this model, the input is the sequence of
tokens (together with features describing them) and the output is the most likely sequence of
entity labels.
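
For illustration, a toy CRF could be trained with the sklearn-crfsuite package (this sketch assumes sklearn-crfsuite is installed; the two training sentences, their IOB labels and the feature function are made up):

import sklearn_crfsuite

def word2features(sent, i):
    # Very small feature set: the word itself, its casing, and its neighbours.
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "prev_word": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

train_sents = [["England", "won", "the", "cup"], ["He", "lives", "in", "England"]]
train_labels = [["B-ORG", "O", "O", "O"], ["O", "O", "O", "B-LOC"]]

X_train = [[word2features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, train_labels)

test = ["England", "won", "again"]
print(crf.predict([[word2features(test, i) for i in range(len(test))]]))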

Deep Learning Based Method:

Deep learning based NER systems are much more accurate than the previous methods because
they can compose representations of words. This is due to the use of word embeddings, which
capture the semantic and syntactic relationships between words. Deep learning NER can also
learn topic-specific as well as high-level features automatically, which makes it applicable to
multiple tasks. Since deep learning does most of the repetitive feature-engineering work itself,
researchers can use their time more efficiently.
Program:

Dependencies :

pip install spacy


python -m spacy download en_core_web_sm

Code :

import spacy

# Load the spaCy model for English


nlp = spacy.load("en_core_web_sm")

# Longer example text


text = """
Elon Musk, the CEO of SpaceX and Tesla, was born in Pretoria, South Africa, on June 28, 1971.
In 2002, he founded SpaceX,
with the goal of reducing space transportation costs to enable the colonization of Mars. Tesla, on
the other hand,
is headquartered in Palo Alto, California, and has revolutionized the electric car industry. In
2020, Musk became
the richest person in the world, overtaking Jeff Bezos, the founder of Amazon, whose company
is based in Seattle, Washington.
SpaceX's Falcon 9 rocket launched successfully from Kennedy Space Center in Florida,
marking another achievement in Musk's career.
"""

# Process the text


doc = nlp(text)

# Extract named entities


for entity in doc.ents:
    print(f"Entity: {entity.text}, Label: {entity.label_}")

Output:
Conclusion: Hence we have implemented Named Entity Recognition for the English language.
Experiment No. 8

Aim: Perform top down and bottom up parsing using CFG for English language.

Theory:

Parsing :

Parsing is the process of analyzing a string of symbols, either in natural language or computer
languages, to determine its grammatical structure. In the context of context-free grammars
(CFGs), there are two primary approaches to parsing: top-down and bottom-up parsing.

Top-Down Parsing
Top-down parsing starts from the highest-level rule of the grammar and works down to the
leaves (the actual input string). It tries to find a derivation for the string by recursively expanding
non-terminal symbols.

Example:

Consider the CFG:

S → NP VP
NP → Det N
VP → V NP
Det → 'a' | 'the'
N → 'cat' | 'dog'
V → 'chased' | 'saw'

For the input string "the cat chased a dog", the top-down parser would start with S and try to
match the input step by step.
Bottom-Up Parsing
Bottom-up parsing, on the other hand, starts with the input symbols and attempts to construct the
parse tree by reversing the production rules until it reaches the start symbol.

Example:

Using the same CFG as above, a bottom-up parser would take the input string "the cat chased a
dog" and work its way up to derive S, effectively finding a rightmost derivation in reverse.

Code for Parsing


Below is an implementation of both a top-down and a bottom-up parser in Python, using the CFG defined above.

class CFG:
    def __init__(self):
        self.grammar = {
            'S': [['NP', 'VP']],
            'NP': [['Det', 'N']],
            'VP': [['V', 'NP']],
            'Det': [['the'], ['a']],
            'N': [['cat'], ['dog']],
            'V': [['chased'], ['saw']]
        }

# Top-Down Parser
def top_down_parse(symbol, input_tokens, index, parse_tree):
    if index >= len(input_tokens):
        return False

    # Terminal symbol: try to match it against the current input token
    if symbol not in grammar.grammar:
        if index < len(input_tokens) and input_tokens[index] == symbol:
            parse_tree.append(symbol)
            return True
        return False

    # Non-terminal: try each of its productions
    for production in grammar.grammar[symbol]:
        new_index = index
        subtree = []
        for prod_symbol in production:
            if top_down_parse(prod_symbol, input_tokens, new_index, subtree):
                new_index += 1  # advances by one token per symbol
            else:
                break
        if len(subtree) == len(production):
            parse_tree.append((symbol, subtree))
            return True

    return False

# Bottom-Up Parser (shift-reduce)
def bottom_up_parse(input_tokens):
    stack = []
    for token in input_tokens:
        stack.append(token)            # shift the next token
        while True:                    # reduce while some rule matches the top of the stack
            for symbol, production in grammar.grammar.items():
                for prod in production:
                    if stack[-len(prod):] == prod:
                        stack = stack[:-len(prod)]
                        stack.append(symbol)
                        break
                else:
                    continue
                break
            else:
                break
    return stack[0] if stack == ['S'] else None

# Example Usage
grammar = CFG()

input_string = "the cat chased a dog"
input_tokens = input_string.split()

# Top-Down Parsing
parse_tree = []
if top_down_parse('S', input_tokens, 0, parse_tree):
    print("Top-Down Parse Tree:", parse_tree)
else:
    print("Top-Down Parsing failed.")

# Bottom-Up Parsing
result = bottom_up_parse(input_tokens)
if result:
    print("Bottom-Up Parsing succeeded with root:", result)
else:
    print("Bottom-Up Parsing failed.")

Output:

Top-Down Parsing failed.


Bottom-Up Parsing succeeded with root: S
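
Note that the top-down parser above fails on this grammatical sentence because new_index advances by exactly one token per grammar symbol, even when a non-terminal such as NP actually consumes two tokens. A possible corrected sketch, in which each call returns the number of tokens it consumed (it reuses the CFG class from the program above), is:

# Corrected top-down parser sketch: each call returns (success, tokens_consumed, subtree).
def top_down_parse_fixed(symbol, tokens, index, grammar):
    # Terminal symbol: match it directly against the current token
    if symbol not in grammar:
        if index < len(tokens) and tokens[index] == symbol:
            return True, 1, symbol
        return False, 0, None

    # Non-terminal: try each production, advancing by the tokens actually consumed
    for production in grammar[symbol]:
        consumed, children = 0, []
        for prod_symbol in production:
            ok, used, child = top_down_parse_fixed(prod_symbol, tokens, index + consumed, grammar)
            if not ok:
                break
            consumed += used
            children.append(child)
        else:
            return True, consumed, (symbol, children)
    return False, 0, None

tokens = "the cat chased a dog".split()
ok, used, tree = top_down_parse_fixed('S', tokens, 0, CFG().grammar)
if ok and used == len(tokens):
    print("Top-Down Parse Tree:", tree)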

Conclusion:

In this experiment, we effectively implemented both top-down and bottom-up parsing techniques
using a context-free grammar for the English language.
Experiment No. 9

Aim: To implement a Text Similarity Recognizer using NLP techniques to identify the
similarity between two chosen text documents.

Theory:

Text similarity is a measure used to quantify how similar two text documents are. It is
often used in applications like plagiarism detection, document clustering, and information
retrieval. Some commonly used similarity measures are:

1. Cosine Similarity: Cosine similarity calculates the cosine of the angle between two
vectors representing the text documents. This method is widely used because it is
independent of document length.
2. Jaccard Similarity: This measure compares the shared elements between two sets. For
text documents, these sets could be words or terms.
3. Euclidean Distance: This method measures the straight-line distance between two points
(documents) in a multidimensional space.

Mathematical Representation

For two documents represented as term vectors A and B (or as token sets for the Jaccard measure), the standard definitions are:

● Cosine Similarity: cos(A, B) = (A · B) / (||A|| ||B||)
● Jaccard Similarity: J(A, B) = |A ∩ B| / |A ∪ B|
● Euclidean Distance: d(A, B) = sqrt( Σ_i (A_i - B_i)^2 )

Program:
pip install scikit-learn nltk python-docx PyPDF2

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from docx import Document
import PyPDF2
import nltk

# Download stopwords if not already available
nltk.download('stopwords')

# Function to read and preprocess text from a .docx file
def preprocess_docx(file_path):
    try:
        doc = Document(file_path)
        text = ' '.join([para.text for para in doc.paragraphs])
        return preprocess_text(text)
    except Exception as e:
        print(f"Error processing Word file {file_path}: {e}")
        return ""

# Function to read and preprocess text from a .pdf file
def preprocess_pdf(file_path):
    try:
        text = ""
        with open(file_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            # Extract text from each page of the PDF
            for page in reader.pages:
                text += page.extract_text()
        return preprocess_text(text)
    except Exception as e:
        print(f"Error processing PDF file {file_path}: {e}")
        return ""

# Function to preprocess text: lowercase, remove stopwords, and filter non-alphanumeric words
def preprocess_text(text):
    try:
        text = text.lower()
        stop_words = set(stopwords.words('english'))
        filtered_text = ' '.join([word for word in text.split()
                                  if word.isalnum() and word not in stop_words])
        return filtered_text
    except Exception as e:
        print(f"Error during text preprocessing: {e}")
        return ""

# Specify file paths for your PDF documents on Google Colab
file_path1 = "/content/document1.pdf"
file_path2 = "/content/document2.pdf"

# Preprocess the PDF documents
doc1_cleaned = preprocess_pdf(file_path1)
doc2_cleaned = preprocess_pdf(file_path2)

# Check if both documents have been read correctly
if not doc1_cleaned or not doc2_cleaned:
    print("One or both documents could not be read. Please check the file paths and try again.")
else:
    # Vectorization
    vectorizer = CountVectorizer()
    count_matrix = vectorizer.fit_transform([doc1_cleaned, doc2_cleaned])

    # Compute Cosine Similarity
    cosine_sim = cosine_similarity(count_matrix)[0][1]

    # Output Result
    print(f"Cosine Similarity between the given documents: {cosine_sim:.4f}")

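In addition to cosine similarity, a Jaccard similarity score could be computed on the same cleaned texts with a small helper such as this sketch (it reuses the doc1_cleaned and doc2_cleaned variables from the program above):

# Jaccard similarity over the sets of words of the two cleaned documents.
def jaccard_similarity(text1, text2):
    set1, set2 = set(text1.split()), set(text2.split())
    if not set1 and not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

print(f"Jaccard Similarity between the given documents: {jaccard_similarity(doc1_cleaned, doc2_cleaned):.4f}")
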
‭Output :‬

Conclusion: Hence, we have implemented a Text Similarity Recognizer using NLP techniques
to identify the similarity between two chosen text documents.
Experiment No. 10

Aim : To implement and study the concept of Wordnet.

Theory:

WordNet is a lexical database of the English language that groups words into sets of synonyms called synsets. It
provides short definitions and usage examples and records various semantic relations between these synonym sets,
including hypernyms (more general terms), hyponyms (more specific terms), meronyms (part-whole relationships),
and more.

Key Concepts:
1. Synsets: A group of synonymous words representing a single concept. For example, the
word "car" may belong to the synset that includes "automobile," "motorcar," etc.
2. Relationships:
● Hyponymy: A relationship where one word is a more specific term than another.
E.g., "dog" is a hyponym of "animal."
● Hypernymy: The inverse of hyponymy; it represents more general terms.
● Meronymy: A part-to-whole relationship. For example, "wheel" is a meronym of
"car."
3. Parts of Speech: WordNet categorizes words into nouns, verbs, adjectives, and adverbs.
4. Usage Examples: Each synset may include example sentences illustrating how the word
is used.
Applications of WordNet :
● Natural Language Processing (NLP): Used in applications like sentiment analysis,
information retrieval, and semantic similarity calculations.
● Machine Learning: Provides semantic features for various algorithms.

Code:
pip install nltk

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Function to get synsets of a word
def get_synsets(word):
    synsets = wn.synsets(word)
    return synsets

# Function to display synsets and their definitions
def display_synsets(synsets):
    for synset in synsets:
        print(f'Synset: {synset.name()}')
        print(f'Definition: {synset.definition()}')
        print(f'Examples: {synset.examples()}')
        print('---')

# Example usage
word = 'dog'
synsets = get_synsets(word)
display_synsets(synsets)

# Get hypernyms of the first synset
if synsets:
    hypernyms = synsets[0].hypernyms()
    print(f'Hypernyms of {synsets[0].name()}:')
    for hyper in hypernyms:
        print(hyper.name())
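
Other WordNet relations and similarity measures can be explored in the same way; for example, the short additional sketch below looks at meronyms and a path-based similarity score (it uses the same NLTK WordNet corpus already downloaded above):

car = wn.synset('car.n.01')
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print(car.part_meronyms()[:5])           # some parts of a car (meronyms)
print(dog.path_similarity(cat))          # path-based similarity score in (0, 1]
print(dog.lowest_common_hypernyms(cat))  # their most specific shared hypernym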
Conclusion :
Thus we have studied the concept of WordNet and demonstrated its use.
“Twitter Sentiment
Analysis”

A Mini Project Proposal Submitted in


partial fulfillment of the requirements of the
degree of

BACHELOR OF

ENGINEERING IN

COMPUTER ENGINEERING

Submitted By
Naman Bhalani (02)
Sachin Prajapati (31)
Dhrumil Upadhyay (53)
Shrey Varma (56)

Neelam Phadnis (Assistant Professor, HOD)

DEPARTMENT OF COMPUTER ENGINEERING


Accredited by NBA for 3 years w.e.f. 1st July 2022

SHREE L. R. TIWARI COLLEGE OF ENGINEERING


SHREE L.R. TIWARI EDUCATIONAL CAMPUS, MIRA ROAD (East),
THANE -401 107, MAHARASHTRA.

University of Mumbai
(AY 2024-25)
TABLE OF CONTENTS

1. Introduction
2. Literature Review
3. Implementation
4. Resources
5. Emoticons
6. Unicode
7. Case
8. Targets
9. Negation
10. Sequence of repeated characters
11. Machine learning
12. Naive Bayes
13. Baseline
14. Improvements
15. Conclusion
16. References
Introduction

Sentiment analysis deals with identifying and classifying opinions or sentiments expressed in source
text. Social media generates a vast amount of sentiment-rich data in the form of tweets, status updates,
blog posts, etc. Sentiment analysis of this user-generated data is very useful for knowing the opinion of
the crowd. Twitter sentiment analysis is difficult compared to general sentiment analysis due to the
presence of slang words and misspellings. The maximum number of characters allowed in a tweet
is 140. The knowledge-based approach and the machine learning approach are the two strategies used for
analyzing sentiment in text. In this project, we try to analyze Twitter posts about electronic
products like mobiles, laptops, etc. using a machine learning approach. By doing sentiment analysis in a
specific domain, it is possible to identify the effect of domain information on sentiment classification.
We present a new feature vector for classifying tweets as positive or negative and for extracting people's
opinions about products. In this project I chose to classify tweets from Twitter into "positive" or
"negative" sentiment by building a model based on probabilities. Twitter is a microblogging website
where people can share their feelings quickly and spontaneously by sending tweets limited to 140
characters. You can directly address a tweet to someone by adding the target sign "@" or
participate in a topic by adding a hashtag "#" to your tweet. Because of this usage, Twitter is
a perfect source of data for determining the current overall opinion about almost anything.

Implementation

To gather the data, many options are possible. In some previous research papers, the authors built a
program to automatically collect a corpus of tweets based on two classes, "positive" and
"negative", by querying Twitter with two types of emoticons:
● Happy emoticons, such as ":)", ":P", ": )" etc.
● Sad emoticons, such as ":(", ":'(", "=(".
Others build their own dataset of tweets by collecting and annotating them manually, which is very
long and tedious.
In addition to finding a way of getting a corpus of tweets, we need to take care to have a balanced data
set, meaning we should have an equal number of positive and negative tweets, but it also needs to
be large enough. Indeed, the more data we have, the more we can train our classifier and the higher the
accuracy will be.
After much research, I found a dataset of 1578612 tweets in English coming from two sources:
Kaggle and Sentiment140. It is composed of four columns: ItemID, Sentiment,
SentimentSource and SentimentText. We are only interested in the Sentiment column,
corresponding to our label class taking a binary value (0 if the tweet is negative, 1 if the tweet is
positive), and the SentimentText column containing the tweets in raw format.

Table 1. Example of twitter posts annotated with their corresponding sentiment, 0 if it is negative, 1
if it is positive.

In Table 1, showing the first ten Twitter posts, we can already notice some particularities
and difficulties that we are going to encounter during the preprocessing steps.
● The presence of acronyms such as "bf" or, more complicated, "APL". Does it mean "apple"? Or Apple
(the company)? In this context we have "friend" afterwards, so we could think that the user refers to his
smartphone and therefore to Apple, but what if the word "friend" were not there?
● The presence of sequences of repeated characters such as
"Juuuuuuuuuuuuuuuuussssst", "hmmmm". In general, when we repeat
several characters in a word, it is to emphasize it, to increase its impact.
● The presence of emoticons, ":O", "T_T", ": |" and many more, which give insights about the
user's mood.
● Spelling mistakes and "urban grammar" like "im gunna" or "mi".
● The presence of nouns such as "TV", "New Moon".
Furthermore, we can also add:
● People also indicate their moods, emotions and states between two asterisks, such as *cries*,
*hummin*, *sigh*.
● Negation, such as "can't", "cannot", "don't", "haven't", which we need to handle: in "I don't
like chocolate", "like" in this case is negative.
We could also be interested in the grammatical structure of the tweets, or whether a tweet is
subjective/objective, and so on. As you can see, it is extremely complex to deal with language,
and even more so when we want to analyse text typed by users on the Internet, because people don't
take care to write sentences that are grammatically correct and use a ton of acronyms and words
that are more or less English, in our case. We can visualize the dataset a bit more by making a
chart of how many positive and negative tweets it contains.

Figure 1: Histogram of the tweets according to their sentiment

We have exactly 790177 positive tweets and 788435 negative tweets, which signifies that the dataset
is well balanced. There are also no duplicates.
Finally, let's recall the Twitter terminology, since we are going to have to deal with it in the tweets:
● Hashtag: A hashtag is any word or phrase immediately preceded by the # symbol. When you click
on a hashtag, you'll see other Tweets containing the same keyword or topic.
● @username: A username is how you're identified on Twitter, and is always preceded immediately
by the @ symbol. For instance, Katy Perry is @katyperry.
● MT: Similar to RT (Retweet), an abbreviation for "Modified Tweet." Placed before
the Retweeted text when users manually retweet a message with modifications, for example
shortening a Tweet.
● Retweet: RT, a Tweet that you forward to your followers, is known as a Retweet. Often used to
pass along news or other valuable discoveries on Twitter, Retweets always retain original
attribution.
● Emoticons: Composed using punctuation and letters, they are used to express
emotions concisely, ";) :) ...".
Now that we have the corpus of tweets, we need to use other resources to make the pre-processing
step easier.
Resources

In order to facilitate the pre-processing of the data, we introduce five resources:
● An emoticon dictionary regrouping 132 of the most used emoticons in the western world,
with their sentiment, negative or positive.
● An acronym dictionary of 5465 acronyms with their translation.
● A stop word dictionary corresponding to words which are filtered out before or
after processing of natural language data because they are not useful in our case.
● Positive and negative word dictionaries giving the polarity (sentiment out of context) of words.
● A dictionary of negative contractions and auxiliaries, which will be used to detect negation in
a given tweet, such as "don't", "can't", "cannot", etc.
The introduction of these resources will allow us to make the tweets uniform and remove some of their
complexities, with the acronym dictionary for instance, because a lot of acronyms are used in tweets.
The positive and negative word dictionaries could be useful to increase (or not) the accuracy score
of the classifier. The emoticon dictionary has been built from Wikipedia, with each emoticon
annotated manually. The stop word dictionary contains 635 words such as "the", "of", "without".
Normally they should not be useful for classifying tweets according to their sentiment, but it is
possible that they are.
We also use Python 2.7 (https://www.python.org/), a programming language widely used in
data science, and scikit-learn (http://scikit learn.org/), a very complete and useful machine
learning library containing all the techniques and methods we need, whose website is also full of
well-explained tutorials. With Python, the libraries NumPy (http://www.numpy.org/) and Pandas
(http://pandas.pydata.org/), for manipulating data easily and intuitively, are essential.
Pre-processing

Now that we have the corpus of tweets and all the resources that could be useful, we can pre-process
the tweets. This is a very important step, since all the modifications that we make during this process
will directly impact the classifier's performance. Pre-processing includes cleaning, normalization,
transformation, feature extraction and selection, etc. The result of pre-processing is consistent
and uniform data that is workable, in order to maximize the classifier's performance. All of the tweets
are pre-processed by passing through the following steps in the same order.

Emoticons:

We replace all emoticons by their sentiment polarity tag, ||pos|| or ||neg||, using the emoticon
dictionary. To do the replacement, we pass through each tweet and, using a regex, we find
out whether it contains emoticons; if so, they are replaced by their corresponding polarity tag.

Table 2. Before processing emoticons, list of tweets where some of them


contain emoticons.
Table 3. After processing emoticons, they have been replaced by their corresponding tag
The data set contains 19469 positive emoticons and 11025 negative emoticons.
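
For illustration, the replacement step might look like the following sketch (the emoticon dictionary here contains only a few illustrative entries, not the full 132):

import re

emoticon_polarity = {":)": "||pos||", ":D": "||pos||", ":(": "||neg||", ":'(": "||neg||"}

def replace_emoticons(tweet):
    for emoticon, tag in emoticon_polarity.items():
        # re.escape is needed because emoticons contain regex metacharacters
        tweet = re.sub(re.escape(emoticon), tag, tweet)
    return tweet

print(replace_emoticons("I love this :) but that was sad :("))
# I love this ||pos|| but that was sad ||neg||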

Unicode:


Table 4. Tweets before processing Unicode.

Table 5. Tweets after processing Unicode.


Case:

Letter case is something that can appear useless, but in fact it is really important for distinguishing
proper nouns from other kinds of words: after lowercasing, "General Motors" becomes the same thing
as "general motors", and "MSc" the same as "msc". So reducing all letters to lowercase should normally
be done wisely. In this project, for simplicity, we do not take special care of this, since we assume that it
should not impact the classifier's performance too much.

Table 6. Tweets before processing lowercase.

Table 7. Tweets after processing lowercase.

Targets:

Targets correspond to usernames on Twitter, preceded by the "@" symbol. They are used to address a
tweet to someone or simply to grab their attention. We replace all usernames/targets by the tag
||target||. Notice that the data set contains 735757 targets.
Table 8. Tweets before processing targets.

Table 9. Tweets after processing targets.

Acronyms:

We replace all acronyms with their translation. An acronym is an abbreviation formed from the initial
components of a phrase or a word. Usually these components are individual letters (as in NATO or laser)
or parts of words or names (as in Benelux). Many acronyms are used in our data set of tweets, as you can
see in the following bar chart. At this point, the tweets are tokenized by getting rid of the punctuation
and using split, in order to make the process really fast. We could use the NLTK tokenizer, but it is
definitely much slower (although also more accurate).
Figure 3. Top 20 of acronyms in the data set of tweets
As you can see, “lol”, “u”, “im”, “2” are really often used by users. The table below shows the top 20
acronyms with their translation and their count.

Table 10. Top 20 of acronyms in the data set of tweets with their translation and count

Negation:

We replace all negation words such as "not", "no", "never" by the tag ||not|| using the negation
dictionary, in order to take into account sentences like "I don't like it". Here "like" should not be
considered as positive because of the "don't" before it. To do so, we replace "don't" by ||not|| so that
the word "like" will not be counted as positive. We should add that each time a negation is encountered,
the polarity of the words that follow it and that are contained in the positive and negative word
dictionaries will be reversed: positive becomes negative and negative becomes positive. We will do
this when we try to find positive and negative words.

Figure 4. A tweet before processing negation words.

Figure 5. A tweet after processing negation words.
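
A minimal sketch of this negation handling is shown below (the word lists are illustrative, not the project's actual dictionaries):

negations = {"not", "no", "never", "don't", "can't", "cannot", "won't"}
positive_words = {"like", "love", "good"}
negative_words = {"bad", "hate", "sad"}

def tag_negations(tokens):
    tagged, polarity_flip = [], False
    for tok in tokens:
        if tok.lower() in negations:
            tagged.append("||not||")
            polarity_flip = True          # flip polarity of the following sentiment words
        elif polarity_flip and tok.lower() in positive_words:
            tagged.append(tok + "(neg)")  # a positive word after a negation counts as negative
        elif polarity_flip and tok.lower() in negative_words:
            tagged.append(tok + "(pos)")
        else:
            tagged.append(tok)
    return tagged

print(tag_negations("I don't like it".split()))
# ['I', '||not||', 'like(neg)', 'it']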

Sequence of repeated characters:

Now, we replace all sequences of repeated characters by two characters (e.g: "helloooo" = "helloo")
to keep the emphasized usage of the word.

Table 11. Tweets before processing sequences of repeated characters.


Table 12. Tweets after processing sequences of repeated characters.
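
This step can be done with a single regular expression, as in this sketch:

import re

# Collapse three or more repeated characters to exactly two.
def squeeze_repeats(word):
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

print(squeeze_repeats("Juuuuuussst"))   # Juusst
print(squeeze_repeats("helloooo"))      # helloo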

Machine learning

Once we have applied the different steps of the preprocessing part, we can now focus on the machine
learning part. There are three major models used in sentiment analysis to classify a sentence as positive
or negative: SVM, Naive Bayes and Language Models (N-Gram). SVM is known to be the model giving
the best results, but in this project we focus only on the probabilistic models, Naive Bayes and Language
Models, which have been widely used in this field. Let's first introduce the Naive Bayes model,
which is well known for its simplicity and efficiency for text classification.

Naive Bayes

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on
applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive
Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables
(features/predictors) in a learning problem. Maximum likelihood training can be done by evaluating a
closed-form expression (a mathematical expression that can be evaluated in a finite number of
operations), which takes linear time. It is based on the application of Bayes' rule, given by the following
formula:

Formula (Bayes' rule):

P(C = c | D = d) = P(D = d | C = c) · P(C = c) / P(D = d)

where D denotes the document and C the category (label), d and c are instances of D and C, and
P(D = d) = Σ_c P(D = d | C = c) · P(C = c).
Three event models are commonly used with Naive Bayes:
● The Multivariate Bernoulli Model: also called the binomial model, useful if our feature vectors are
binary (e.g. 0s and 1s). An application can be text classification with a bag-of-words model where the
0s and 1s are "word does not occur in the document" and "word occurs in the document" respectively.
● The Multinomial Model: typically used for discrete counts. In text classification, we extend the
Bernoulli model further by counting the number of times a word w_i appears out of the total number
of words, rather than recording 0 or 1 depending on whether the word occurs or not.
● The Gaussian Model: we assume that the features follow a normal distribution. Instead of discrete
counts, we have continuous features.

Baseline

In every machine learning task, it is always good to have what we call a baseline. It is often a "quick and
dirty" implementation of a basic model for doing a first classification, and based on its accuracy we try to
improve it. We use Multinomial Naive Bayes with Laplace smoothing as the learning algorithm,
representing the classic way of doing text classification. Since we need to extract features from our data
set of tweets, we use the bag-of-words model to represent them. The bag-of-words model is a simplifying
representation of a document, where the document is represented as a bag of its words without taking
the grammar or word order into consideration. In text classification, the count (number of times) each
word appears in a document is used as a feature for training the classifier.

Firstly, we divide the data set into two parts, the training set and the test set. To do this, we first shuffle
the data set to get rid of any order applied to the data; then, from the set of positive tweets and the set of
negative tweets, we take 3/4 of the tweets from each set and merge them together to make the training
set. The rest is used to make the test set. Finally, the size of the training set is 1183958 tweets and the
test set is 394654 tweets. Notice that they are balanced and follow the same distribution as the initial
data set.

Once the training set and the test set are created, we actually need a third set of data called the validation
set. It is really useful because it will be used to validate our model against unseen data and to tune the
possible parameters of the learning algorithm, to avoid underfitting and overfitting for example. We need
this validation set because our test set should be used only to verify how well the model will generalize.
If we used the test set rather than the validation set, our model could be overly optimistic and the results
would be twisted. To make the validation set, there are two main options:
● Split the training set into two parts (80%, 20%) with a ratio 8:2, where each part contains an equal
distribution of example types. We train the classifier with the largest part, and make predictions with the
smaller one to validate the model. This technique works well, but has the disadvantage that our classifier
does not get trained and validated on all the examples in the data set (without counting the test set).
● K-fold cross-validation. We split the data set into k parts, hold out one, combine the others and train
on them, then validate against the held-out portion. We repeat that process k times (each fold), holding
out a different portion each time. Then we average the scores measured for each fold to get a more
accurate estimation of our model's performance.

Figure 6. 10 fold cross validation

We split the training data into 10 folds and cross-validate on them using scikit-learn, as shown in
Figure 6 above. The number of folds K is arbitrary and usually set to 10, but it is not a rule. In fact,
determining the best K is still an open problem: with a lower K, training is computationally cheaper and
the estimate has less variance but more bias; with a large K, it is computationally more expensive and
has higher variance but lower bias. We can now train the naive Bayes classifier with the training set,
validate it on the held-out part of the data taken from the training set (the validation set), repeat this
10 times and average the results to get the final accuracy, which is about 0.77, as shown in the results
below.
Figure 7. Result of the naive bayes classifier with the score representing the average of the results of
each 10 fold cross validation, and the overall confusion matrix.
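
For reference, the baseline pipeline described above (bag of words, Multinomial Naive Bayes with Laplace smoothing, 10-fold cross-validation) could be sketched with a modern scikit-learn API as follows; the file name tweets.csv is a placeholder, and since the original project used Python 2.7, this is only an approximation of the actual implementation:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

df = pd.read_csv("tweets.csv")          # hypothetical path to the preprocessed dataset
X, y = df["SentimentText"], df["Sentiment"]

model = make_pipeline(
    CountVectorizer(),                  # bag-of-words features
    MultinomialNB(alpha=1.0),           # alpha=1.0 corresponds to Laplace smoothing
)

scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
print("Mean accuracy:", scores.mean())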

Improvements

From the baseline, the goal is to improve the accuracy of the classifier, which is 0.77, in order to better
determine which tweets are positive and which are negative. There are several ways of doing this, and
we present only a few possible improvements (or not). First, we could try to remove what are called
stop words. Stop words usually refer to the most common words in the English language (in our case),
such as "the", "of", "to" and so on. They do not carry any valuable information about the sentiment of a
sentence, and it can be useful to remove them from the tweets in order to keep only the words in which
we are interested. To do this we use the list of 635 stop words that we found. In the table below, you can
see the most frequent words in the data set with their counts.

Table 13. Most frequent words in the data set with their corresponding count.
We could also try to stem the words in the data set. Stemming is the process by which endings are
removed from words in order to remove things like tense or plurality. The stemmed form of a word may
not exist in a dictionary (unlike with lemmatization). This technique allows us to unify words and reduce
the dimensionality of the dataset. It is not appropriate for all cases, but it can make it easier to connect
different tenses of the same word and see whether they cover the same subject matter. It is faster than
lemmatization (which removes inflectional endings only and returns the base or dictionary form of a
word, known as the lemma). Using NLTK, a Python library specialized in natural language processing,
we get the following results after stemming the words in the data set:

Figure 8. Result of the naive bayes classifier after stemming.

We actually lose 0.002 in accuracy score compared to the results of the baseline. We conclude that
stemming the words does not improve the classifier's accuracy and in fact does not make any appreciable
change.
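
For reference, the stemming step with NLTK might look like this small sketch (SnowballStemmer is one possible choice; the report does not say which stemmer was actually used):

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
print([stemmer.stem(w) for w in ["loved", "loving", "cats", "running"]])
# ['love', 'love', 'cat', 'run']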

Let's introduce language models to see if we can get better results than those of our baseline. Language
models are models assigning probabilities to sequences of words. Initially they were extensively used in
speech recognition and spelling correction, but it turns out that they also give good results in text
classification.

An important note is that n-gram classifiers are in fact a generalization of Naive Bayes: a unigram
classifier with Laplace smoothing corresponds exactly to the traditional naive Bayes classifier. Since we
use the bag-of-words model, meaning we translate the sentence "I don't like chocolate" into "I", "don't",
"like", "chocolate", we could try to use a bigram model to take care of negation, with "don't like" in this
example. Using bigrams as features in the classifier, we get the following results:

Figure: Results of the naive Bayes classifier with bigram features.

Using only bigram features, we slightly improved our accuracy score, by about 0.01. Based on that,
we can expect that combining unigram and bigram features could increase the accuracy score further.

Figure: Results of the naive Bayes classifier with unigram and bigram features.

Indeed, we slightly increased the accuracy score, by about 0.02 compared to the baseline.
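
A sketch of how the n-gram features could be selected with scikit-learn's CountVectorizer (shown for illustration; the exact implementation used in the project is not given, and get_feature_names_out assumes a recent scikit-learn version):

from sklearn.feature_extraction.text import CountVectorizer

bigram_only = CountVectorizer(ngram_range=(2, 2))   # bigram features only
uni_plus_bi = CountVectorizer(ngram_range=(1, 2))   # unigram and bigram features

print(uni_plus_bi.fit(["I don't like chocolate"]).get_feature_names_out())
# ['chocolate' 'don' 'don like' 'like' 'like chocolate']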

Conclusion

Nowadays, sentiment analysis or opinion mining is a hot topic in machine learning. We are still far
from detecting the sentiments of a corpus of texts very accurately, because of the complexity of the
English language, and even more so if we consider other languages such as Chinese.
In this project we tried to show the basic way of classifying tweets into the positive or negative
category using Naive Bayes as a baseline, and how language models are related to Naive Bayes
and can produce better results. We could further improve our classifier by trying to extract more
features from the tweets, trying different kinds of features, tuning the parameters of the naive
Bayes classifier, or trying another classifier altogether.
References

[1] Alexander Pak, Patrick Paroubek. 2010. Twitter as a Corpus for Sentiment Analysis and Opinion Mining.

[2] Alec Go, Richa Bhayani, Lei Huang. Twitter Sentiment Classification using Distant Supervision.

[3] Jin Bai, Jian-Yun Nie. Using Language Models for Text Classification.

[4] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, Rebecca Passonneau. Sentiment Analysis of Twitter Data.

[5] Fuchun Peng. 2003. Augmenting Naive Bayes Classifiers with Statistical Language Models.
Assignment 2
Problem Statement: The goal of the assignment is to write a tweet tokenizer. The input of
the code will be a set of tweets and the output will be the tokens in each tweet.

Manual tokenization:

Tweet:
1. I’m going to the park! NLP’s great :)
2. Can’t believe it’s already 2024! #TimeFlies @Friend
3. Jones’ car is faster than John’s. LOL!!!
4. R E T W E E T if you agree!
5. U.S.A. is the land of opportunities.
6. "NLP’s future looks bright!" – said @Professor_AI.

Tokenization:
1. ['I', 'am', 'going', 'to', 'the', 'park', '!', 'NLP', "'s", 'great', ':)']
2. ['Can', 'not', 'believe', 'it', 'is', 'already', '2024', '!', '#TimeFlies', '@Friend']
3. ['Jones', "'s", 'car', 'is', 'faster', 'than', 'John', "'s", '.', 'LOL', '!', '!', '!']
4. ['RETWEET', 'if', 'you', 'agree', '!']
5. ['U.S.A.', 'is', 'the', 'land', 'of', 'opportunities', '.']
6. ['"', 'NLP', "'s", 'future', 'looks', 'bright', '!', '"', '–', 'said', '@Professor_AI', '.']

Automated Tokenization:

Code:
import re

def tokenize_tweet(tweet):
    # Define clitic replacements
    clitics = {
        "I’m": "I am", "I'm": "I am", "he's": "he is",
        "she's": "she is", "it's": "it is",
        "we're": "we are", "they're": "they are",
        "can't": "can not", "won't": "will not",
        "he'd": "he would", "she'd": "she would",
        "I'd": "I would", "you'd": "you would",
        "she'll": "she will", "he'll": "he will",
        "I'll": "I will"
    }

    # Replace clitics
    for clitic, replacement in clitics.items():
        tweet = tweet.replace(clitic, replacement)

    # Handle possessive apostrophes: NLP's -> NLP 's
    tweet = re.sub(r"(\w+)'s", r"\1 's", tweet)

    # Convert word-ending apostrophe (Jones' -> Jones 's)
    tweet = re.sub(r"(\w+)'(\s|$)", r"\1 's\2", tweet)

    # Hashtags and mentions are kept attached to their # / @ symbol
    tweet = re.sub(r"([#@])(\w+)", r"\1\2", tweet)

    # Separate punctuation into its own tokens
    tweet = re.sub(r"([\"'.,!?;:()])", r" \1 ", tweet)

    # Handle space-separated tokens like "R E T W E E T" -> RETWEET
    tweet = re.sub(r"\b(\w\s)+\w\b",
                   lambda match: match.group().replace(" ", ""), tweet)

    # Split by whitespace to get tokens
    tokens = tweet.split()
    return tokens

# Example usage: tokenize a set of tweets read from a file
def tokenize_tweets(tweets):
    # Split the input text into individual tweets (one per line)
    tweet_list = tweets.splitlines()
    # Tokenize each tweet and return the tokenized tweets
    tokenized_tweets = [tokenize_tweet(tweet) for tweet in tweet_list]
    return tokenized_tweets

# Read the input from the provided file
file_path = '/content/file2'  # Path to the uploaded file
with open(file_path, 'r') as file:
    tweet_input = file.read()

# Tokenize the tweets from the provided data
tokenized_tweets = tokenize_tweets(tweet_input)

# Output the tokenized tweets
for i, tokens in enumerate(tokenized_tweets):
    print(f"Tweet {i + 1}: {tokens}")

Output:
Tweet 1: ['Camping', 'in', 'Maine', 'for', 'the', 'weekend', '.', 'Hey', 'Dad', ',', 'Mama', 'Loves', 'YOU', ':', 'http',
':', '//www', '.', 'mamapalooza', '.', 'com']
Tweet 2: ['Its', 'american', 'tradition', 'bitch']
Tweet 3: ['@ThroughTheVoid', 'They', 'love', 'it', '!', 'The', 'only', 'pleasure', 'they', 'get', 'in', 'life', '.', 'I',
'actually', 'do', 'that', '.', 'I', 'am', 'sure', 'I', 'hear', 'a', 'tiny', 'squeak', '.', '.', '.', 'Then', 'louder', 'ones']
Tweet 4: ['"', 'RT', '@latti', ':', '@AbsoHilare', 'stop', 'tweeting', 'in', 'church', '!', 'Lol', '<---', '"', '"', 'I', 'tweet',
'because', 'I', 'am', 'happy', ',', 'I', 'tweet', 'because', 'I', 'am', 'free', '"', '"', 'LOL', '!', '"']
Tweet 5: ['Samsung', 'Mini', 'S2', 'portable', 'HDD', 'graced', 'with', 'colors', 'that', 'perfectly', 'match', 'your',
'tacky', 'beach', 'gear', ':', 'Sammy', "'", 's', 'done', 'it', 'aga', '.', '.', 'http', ':', '//tinyurl', '.', 'com/lb5p6m']
Tweet 6: ['@dialloc', 'congrats', 'on', 'finding', 'your', 'way', 'over', '.', 'it', 'may', 'be', 'slow', 'going', 'at', 'first',
'.', 'hang', 'in', 'there', '.', 'it', 'is', 'kinda', 'cool', 'when', 'u', 'get', 'up', 'to', 'speed', '.']
Tweet 7: ['iPhone', 'activation', 'delays', 'continue', ',', 'Apple', 'offers', '$30', 'http', ':', '//twt', '.', 'gs/l3Ki']
Tweet 8: ['RT', '@GoogleAtWork', 'Gmail', 'maximum', 'attachment', 'size', 'now', '25MB', 'http', ':', '//bit', '.',
'ly/62mjw', 'Nice', '!', '!', '!']
Tweet 9: ['RT', '@acfou', 'The', 'Ads', 'Won', 'Awards', 'for', 'Crispin', ';', 'But', 'Did', 'Nothing', 'for', 'Client',
'BurgerKing', "'", 's', 'Sales/Marketshare', '-', 'Big', 'Surprise', '-', 'http', ':', '//ping', '.', 'fm/vw8TI']
Tweet 10: ['Hey', 'doll', '!', 'Great', 'I', 'missed', 'True', 'Blood', 'yday', 'boo', 'lol', 'Rt', '@FrankBanuat78',
'@jhillstephens', 'Hello', 'Sunshine', 'how', 'are', 'u', 'today', '?', ':', '-', ')']
Tweet 11: ['Australian', 'artist', 'Pogo', 'made', 'these', 'free', 'songs', 'primarily', 'from', 'sampled', 'audio',
'from', 'Alice', 'In', 'Wonderland', '.', 'http', ':', '//www', '.', 'last', '.', 'fm/music/Pogo/Wonderland']
Tweet 12: ['@mppritchard', 'they', 'wanted', 'to', 'sell', 'all', 'the', 'preorders', '&', 'then', 'sell', 'all', 'of', 'the',
'ones', 'they', 'had', 'in', 'stock', 'to', 'those', 'that', 'just', 'walked', 'in', '.', 'Can', "'", 't', 'do', 'both']
Tweet 13: ['Incoming', ':', 'Frightened', 'Rabbit', ',', 'Sept', '.', '22', '(', 'Tucson', ')', ':', 'If', 'Fat', 'Cat', 'Records',
'is', 'going', 'to', 'send', 'three', 'great', 'bands', 'from', 'Scot', '.', '.', 'http', ':', '//tinyurl', '.', 'com/nz6xcv']
Tweet 14: ['Hey', '@ginoandfran', 'please', 'greet', 'philip', '!', '(', 'GinoandFran', 'live', '>', 'http', ':', '//ustre', '.',
'am/2YyQ', ')']
Tweet 15: ['Ik', 'weet', 'niet', 'wie', 'er', 'achter', 'de', 'T-Mobile', 'iPhone', 'Twitter', 'zit', 'maar', 'ik', 'vind',
'het', 'niet', 'echt', "'", 'corporate', "'", 's', 'taalgebruik', '.', '.', '.', 'Best', 'vreemd', 'eigenlijk']
Tweet 16: ['Polizei-Sondereinsatz', 'mit', 'Hindernissen', 'http', ':', '//tinyurl', '.', 'com/kv7w7p']
Tweet 17: ['we', 'are', 'watching', 'dr', '.', 'phil', 'classics', '.', 'haha', ':', ')', '&', 'we', 'are', 'learning', 'how', 'to',
'not', 'give', 'mixed', 'signals', '.']
Tweet 18: ['Oh', 'yeah', '.', '.', '.', 'Washtenaw', 'Dairy', 'mint', 'chip', '.', 'Just', 'like', 'when', 'I', 'was', 'a', 'wee',
'lad', '.', 'http', ':', '//twitpic', '.', 'com/88pb1']
Tweet 19: ['RT', '@TheTrillYoungB', ':', 'Download', 'my', 'new', 'single', 'True', 'Religion', '!', '!', '!', 'http', ':',
'//tinyurl', '.', 'com/ynctruereligion']
Tweet 20: ['Show', 'support', 'for', 'democracy', 'in', 'Iran', 'add', 'green', 'ribbon', 'to', 'your', 'Twitter', 'avatar',
'with', '1-click', '-', 'http', ':', '//helpiranelection', '.', 'com/']
Tweet 21: ['"', '@shanti45', ':', '"', '"', 'Only', 'just', 'realised', 'I', 'like', 'these', '.', '.', '.', 'lol', '"', '"', '♫', 'http', ':',
'//blip', '.', 'fm/~8szom', '"']
Tweet 22: ['Listening', '@DannyAkin', 'speak', 'at', 'sebts', 'luncheon', '.', 'Exciting', 'times', 'in', '#sbc2009',
'http', ':', '//twitpic', '.', 'com/8amua']
Tweet 23: ['@careyd', 'try', 'OmniFocus', 'for', 'the', 'iPhone', '(', '&Mac', 'if', 'you', 'have', 'it', ')', '.', 'I', "'",
've', 'saved', 'so', 'much', 'time', 'with', 'it', 'I', 'have', 'time', 'to', 'recommend', 'it', 'on', 'Twitter', '.', 'Sam']
Tweet 24: ['"', 'RT', '@Shoq', ':', '"', '"', '.', '@DougCurran', 'See', 'Replies', 'discussed', 'here', ':', 'http', ':',
'//bit', '.', 'ly/shoqtips', '"', '"', '//', 'How', '2', 'get', 'around', 'the', 'replies', 'issue', 'on', 'twitter', '--', '"']
Tweet 25: ['i', 'love', 'mia', 'michaels', 'and', 'randi', '&', 'evan']
Tweet 26: ['MonksDen', 'berrybck', ':', 'only', 'trick', 'is', 'the', 'friggin', 'connectors', '.', 'he', 'may', ':',
'berrybck', 'http', ':', '//tinyurl', '.', 'com/mthfxj']
Tweet 27: ['DEU', 'MILEYYY', 'meigo/']
Tweet 28: ['RT', '@aminjavan', ':', '@rishaholmes', 'thank', 'you', 'for', 'your', 'support', 'thank', 'you', '(', 'no',
'problem', '!', ':', ')']
Tweet 29: ['@lululovesbombay', 'I', 'love', 'breakfast', 'foods', 'at', 'pretty', 'much', 'any', 'time', 'of', 'the', 'day',
'hence', 'all', 'the', 'ideas', ':', ')']
