NLP Merged
Name: __________________________________________________________
Semester: VII (Seventh) Roll No.: ___________________
_______________
Student’s Signature
DEPARTMENT OF COMPUTER ENGINEERING
INSTRUCTION FOR STUDENTS
Students shall read the points given below for understanding the theoretical concepts and
practical applications.
1) Listen carefully to the lecture given by the teacher about the importance of the subject, the
curriculum philosophy, the learning structure, the skills to be developed, information about
equipment, instruments, procedures, the method of continuous assessment, the tentative plan of
work in the laboratory, and the total amount of work to be done in a semester.
2) Students shall undertake a study visit of the laboratory to become familiar with the types of
equipment, instruments, and software to be used, before performing experiments.
3) Read the write up of each experiment to be performed, a day in advance.
4) Organize the work in the group and make a record of all observations.
5) Understand the purpose of experiment and its practical implications.
6) Write the answers to the questions allotted by the teacher, during practical hours if
possible, or immediately afterwards.
7) Students should not hesitate to ask about any difficulty faced during the conduct of a
practical/exercise.
8) The student shall study all the questions given in the laboratory manual and practice to
write the answers to these questions.
9) Student shall develop maintenance skills as expected by the industries.
10) Students should develop the habit of peer/group discussion related to the
experiments/exercises so that exchange of knowledge and skills can take place.
11) Student shall attempt to develop related hands-on-skills and gain confidence.
12) Student shall focus on development of skills rather than theoretical or codified
knowledge.
13) Students shall visit nearby workshops, workstations, industries, laboratories, technical
exhibitions, trade fairs, etc., even those not included in the lab manual. In short, students should
gain exposure to their area of work while still students.
14) Students shall insist on the completion of the recommended laboratory work, industrial
visits, answers to the given questions, etc.
15) Students shall develop the habit of evolving more ideas, innovations, skills, etc. beyond those
included in the scope of the manual.
16) Students shall refer to technical magazines, proceedings of seminars, and websites
related to the scope of the subject, and update their knowledge and skills.
17) Students should develop the habit of not depending totally on teachers, and should develop
self-learning techniques.
18) Students should develop the habit of interacting with the teacher without hesitation with respect
to the academics involved.
19) Students should develop the habit of submitting the practicals and exercises continuously and
progressively on the scheduled dates and should get the assessment done.
20) Students should be well prepared while submitting the write-up of an exercise. This will
maintain the continuity of their studies, and he/she will not be overloaded at the end of the
term.
GUIDELINES FOR TEACHERS
Teachers shall discuss the following points with students before start of practicals of the subject.
1) Learning Overview: To develop better understanding of importance of the subject. To
know related skills to be developed such as Intellectual skills and Motor skills.
2) Learning Structure: In this, topic and sub topics are organized in systematic way so that
ultimate purpose of learning the subject is achieved. This is arranged in the form of fact,
concept, principle, procedure, application and problem.
3) Know your Laboratory Work: To understand the layout of the laboratory, the specifications of
equipment/instruments/materials, procedures, working in groups, planning of time, etc.
Also to know the total amount of work to be done in the laboratory.
4) Teachers shall ensure that the required equipment is in working condition before the start of an
experiment, and shall keep the operating instruction manuals available.
5) Explain prior concepts to the students before starting each experiment.
6) Involve students actively during the conduct of each experiment.
7) While taking reading/observation each student shall be given a chance to perform or
observe the experiment.
8) If the experimental set up has variations in the specifications of the equipment, the
teachers are advised to make the necessary changes, wherever needed.
9) Teachers shall assess the performance of students continuously as per the norms prescribed
by the University of Mumbai and the guidelines provided by the IQAC.
10) Teachers should ensure that the respective skills and competencies are developed in the
students after the completion of each practical exercise.
11) Teachers are expected to communicate to students the skills and competencies to be developed.
12) Teacher may provide additional knowledge and skills to the students even though not
covered in the manual but are expected from the students by the industries.
13) Teachers shall ensure that industrial visits if recommended in the manual are covered.
14) Teacher may suggest the students to refer additional related literature of the Technical
papers/Reference books/Seminar proceedings, etc.
15) During assessment, the teacher is expected to ask students questions to gauge their
achievement of the related knowledge and skills, so that students can prepare accordingly while
submitting the record of their practicals. Focus should be given to the development of the enlisted
skills rather than theoretical/codified knowledge.
16) Teacher should enlist the skills to be developed in the students that are expected by the
industry.
17) Teacher should organize Group discussions /brain storming sessions / Seminars to
facilitate the exchange of knowledge amongst the students.
18) Teacher should ensure that revised assessment norms are followed simultaneously and
progressively.
19) Teacher should give more focus on hands on skills and should actually share the same.
20) Teachers shall also refer to the circulars related to practical supervision and assessment
for additional guidelines.
DEPARTMENT OF COMPUTER ENGINEERING
Student’s Progress Assessments
Student Name: __________________________________ Roll No.: ______________________
Class/Semester: BE CS/SEM-VII Academic Year: 2024-25
Course Name: Natural Language Processing Lab Course Code: CSDL7013
Assessment Parameters for Practicals/Assignments
Columns: Exp. No. | Title of Experiment | PE (Out of 3) | KT (Out of 3) | DR (Out of 3) | DN (Out of 3) | PL (Out of 3) | Total (out of 15) | Average (out of 5) | COs Covered
1. To study and implement pre-processing of texts.
2. To study and implement pre-processing of documents.
3. Perform morphological analysis and word generation for any given text.
4. To generate N-grams from sentences for English and any Indian language.
5. Perform POS tagging for English and Hindi using a tagger.
6. Perform chunking of text in English language.
7. Perform Named Entity Recognition for English language.
8. Perform top-down and bottom-up parsing using CFG for English language.
9. To implement a text similarity recognizer using NLP techniques.
10. To study and implement the concept of WordNet.
11. Mini-Project
Average Marks
Criteria for Grading – Preparedness and Efforts (PE), Knowledge of Tools (KT), Debugging and Results (DR),
Documentation (DN), Punctuality & Lab Ethics (PL)
Columns: Assignments | TS (Out of 3) | OM (Out of 3) | NT (Out of 3) | IS (Out of 3) | Total (out of 12) | Average (out of 5) | COs Covered
Assignment No. 1
Assignment No. 2
Assignment No. 3
Assignment No. 4
Assignment No. 5
Average
Criteria for Grading – Timely Submission (TS), Originality of the Material (OM), Neatness (NT), Innovative Solution (IS)
Grades – Meets Expectations (3 Marks), Moderate Expectations (2 Marks), Below Expectations (1 Mark)
11. Mini-Project – Average Marks (Out of 10)
Columns: Sr. No. | Assignment | Page No. | Date of Display | Date of Completion | Assessment (Out of 12) | Teacher's Signature and Remark | CO Covered
1 Assignment No. 1
2 Assignment No. 2
3 Assignment No. 3
4 Assignment No. 4
5 Assignment No. 5
Average Marks (Out of 12)
Converted Marks (Out of 5) (B)
Assessment of Mini-Project (C)
Columns: Sr. No. | Mini-Project | Page No. | Date of Display | Date of Completion | Assessment (Out of 18) | Teacher's Signature and Remark | CO Covered
1 Mini-Project
Average Marks (Out of 18)
Converted Marks (Out of 5) (C)
PSO-2 (Computer Engineering knowledge and skills): The graduate should be able to adapt Computer Engineering
knowledge and skills to create career paths in industries or business organizations or institutes of repute.
DEPARTMENT OF COMPUTER ENGINEERING
Judge your ability with regard to the following points by putting a tick (√) on a scale of 1 (lowest) to 5 (highest),
based on the knowledge and skills you attained from this course.
Columns: Sr. No. | Your ability to | 1 (Lowest) | 2 | 3 | 4 | 5 (Highest)
______________ _______________
Student’s Signature Date
Experiment No. 1
Aim: To study and implement Preprocessing of text (Tokenization, Filtration, Script Validation)
Take any paragraph in English as well as in any other natural language (Hindi/Marathi), perform
the following preprocessing steps, and attach the original text and the output.
1. Tokenization
2. To lowercase
3. Remove numbers
4. Replace numbers by corresponding number words
5. Remove punctuation
6. Remove whitespaces
Theory:
Text preprocessing is a fundamental step in Natural Language Processing (NLP), transforming
raw text into a format suitable for machine learning or language models. It ensures that the data is
clean, structured, and ready for analysis. Three essential steps in text preprocessing are
tokenization, filtration, and script validation.
1. Tokenization: Tokenization involves breaking down text into smaller units called tokens,
which could be words, sentences, or subwords. For example, in English, a sentence like “I
love programming” can be tokenized into individual words: ["I", "love", "programming"].
Similarly, in Hindi or Marathi, "मुझे पढ़ना पसंद है " (Hindi for "I like reading") would be
tokenized into ["मुझे", "पढ़ना", "पसंद", "है "]. Tokenization is crucial as it helps machines
understand the boundaries of words and sentences. There are different approaches to
tokenization such as word-level, sentence-level, or even subword tokenization (Byte Pair
Encoding, WordPiece) for languages where word boundaries are unclear.
2. Filtration: Filtration is the process of removing unnecessary elements such as stop words,
punctuation marks, and other irrelevant characters from the text. Stop words are commonly
used words that usually don’t carry significant meaning (e.g., "the," "is," "and"). By
filtering these out, we can focus on the meaningful parts of the text. For example, from the
sentence “The cat is sitting on the mat,” filtration would remove "The," "is," and "on" to
retain: ["cat", "sitting", "mat"]. This process enhances computational efficiency without
losing the context. Similarly, for non-English languages, stop words in Hindi or Marathi
like "है", "के", "में" (meaning "is", "of", "in") can be filtered out.
3. Script Validation: Script validation ensures that the text adheres to the script norms of the
language being processed. For multilingual environments, it is essential to ensure that non-
English texts are written in the correct script (e.g., Devanagari for Hindi or Marathi). This
step also involves verifying the language model's compatibility with the input script.
Inconsistent scripts (e.g., mixing Latin characters with Devanagari) may lead to errors in
downstream NLP tasks.
Example:
English text: Original: "I love to read and write code in Python!" Tokenized: ["I", "love", "to",
"read", "and", "write", "code", "in", "Python"] Filtered: ["love", "read", "write", "code", "Python"]
Hindi text: Original: "मुझे पढ़ना और कोड लिखना पसंद है !" Tokenized: ["मुझे", "पढ़ना", "और",
"कोड", "लिखना", "पसंद", "है"] Filtered: ["पढ़ना", "कोड", "लिखना", "पसंद"]
Program 1 (Tokenization):
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

text = ("Assessment consists of two class tests of 20 marks each. The first class test is to be "
        "conducted when approx. 40% syllabus is completed and second class test when additional 40% "
        "syllabus is completed. Duration of each test shall be one hour.")
print(sent_tokenize(text))
print(word_tokenize(text))
Output 1 :
Output 2:
Output 4:
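Programs 2–4 (converting to lowercase, removing numbers, and replacing numbers with number words)
were not captured above. The following is a minimal combined sketch, assuming the third-party
inflect package for number-to-word conversion; the input string is only an illustration:
import re
import inflect  # assumed helper package (pip install inflect)

text = "There are 3 classes and 20 students in Room 5"  # example input

# Program 2: convert to lowercase
print(text.lower())

# Program 3: remove numbers
print(re.sub(r'\d+', '', text))

# Program 4: replace numbers by corresponding number words
p = inflect.engine()
print(' '.join(p.number_to_words(w) if w.isdigit() else w for w in text.split()))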
Program 5 (Remove punctuation):
import string

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

input_str = ("Hey, did you know that the summer break is coming? Amazing right !! "
             "It's only 5 more days !!")
print(remove_punctuation(input_str))
Output 5:
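Program 6 (removing extra whitespace) was also not captured; a minimal sketch with an illustrative input:
import re

input_str = "  Natural   Language    Processing  "  # example input
# Strip leading/trailing whitespace and collapse internal runs of spaces
print(re.sub(r'\s+', ' ', input_str).strip())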
Output 6:
Experiment No. 2
Aim: To study and implement preprocessing of documents.
Attach both the original document and the preprocessed document.
Theory:
1. Natural Language Processing (NLP): NLP is a field of artificial intelligence focused on the
interaction between computers and humans through natural language. It involves the application
of computational techniques to analyze and synthesize natural language.
2. Preprocessing in NLP: Preprocessing is a crucial step in NLP that transforms raw text into a
form that can be analyzed by machine learning models. Common preprocessing steps include stop
word removal, stemming, and lemmatization.
3. Stop Word Removal: Stop words are common words like 'the', 'is', 'in', and 'at' which often do
not contribute much to the meaning of a sentence and are removed during preprocessing.
4. Stemming: Stemming is the process of reducing words to their root form. Common algorithms
for stemming include:
Porter Stemmer: A widely used stemming algorithm that works by applying a series of
rules to strip suffixes from words.
Lancaster Stemmer: An aggressive stemming algorithm known for its simplicity.
Snowball Stemmer: Also known as the Porter2 stemmer, it’s an improvement over the
original Porter stemmer.
5. Lemmatization: Lemmatization reduces words to their base form (lemma) based on dictionary
definitions, unlike stemming which trims words mechanically. It ensures that words are
transformed into meaningful root forms (e.g., 'better' to 'good').
Program 1 :
!pip install nltk PyPDF2
from google.colab import files
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import PyPDF2
import io
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
def extract_text_from_pdf(pdf_file):
    text = ""
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        text += page.extract_text() + " "
    return text

def remove_stopwords(text):
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    return filtered_tokens

def perform_stemming(tokens):
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    snowball = SnowballStemmer("english")
    porter_stems = [porter.stem(word) for word in tokens]
    lancaster_stems = [lancaster.stem(word) for word in tokens]
    snowball_stems = [snowball.stem(word) for word in tokens]
    return porter_stems, lancaster_stems, snowball_stems

def perform_lemmatization(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(word) for word in tokens]
    return lemmas

uploaded = files.upload()
for filename in uploaded.keys():
    with io.BytesIO(uploaded[filename]) as pdf_file:
        pdf_text = extract_text_from_pdf(pdf_file)

filtered_words = remove_stopwords(pdf_text)
print("Filtered Tokens (Stop Words Removed):", filtered_words)

porter_stems, lancaster_stems, snowball_stems = perform_stemming(filtered_words)
print("\nPorter Stems:", porter_stems)
print("Lancaster Stems:", lancaster_stems)
print("Snowball Stems:", snowball_stems)

lemmas = perform_lemmatization(filtered_words)
print("\nLemmatized Words:", lemmas)
Output 1 :
The Benefits of Nature.pdf(application/pdf) - 362548 bytes, last modified: 9/27/2024 - 100% done
Saving The Benefits of Nature.pdf to The Benefits of Nature.pdf
Filtered Tokens (Stop Words Removed): ['Benefits', 'Nature', 'Nature', 'essential', 'human', 'well', '-
being', '.', 'Spending', 'time', 'outdoo', 'rs', 'significantly', 'improve', 'mental', 'health', '.', 'Research',
'shows', 'nature', 'reduces', 'stress', ',', 'anxiety', ',', 'depression', '.', 'Additionally', ',', 'nature',
'encourages', 'physical', 'activity', ',', 'promoting', 'healthier', 'lifestyle', '.', 'Activities', 'like', 'hiking',
',', 'biking', ',', 'simply', 'walking', 'park', 'boost', 'cardiovascular', 'health', 'improve', 'mood', '.',
'Moreover', ',', 'exposure', 'natural', 'environments', 'enhances', 'creativity', 'cognitive', 'function', '.',
'Thus', ',', 'incorporating', 'nature', 'daily', 'life', 'beneficial', 'physical', 'mental', 'health', '.',
'Conclusion', 'Embracing', 'nature', 'crucial', 'balanced', 'life', '.', 'Whether', "'s", 'stroll', 'park',
'weekend', 'hike', ',', 'make', 'time', 'great', 'outdoors', '!']
Porter Stems: ['benefit', 'natur', 'natur', 'essenti', 'human', 'well', '-be', '.', 'spend', 'time', 'outdoo', 'rs',
'significantli', 'improv', 'mental', 'health', '.', 'research', 'show', 'natur', 'reduc', 'stress', ',', 'anxieti', ',',
'depress', '.', 'addit', ',', 'natur', 'encourag', 'physic', 'activ', ',', 'promot', 'healthier', 'lifestyl', '.', 'activ',
'like', 'hike', ',', 'bike', ',', 'simpli', 'walk', 'park', 'boost', 'cardiovascular', 'health', 'improv', 'mood', '.',
'moreov', ',', 'exposur', 'natur', 'environ', 'enhanc', 'creativ', 'cognit', 'function', '.', 'thu', ',', 'incorpor',
'natur', 'daili', 'life', 'benefici', 'physic', 'mental', 'health', '.', 'conclus', 'embrac', 'natur', 'crucial',
'balanc', 'life', '.', 'whether', "'s", 'stroll', 'park', 'weekend', 'hike', ',', 'make', 'time', 'great', 'outdoor',
'!']
Lancaster Stems: ['benefit', 'nat', 'nat', 'ess', 'hum', 'wel', '-being', '.', 'spend', 'tim', 'outdoo', 'rs', 'sign',
'improv', 'ment', 'heal', '.', 'research', 'show', 'nat', 'reduc', 'stress', ',', 'anxy', ',', 'depress', '.', 'addit', ',',
'nat', 'enco', 'phys', 'act', ',', 'promot', 'healthy', 'lifestyl', '.', 'act', 'lik', 'hik', ',', 'bik', ',', 'simply', 'walk',
'park', 'boost', 'cardiovascul', 'heal', 'improv', 'mood', '.', 'moreov', ',', 'expos', 'nat', 'environ', 'enh',
'cre', 'cognit', 'funct', '.', 'thu', ',', 'incorp', 'nat', 'dai', 'lif', 'benef', 'phys', 'ment', 'heal', '.', 'conclud',
'embrac', 'nat', 'cruc', 'bal', 'lif', '.', 'wheth', "'s", 'stroll', 'park', 'weekend', 'hik', ',', 'mak', 'tim', 'gre',
'outdo', '!']
Snowball Stems: ['benefit', 'natur', 'natur', 'essenti', 'human', 'well', '-be', '.', 'spend', 'time', 'outdoo',
'rs', 'signific', 'improv', 'mental', 'health', '.', 'research', 'show', 'natur', 'reduc', 'stress', ',', 'anxieti', ',',
'depress', '.', 'addit', ',', 'natur', 'encourag', 'physic', 'activ', ',', 'promot', 'healthier', 'lifestyl', '.', 'activ',
'like', 'hike', ',', 'bike', ',', 'simpli', 'walk', 'park', 'boost', 'cardiovascular', 'health', 'improv', 'mood', '.',
'moreov', ',', 'exposur', 'natur', 'environ', 'enhanc', 'creativ', 'cognit', 'function', '.', 'thus', ',', 'incorpor',
'natur', 'daili', 'life', 'benefici', 'physic', 'mental', 'health', '.', 'conclus', 'embrac', 'natur', 'crucial',
'balanc', 'life', '.', 'whether', "'s", 'stroll', 'park', 'weekend', 'hike', ',', 'make', 'time', 'great', 'outdoor',
'!']
Lemmatized Words: ['Benefits', 'Nature', 'Nature', 'essential', 'human', 'well', '-being', '.', 'Spending',
'time', 'outdoo', 'r', 'significantly', 'improve', 'mental', 'health', '.', 'Research', 'show', 'nature',
'reduces', 'stress', ',', 'anxiety', ',', 'depression', '.', 'Additionally', ',', 'nature', 'encourages', 'physical',
'activity', ',', 'promoting', 'healthier', 'lifestyle', '.', 'Activities', 'like', 'hiking', ',', 'biking', ',', 'simply',
'walking', 'park', 'boost', 'cardiovascular', 'health', 'improve', 'mood', '.', 'Moreover', ',', 'exposure',
'natural', 'environment', 'enhances', 'creativity', 'cognitive', 'function', '.', 'Thus', ',', 'incorporating',
'nature', 'daily', 'life', 'beneficial', 'physical', 'mental', 'health', '.', 'Conclusion', 'Embracing', 'nature',
'crucial', 'balanced', 'life', '.', 'Whether', "'s", 'stroll', 'park', 'weekend', 'hike', ',', 'make', 'time', 'great',
'outdoors', '!']
Program 2:
!pip install nltk PyPDF2
from google.colab import files
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import PyPDF2
import io
nltk.download('punkt')
hindi_stopwords = {'एक', 'से', 'है', 'का', 'के', 'लिए', 'यह', 'और', 'में', 'को', 'तो', 'की', 'पर'}

def extract_text_from_pdf(pdf_file):
    text = ""
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        text += page.extract_text() + " "
    return text

def hindi_stemmer(word):
    suffixes = ['ता', 'ने', 'ना', 'ो', 'े', 'ी', 'ा', 'ें', 'ों', 'ीं']  # common Hindi suffixes (reconstructed)
Output 2:
Original Tokens (Hindi), Filtered Tokens (Stop Words Removed), and Stemmed Tokens for the Hindi
"Benefits of Nature" document (the Devanagari output could not be reproduced legibly here).
Conclusion:
In this experiment, we successfully implemented document preprocessing techniques, including
stop word removal and stemming, for both English and Hindi texts. These methods help in
reducing the dimensionality of textual data and preparing it for further natural language processing
tasks. The results demonstrate the effectiveness of preprocessing in simplifying text while
retaining its essential meaning.
Experiment No. 3
Aim: Perform morphological analysis using various stemmers for English as well as any Indian
language.
Theory :
Morphological Analysis: Morphological analysis is the study of the structure and form of words
in a language. It involves examining the internal structure of words and how they can be modified
to convey different meanings. The smallest units of meaning within words are called morphemes.
Morphological analysis can be divided into two main categories:
1. Inflectional Morphology: This modifies a word to express grammatical features such as tense,
number or gender without changing its word class (e.g., walk → walked).
2. Derivational Morphology: This involves creating new words by adding prefixes and suffixes,
often changing the word class (e.g., happy → happiness).
Porter Stemmer : The Porter Stemmer is one of the most well-known stemming algorithms. It
was developed by Martin Porter in 1980 and is specifically designed for the English language. The
algorithm applies a series of rules (a set of heuristics) to reduce inflected words to their base forms.
Algorithm Process: The Porter Stemmer consists of five ordered groups of rules. The early steps strip
plurals and simple suffixes such as -ed and -ing, the middle steps map longer derivational suffixes to
shorter ones (e.g., -ization → -ize), and the later steps, from Step 3 onwards, remove any remaining
suffixes while considering the length and structure of the word.
Snowball Stemmer : The Snowball Stemmer is an improved version of the Porter Stemmer,
developed by Martin Porter as well. It offers a more extensive and flexible approach to stemming,
supporting multiple languages. The Snowball framework includes algorithms for not only English
but also languages like French, Spanish, German, and many others.
Algorithm Process: The Snowball Stemmer employs similar principles as the Porter Stemmer but
includes additional rules and heuristics to cater to various languages. It is more efficient and
produces better results by reducing the ambiguity associated with stemming.
Indic Stemmer : The Indic Stemmer is specifically designed for Indian languages, including
Hindi, Marathi, Bengali, and others. These languages exhibit complex morphological structures
and rich inflections, making traditional stemming algorithms like Porter and Snowball less
effective.
Algorithm Process: The Indic Stemmer applies a set of rules tailored to the unique characteristics
of Indian languages. It handles:
● Inflectional Forms: Recognizing and reducing various forms of verbs, nouns, and adjectives.
● Compound Words: Decomposing compound words into their constituent morphemes.
Code :
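The program for this experiment was not captured. Below is a minimal sketch, assuming NLTK's Porter,
Lancaster and Snowball stemmers for English, and an illustrative rule-based suffix stripper standing in
for the Indic stemmer described above; the example inputs and the Hindi suffix list are assumptions:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

english_text = "The children were playing happily in the gardens"  # example input
hindi_words = ['लड़कियाँ', 'खेलता', 'किताबें', 'अच्छाई']  # example input

porter, lancaster, snowball = PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")
for w in word_tokenize(english_text):
    print(w, "->", porter.stem(w), lancaster.stem(w), snowball.stem(w))

# Illustrative rule-based stemmer for Hindi: strip a few common suffixes
hindi_suffixes = ['ियाँ', 'ियों', 'ाएँ', 'ता', 'ना', 'ने', 'ें', 'ों', 'ी', 'ा', 'े']
def simple_hindi_stem(word):
    for suf in hindi_suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[:-len(suf)]
    return word

for w in hindi_words:
    print(w, "->", simple_hindi_stem(w))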
Output :
Conclusion : Hence we have implemented morphological analysis using various stemmers for
English as well as any Indian language.
Experiment No. 4
Aim : To generate N grams from sentences for English and any Indian language.
Theory :
N-grams are contiguous sequences of n items from a given sample of text or speech. They are
widely used in NLP tasks such as text analysis, language modeling, and information retrieval. An
N-gram of size 1 is referred to as a "unigram", size 2 as a "bigram", and size 3 as a "trigram". For
example:
For the sentence: "I love NLP"
Unigrams: ["I", "love", "NLP"]
Bigrams: ["I love", "love NLP"]
Trigrams: ["I love NLP"]
N-grams capture the local context within a text, enabling a better understanding of word
dependencies and patterns. They are crucial for machine learning models that rely on text features.
For this experiment, we will use English and Hindi as the target languages.
Program :
The following code uses Python along with the nltk library for generating N-grams from sentences
in English and Hindi:
import nltk
from nltk import ngrams
from collections import Counter
nltk.download('punkt')

def generate_ngrams(sentence, n):
    # Tokenize the sentence and build contiguous n-grams
    tokens = nltk.word_tokenize(sentence)
    n_grams = list(ngrams(tokens, n))
    return n_grams

# Example sentences (any English/Hindi text can be used) and n-gram size
english_sentence = "I love natural language processing"
hindi_sentence = "मुझे प्राकृतिक भाषा प्रसंस्करण पसंद है"
n = 2

english_bigrams = generate_ngrams(english_sentence, n)
print(f"English {n}-grams: {english_bigrams}")
hindi_bigrams = generate_ngrams(hindi_sentence, n)
print(f"Hindi {n}-grams: {hindi_bigrams}")
Output :
Conclusion : The experiment successfully generates N-grams for sentences in English and Hindi,
demonstrating the use of NLP techniques to extract meaningful word sequences.
Experiment No. 5
Aim: Perform POS tagging for English and Hindi using a tagger.
Theory:
Part-of-Speech (POS) Tagging is the process of assigning a part of speech label to each word in a
given text based on its context and role in the sentence. POS tagging is an essential step in
various Natural Language Processing (NLP) applications because it helps in understanding the
syntactic and grammatical structure of a sentence.
What is a Part of Speech?
A part of speech is a category to which a word is assigned based on its syntactic function.
Examples of parts of speech include:
1. Noun (NN)
2. Verb (VB)
3. Adjective (JJ)
4. Adverb (RB)
5. Pronoun (PRP)
6. Preposition (IN)
7. Conjunction (CC)
8. Determiner (DT)
How does POS Tagging work?
POS tagging typically relies on the context of the word in a sentence and a probabilistic model or
set of rules to decide its part of speech. Two popular approaches are:
Rule-Based POS Tagging:
Uses predefined grammar rules to determine the POS tag.
E.g., if a word ends in "-ing," it is likely to be a verb (e.g., "running").
Statistical POS Tagging:
Uses machine learning models like Hidden Markov Models (HMM) or neural networks.
These models are trained on large tagged corpora and learn the probability distribution of word
sequences.
Hybrid POS Tagging:
Combines rule-based and statistical approaches for higher accuracy.
Example of POS Tagging :
"The quick brown fox jumps over the lazy dog."
In this sentence:
"The" is tagged as a Determiner (DT).
"quick" and "brown" are both tagged as Adjectives (JJ).
"fox" is tagged as a Noun (NN).
"jumps" is tagged as a Verb (VBZ) in the 3rd person singular form.
"over" is tagged as a Preposition (IN).
"the" is again tagged as a Determiner (DT).
"lazy" is tagged as an Adjective (JJ).
"dog" is tagged as a Noun (NN).
Code:
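The definitions of english_pos_tagging and hindi_pos_tagging were not captured. A minimal sketch,
assuming NLTK's pos_tag for English and a TnT tagger trained on the Hindi portion of NLTK's
Indian-languages corpus for Hindi; the sample sentences are illustrative:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('indian')
from nltk.corpus import indian
from nltk.tag import tnt

def english_pos_tagging(text):
    # Tokenize and tag with NLTK's default English tagger
    return nltk.pos_tag(nltk.word_tokenize(text))

# Train a TnT tagger on NLTK's Hindi POS-tagged corpus
hindi_tagger = tnt.TnT()
hindi_tagger.train(indian.tagged_sents('hindi.pos'))

def hindi_pos_tagging(text):
    return hindi_tagger.tag(nltk.word_tokenize(text))

# Example inputs (illustrative)
english_text = "The quick brown fox jumps over the lazy dog."
hindi_text = "मुझे किताबें पढ़ना पसंद है"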
english_tags = english_pos_tagging(english_text)
hindi_tags = hindi_pos_tagging(hindi_text)
print("English POS Tags:", english_tags)
print("Hindi POS Tags:", hindi_tags)
Output:
Conclusion:
In this experiment, we successfully performed Part-of-Speech (POS) tagging for both English
and Hindi using two different NLP libraries.
Experiment No. 6
Aim: Perform chunking of text in the English language.
Theory: Chunking is defined as the process of natural language processing used to identify parts
of speech and short phrases present in a given sentence. Recalling our good old English grammar
classes back in school, note that there are eight parts of speech namely the noun, verb, adjective,
adverb, preposition, conjunction, pronoun, and interjection. Also, in the above definition of
chunking, short phrases refer to the phrases formed by including any of these parts of speech.
For example, chunking can be done to identify and thus group noun phrases or nouns alone,
adjectives or adjective phrases, and so on. Consider the sentence below:
"I had burgers and pastries for lunch."
In this case, if we wish to group or chunk noun phrases, we will get "burgers", "pastries" and
"lunch", which are the nouns or noun groups of the sentence.
Chunking is used to get the required phrases from a given sentence. However, POS tagging can
be used only to spot the parts of speech that every word of the sentence belongs to.
When we have loads of descriptions or modifications around a particular word or the phrase of
our interest, we use chunking to grab the required phrase alone, ignoring the rest around it.
Hence, chunking paves a way to group the required phrases and exclude all the modifiers around
them which are not necessary for our analysis. Interestingly, this process of chunking in NLP is
extended to various other applications; for instance, to group fruits of a specific category, say,
fruits rich in proteins as a group, fruits rich in vitamins as another group, and so on. Besides,
chunking can also be used to group similar cars, say, cars supporting auto-gear into one group
and the others which support manual gear into another chunk and so on.
Types of Chunking
Chunking up
Chunking down
Program:
Dependencies:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
Code:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Sample text
text = "Chunking is a process that helps people understand and remember information better."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
chunked = ne_chunk(tagged)

for subtree in chunked:
    if hasattr(subtree, 'label'):
        # Named-entity chunk: print its label and the words it spans
        print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))
    else:
        # If it's not a named entity, print the token and POS tag
        print(subtree)
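Since the theory above focuses on noun-phrase chunking, the following short additional sketch (an
assumption, not part of the original program) uses NLTK's RegexpParser with a simple chunk grammar:
import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

sentence = "I had burgers and pastries for lunch"  # example sentence
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # optional determiner, adjectives, then nouns
chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(pos_tag(word_tokenize(sentence)))
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join(word for word, tag in subtree.leaves()))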
Output:
Experiment No. 7
Aim: Perform Named Entity Recognition for English Language.
Theory:
Named Entity Recognition (NER) is a technique in natural language processing (NLP) that
focuses on identifying and classifying entities. The purpose of NER is to automatically extract
structured information from unstructured text, enabling machines to understand and categorize
entities in a meaningful manner for various applications like text summarization, question
answering, and knowledge graph construction. This section explores the fundamentals, methods
and implementation of the NER model.
Named-entity recognition (NER) is also referred to as entity identification, entity chunking, and
entity extraction. NER is the component of information extraction that aims to identify and
categorize named entities within unstructured text. NER involves the identification of key
information in the text and its classification into a set of predefined categories. An entity is the
thing that is consistently talked about or referred to in the text, such as person names,
organizations, locations, time expressions, quantities, percentages and other predefined
categories.
NER systems find applications across various domains, including question answering, information
retrieval and machine translation. NER plays an important role in enhancing the precision of
other NLP tasks like part-of-speech tagging and parsing. At its core, NER is just a two-step
process; below are the two steps involved:
● Detecting the entities from the text
● Classifying them into different categories
Ambiguity in NER
● For a person, the category definition is intuitively quite clear, but for computers, there
is some ambiguity in classification. Let’s look at some ambiguous examples:
○ England (Organization) won the 2019 world cup vs The 2019 world cup
happened in England (Location).
○ Washington (Location) is the capital of the US vs The first president of the
US was Washington (Person).
How does Named Entity Recognition (NER) work?
Deep learning NER systems are much more accurate than earlier methods because they are able to
build richer representations of words. This is due to the fact that they use a technique called word
embedding, which captures the semantic and syntactic relationships between words. They are also
able to automatically learn topic-specific as well as high-level features. This makes deep learning
NER applicable to a wide range of tasks, and since deep learning does most of the repetitive work
itself, researchers, for example, can use their time more efficiently.
Program:
Dependencies :
Code :
import spacy
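Only the import statement above survived; the following is a minimal self-contained sketch, assuming
spaCy's small English model has been installed with "python -m spacy download en_core_web_sm" and
using an illustrative sample text:
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Apple is looking at buying a U.K. startup for $1 billion, "
        "and Tim Cook visited London last Monday.")
doc = nlp(text)

# Print each recognised entity with its label
for ent in doc.ents:
    print(ent.text, "->", ent.label_)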
Output:
Conclusion: Hence we have implemented Named Entity Recognition for the English language.
Experiment No. 8
Aim: Perform top down and bottom up parsing using CFG for English language.
Theory:
Parsing :
Parsing is the process of analyzing a string of symbols, either in natural language or computer
languages, to determine its grammatical structure. In the context of context-free grammars
(CFGs), there are two primary approaches to parsing: top-down and bottom-up parsing.
Top-Down Parsing
Top-down parsing starts from the highest-level rule of the grammar and works down to the
leaves (the actual input string). It tries to find a derivation for the string by recursively expanding
non-terminal symbols.
Example:
S → NP VP
NP → Det N
VP → V NP
Det → 'a' | 'the'
N → 'cat' | 'dog'
V → 'chased' | 'saw'
For the input string "the cat chased a dog", the top-down parser would start with S and try to
match the input step by step.
Bottom-Up Parsing
Bottom-up parsing, on the other hand, starts with the input symbols and attempts to construct the
parse tree by reversing the production rules until it reaches the start symbol.
Example:
Using the same CFG as above, a bottom-up parser would take the input string "the cat chased a
dog" and work its way up to derive S, applying the production rules in reverse (a rightmost
derivation in reverse).
class CFG:
    def __init__(self):
        self.grammar = {
            'S': [['NP', 'VP']],
            'NP': [['Det', 'N']],
            'VP': [['V', 'NP']],
            'Det': [['the'], ['a']],
            'N': [['cat'], ['dog']],
            'V': [['chased'], ['saw']]
        }

# Top-Down Parser (recursive descent with backtracking)
def top_down_parse(symbol, input_tokens, index, parse_tree):
    # Terminal symbol: it must match the current input token
    if symbol not in grammar.grammar:
        if index < len(input_tokens) and input_tokens[index] == symbol:
            return index + 1
        return None
    # Non-terminal: try each production in turn, backtracking on failure
    for production in grammar.grammar[symbol]:
        saved_tree = list(parse_tree)
        next_index = index
        for sym in production:
            next_index = top_down_parse(sym, input_tokens, next_index, parse_tree)
            if next_index is None:
                break
        if next_index is not None:
            parse_tree.append((symbol, production))
            return next_index
        parse_tree[:] = saved_tree
    return None

# Bottom-Up Parser (shift-reduce)
def bottom_up_parse(input_tokens):
    stack = []
    for token in input_tokens:
        stack.append(token)  # shift
        while True:          # reduce while some production matches the top of the stack
            for symbol, production in grammar.grammar.items():
                for prod in production:
                    if stack[-len(prod):] == prod:
                        stack = stack[:-len(prod)]
                        stack.append(symbol)
                        break
                else:
                    continue
                break
            else:
                break
    return stack[0] if stack == ['S'] else None

# Example Usage
grammar = CFG()
input_tokens = "the cat chased a dog".split()

# Top-Down Parsing
parse_tree = []
if top_down_parse('S', input_tokens, 0, parse_tree) == len(input_tokens):
    print("Top-Down Parse Tree:", parse_tree)
else:
    print("Top-Down Parsing failed.")

# Bottom-Up Parsing
result = bottom_up_parse(input_tokens)
if result:
    print("Bottom-Up Parsing succeeded with root:", result)
else:
    print("Bottom-Up Parsing failed.")
Output:
Conclusion:
In this experiment, we effectively implemented both top-down and bottom-up parsing techniques
using a context-free grammar for the English language.
Experiment No. 9
Aim: To implement a Text Similarity Recognizer using NLP techniques to identify the
similarity between two chosen text documents.
Theory :
Text similarity is a measure used to quantify the similarity between two text documents. It is
often used in applications like plagiarism detection, document clustering, and information
retrieval. Commonly used similarity measures include:
1. Cosine Similarity: Cosine similarity calculates the cosine of the angle between two
vectors representing text documents. This method is widely used because it is
independent of document length.
2. Jaccard Similarity: This measure compares the shared elements between two sets. For
text documents, these sets could be words or terms.
3. Euclidean Distance: This method measures the straight-line distance between two points
(documents) in a multidimensional space.
Mathematical Representation
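For reference, the standard formulas for the three measures listed above, where A and B denote the
two documents' term-frequency vectors (for cosine similarity and Euclidean distance) or their term
sets (for Jaccard similarity):
\text{Cosine}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}, \qquad
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}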
Program :
import PyPDF2
import nltk
nltk.download('stopwords')
def preprocess_docx(file_path):
try:
doc = Document(file_path)
return preprocess_text(text)
return ""
def preprocess_pdf(file_path):
try:
text = ""
with open(file_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
text += page.extract_text()
return preprocess_text(text)
return ""
# Function to preprocess text: lowercase, remove stopwords, and filter non-alphanumeric words
def preprocess_text(text):
try:
text = text.lower()
stop_words = set(stopwords.words('english'))
f iltered_text = ' '.join([word for word in text.split() if word.isalnum() and word not in
stop_words])
return filtered_text
return ""
file_path1 = "/content/document1.pdf"
file_path2 = "/content/document2.pdf"
# Preprocess the PDF documents
doc1_cleaned = preprocess_pdf(file_path1)
doc2_cleaned = preprocess_pdf(file_path2)
print("One or both documents could not be read. Please check the file paths and try again.")
else:
# Vectorization
vectorizer = CountVectorizer()
cosine_sim = cosine_similarity(count_matrix)[0][1]
# Output Result
Output :
Conclusion: Hence, we have implemented a Text Similarity Recognizer using NLP techniques
to identify the similarity between two chosen text documents.
Experiment No. 10
Aim: To study and implement the concept of WordNet.
Theory:
WordNet is a lexical database of the English language that groups words into sets of synonyms called synsets. It
provides short definitions and usage examples and records various semantic relations between these synonym sets,
including hypernyms (more general terms), hyponyms (more specific terms), meronyms (part-whole relationships),
and more.
Key Concepts:
1. Synsets: A group of synonymous words representing a single concept. For example, the
word "car" may belong to the synset that includes "automobile," "motorcar," etc.
2. Relationships:
● Hyponymy: A relationship where one word is a more specific term than another.
E.g., "dog" is a hyponym of "animal."
● Hypernymy: The inverse of hyponymy; it represents more general terms.
● Meronymy: A part-to-whole relationship. For example, "wheel" is a meronym of
"car."
3. Parts of Speech: WordNet categorizes words into nouns, verbs, adjectives, and adverbs.
4. Usage Examples: Each synset may include example sentences illustrating how the word
is used.
Applications of WordNet :
● Natural Language Processing (NLP): Used in applications like sentiment analysis,
information retrieval, and semantic similarity calculations.
● Machine Learning: Provides semantic features for various algorithms.
Code:
pip install nltk
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
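# The definitions of get_synsets and display_synsets were not captured in the original
# listing; the following is a minimal completion (assumed) consistent with the example
# usage below.
def get_synsets(word):
    # All synsets (sense groupings) that contain the given word
    return wn.synsets(word)

def display_synsets(synsets):
    # Print each synset's name, definition, usage examples and lexical relations
    for syn in synsets:
        print("Synset:", syn.name())
        print("  Definition:", syn.definition())
        print("  Examples:", syn.examples())
        print("  Hypernyms:", [h.name() for h in syn.hypernyms()])
        print("  Hyponyms:", [h.name() for h in syn.hyponyms()])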
# Example usage
word = 'dog'
synsets = get_synsets(word)
display_synsets(synsets)
BACHELOR OF
ENGINEERING IN
COMPUTER ENGINEERING
Submitted By
Naman Bhalani (02)
Sachin Prajapati (31)
Dhrumil Upadhyay (53)
Shrey Varma (56)
University of Mumbai
(AY 2024-25)
TABLE OF CONTENTS
1. Introduction
2. Literature Review
3. Implementation
4. Resources
5. Emoticons:
6. Unicode:
7. Case:
8. Targets:
9. Negation:
10. Sequence of repeated characters:
11. Machine learning
12. Naive Bayes
13. Baseline
14. Improvements
15. Conclusion
16. References
Introduction
Sentiment analysis deals with identifying and classifying opinions or sentiments expressed in source
text. Social media is generating a vast amount of sentiment rich data in the form of tweets, status updates,
blog posts etc. Sentiment analysis of this user generated data is very useful in knowing the opinion of
the crowd. Twitter sentiment analysis is difficult compared to general sentiment analysis due to the
presence of slang words and misspellings. The maximum limit of characters that are allowed in Twitter
is 140. Knowledge base approach and Machine learning approach are the two strategies used for
analyzing sentiments from text. In this project, we try to analyze Twitter posts about electronic
products like mobiles, laptops, etc. using a Machine Learning approach. By doing sentiment analysis in a
specific domain, it is possible to identify the effect of domain information on sentiment classification.
We present a new feature vector for classifying tweets as positive or negative and for extracting people's
opinions about products. In this project, I chose to classify tweets from Twitter into "positive" or
"negative" sentiment by building a model based on probabilities. Twitter is a microblogging website
where people can share their feelings quickly and spontaneously by sending tweets limited to 140
characters. You can directly address a tweet to someone by adding the target sign "@", or
participate in a topic by adding a hashtag "#" to your tweet. Because of the way Twitter is used, it is
a perfect source of data for determining the current overall opinion about anything.
Implementation
To gather the data, many options are possible. In some previous research papers, the authors built a
program to automatically collect a corpus of tweets based on two classes, "positive" and
"negative", by querying Twitter with two types of emoticons:
● Happy emoticons, such as ":)", ":P", ": )" etc.
● Sad emoticons, such as ":(", ":'(", "=(".
Others make their own dataset of tweets by collecting and annotating them manually, which is very
long and tedious.
In addition to finding a way of getting a corpus of tweets, we need to take care to have a balanced data
set, meaning we should have an equal number of positive and negative tweets, but it also needs to
be large enough. Indeed, the more data we have, the better we can train our classifier and the higher
the accuracy will be.
After much research, I found a dataset of 1578612 tweets in English coming from two sources:
Kaggle and Sentiment140. It is composed of four columns: ItemID, Sentiment,
SentimentSource and SentimentText. We are only interested in the Sentiment column,
corresponding to our label class and taking a binary value (0 if the tweet is negative, 1 if the tweet is
positive), and the SentimentText column containing the tweets in raw format.
Table 1. Example of twitter posts annotated with their corresponding sentiment, 0 if it is negative, 1
if it is positive.
In Table 1, showing the first ten Twitter posts, we can already notice some particularities
and difficulties that we are going to encounter during the preprocessing steps.
● The presence of acronyms like "bf" or more complicated ones like "APL". Does it mean apple? Apple (the
company)? In this context we have "friend" afterwards, so we could think that it refers to a
smartphone and hence Apple, but what if the word "friend" were not there?
● The presence of sequences of repeated characters such as
"Juuuuuuuuuuuuuuuuussssst", "hmmmm". In general, when we repeat
several characters in a word, it is to emphasize it, to increase its impact.
● The presence of emoticons, ":O", "T_T", ": |" and many more, gives insights about the
user's mood.
● Spelling mistakes and "urban grammar" like "im gunna" or "mi".
● The presence of nouns such as "TV", "New Moon".
Furthermore, we can also add:
● People also indicate their moods, emotions and states between two asterisks, such as *cries*,
*hummin*, *sigh*.
● The negation: "can't", "cannot", "don't", "haven't", which we need to handle; in "I don't
like chocolate", "like" in this case is negative.
We could also be interested in the grammatical structure of the tweets, or whether a tweet is
subjective/objective, and so on. As you can see, it is extremely complex to deal with languages,
and even more so when we want to analyse text typed by users on the Internet, because people don't
take care to write sentences that are grammatically correct and use a ton of acronyms and words
that are more or less English in our case. We can visualize the dataset a bit more by making a
chart of how many positive and negative tweets it contains.
We have exactly 790177 positive tweets and 788435 negative tweets, which signifies that the dataset
is well balanced. There are also no duplicates.
Finally, let's recall the Twitter terminology, since we are going to have to deal with it in the tweets:
● Hashtag: A hashtag is any word or phrase immediately preceded by the # symbol. When you click
on a hashtag, you'll see other Tweets containing the same keyword or topic.
● @username: A username is how you're identified on Twitter, and is always preceded immediately
by the @ symbol. For instance, Katy Perry is @katyperry.
● MT: Similar to RT (Retweet), an abbreviation for "Modified Tweet." Placed before
the Retweeted text when users manually retweet a message with modifications, for example
shortening a Tweet.
● Retweet: RT, a Tweet that you forward to your followers, is known as a Retweet. Often used to
pass along news or other valuable discoveries on Twitter, Retweets always retain original
attribution.
● Emoticons: Composed using punctuation and letters, they are used to express
emotions concisely, ";) :) ...".
Now that we have the corpus of tweets, we need to use other resources to make the pre-processing
step easier.
Resources
In order to facilitate the pre-processing part of the data, we introduce five resources:
● An emoticon dictionary regrouping 132 of the most used emoticons in the Western world
with their sentiment, negative or positive.
● An acronym dictionary of 5465 acronyms with their translation.
● A stop word dictionary corresponding to words which are filtered out before or
after processing of natural language data because they are not useful in our case.
● Positive and negative word dictionaries giving the polarity (sentiment out of context) of words.
● A dictionary of negative contractions and auxiliaries which will be used to detect negation in
a given tweet, such as "don't", "can't", "cannot", etc.
The introduction of these resources will allow us to make tweets more uniform and remove some of their
complexities, with the acronym dictionary for instance, because a lot of acronyms are used in tweets.
The positive and negative word dictionaries could be useful to increase (or not) the accuracy score
of the classifier. The emoticon dictionary has been built from Wikipedia, with each emoticon
annotated manually. The stop word dictionary contains 635 words such as "the", "of", "without".
Normally they should not be useful for classifying tweets according to their sentiment, but it is
possible that they are.
Also, we use Python 2.7 (https://www.python.org/), a programming language widely used in
data science, and scikit-learn (http://scikit-learn.org/), a very complete and useful library for machine
learning containing every technique and method we need; its website is also full of well-explained
tutorials. With Python, the libraries NumPy (http://www.numpy.org/) and Pandas
(http://pandas.pydata.org/) for manipulating data easily and intuitively are just essential.
Pre-processing
Now that we have the corpus of tweets and all the resources that could be useful, we can pre process
the tweets. It is a very important step, since all the modifications that we make during this process
will directly impact the classifier's performance. The pre-processing includes cleaning, normalization,
transformation, feature extraction and selection, etc. The result of pre-processing is consistent
and uniform data that are workable and maximize the classifier's performance. All of the tweets are
pre-processed by passing through the following steps in the same order.
Emoticons:
We replace all emoticons by their sentiment polarity ||pos|| and ||neg|| using the emoticon
dictionary. To do the replacement, we pass through each tweet and by using a regex we find
out if it contains emoticons, if yes they are replaced by their corresponding polarity.
Unicode:
The data set contains 19469 positive emoticons and 11025 negative emoticons.
Case:
The case is something that can appear useless, but in fact it is really important for distinguishing
proper nouns from other kinds of words. Indeed, "General Motor" is not the same thing as "general
motor", nor is "MSc" the same as "msc". So reducing all letters to lowercase should normally be done
wisely. In this project, for simplicity, we will not take care of that, since we assume that it should not
impact the classifier's performance too much.
Targets:
Targets correspond to usernames in Twitter, preceded by the "@" symbol. They are used to address a
tweet to someone or just to grab attention. We replace all usernames/targets by the tag ||target||.
Notice that in the data set we have 735757 targets.
Table 8. Tweets before processing targets.
Acronyms:
We replace all acronyms with their translation. An acronym is an abbreviation formed from the initial
components in a phrase or a word. Usually these components are individual letters (as in NATO or laser)
or parts of words or names (as in Benelux). Many acronyms are used in our data set of tweets as you can
see in the following bar chart. At this point, tweets are going to be tokenized by getting rid of the
punctuation and using split in order to make the process really fast. We could use nltk's tokenizer,
but it is definitely much slower (though also more accurate).
Figure 3. Top 20 of acronyms in the data set of tweets
As you can see, “lol”, “u”, “im”, “2” are really often used by users. The table below shows the top 20
acronyms with their translation and their count.
Table 10. Top 20 of acronyms in the data set of tweets with their translation and count
Negation:
We replace all negation words such as "not", "no", "never" by the tag ||not|| using the negation dictionary,
in order to better handle sentences like "I don't like it". Here, "like" should not be considered as
positive because of the "don't" before it. To do so we will replace "don't" by ||not||, and the word "like"
will not be counted as positive. We should add that each time a negation is encountered, the
words following the negation word that are contained in the positive and negative word dictionaries
will have their polarity reversed: positive becomes negative, and negative becomes positive. We will do
this when we try to find positive and negative words.
Sequence of repeated characters:
Now, we replace all sequences of repeated characters by two characters (e.g. "helloooo" = "helloo")
to keep the emphasized usage of the word.
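A minimal sketch of the replacement steps described above (emoticons, targets, negation words,
repeated characters), using small illustrative dictionaries in place of the full resources and the
||...|| tag convention of this report:
import re

# Illustrative dictionaries (the real ones described above are much larger)
emoticon_polarity = {":)": "||pos||", ":P": "||pos||", ":(": "||neg||", "=(": "||neg||"}
negation_words = {"not", "no", "never", "don't", "can't", "cannot", "haven't"}

def preprocess_tweet(tweet):
    # Replace emoticons by their polarity tag
    for emoticon, tag in emoticon_polarity.items():
        tweet = tweet.replace(emoticon, " " + tag + " ")
    # Replace @username targets by the ||target|| tag
    tweet = re.sub(r"@\w+", "||target||", tweet)
    # Replace negation words by the ||not|| tag
    tweet = " ".join("||not||" if w.lower() in negation_words else w for w in tweet.split())
    # Reduce sequences of repeated characters to two (e.g. "helloooo" -> "helloo")
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)
    return tweet

print(preprocess_tweet("@katyperry I don't like Mondays :( but I loooove weekends :)"))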
Machine learning
Once we have applied the different steps of the preprocessing part, we can now focus on the machine
learning part. There are three major models used in sentiment analysis to classify a sentence into positive
or negative: SVM, Naive Bayes and Language Models (N-gram). SVM is known to be the model giving
the best results, but in this project we focus only on probabilistic models, namely Naive Bayes and
Language Models, which have been widely used in this field. Let's first introduce the Naive Bayes model,
which is well known for its simplicity and efficiency in text classification.
Naive Bayes
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on
applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive
Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables
(features/predictors) in a learning problem. Maximum likelihood training can be done by evaluating a
closed-form expression (a mathematical expression that can be evaluated in a finite number of operations),
which takes linear time. It is based on the application of Bayes' rule, given by the following formula:
P(C = c | D = d) = P(D = d | C = c) P(C = c) / P(D = d)
where D denotes the document and C the category (label), d and c are instances of D and C, and
P(D = d) = ∑_c P(D = d | C = c) P(C = c).
There are three common variants of the Naive Bayes model:
● The Multi-variate Bernoulli Model: Also called the binomial model, useful if our feature vectors are
binary (i.e., 0s and 1s). An application can be text classification with a bag-of-words model where the
0s and 1s are "word does not occur in the document" and "word occurs in the document" respectively.
● The Multinomial Model: Typically used for discrete counts. In text classification, we extend the
Bernoulli model further by counting the number of times a word w_i appears out of the total number of
words, rather than recording only 0 or 1 for whether the word occurs.
● The Gaussian Model: We assume that features follow a normal distribution. Instead of discrete
counts, we have continuous features.
Baseline
In every machine learning task, it is always good to have what we call a baseline. It is often a "quick and dirty" implementation of a basic model used for a first classification; based on its accuracy, we then try to improve it. We use Multinomial Naive Bayes with Laplace smoothing as the learning algorithm, which is the classic way of doing text classification. Since we need to extract features from our data set of tweets, we use the bag-of-words model to represent it. The bag-of-words model is a simplifying representation of a document as the bag of its words, without taking grammar or word order into account. In text classification, the number of times each word appears in a document is used as a feature for training the classifier. Firstly, we divide the data set into two parts, the training set and the test set. To do this, we first shuffle the data set to get rid of any order applied to the data; then, from the set of positive tweets and the set of negative tweets, we take 3/4 of the tweets from each set and merge them together to make the training set. The rest is used to make the test set. Finally, the size of the training set is 1183958 tweets and the size of the test set is 394654 tweets. Notice that both sets are balanced and follow the same distribution as the initial data set.
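A minimal sketch of this shuffle-and-split step, assuming pos_tweets and neg_tweets are lists of preprocessed tweet strings (the function and variable names are made up, not the project's exact code):

import random

def make_train_test(pos_tweets, neg_tweets, seed=42):
    # Attach sentiment labels, shuffle each class, then take 3/4 of each
    # class for training so that both sets stay balanced
    random.seed(seed)
    pos = [(tweet, 1) for tweet in pos_tweets]
    neg = [(tweet, 0) for tweet in neg_tweets]
    random.shuffle(pos)
    random.shuffle(neg)
    cut_pos, cut_neg = 3 * len(pos) // 4, 3 * len(neg) // 4
    train = pos[:cut_pos] + neg[:cut_neg]
    test = pos[cut_pos:] + neg[cut_neg:]
    random.shuffle(train)   # mix positive and negative examples
    random.shuffle(test)
    return train, test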
Once the training and test sets are created, we actually need a third set of data called the validation set. It is really useful because it is used to validate our model against unseen data and to tune the possible parameters of the learning algorithm, for example to avoid underfitting and overfitting. We need this validation set because the test set should be used only to verify how well the model generalizes. If we used the test set rather than the validation set, our model could be overly optimistic and the results would be skewed. To make the validation set, there are two main options:
● Hold-out split: split the training set into two parts with a ratio of 8:2, where each part contains an equal distribution of example types. We train the classifier with the larger part and make predictions with the smaller one to validate the model. This technique works well but has the disadvantage that the classifier is not trained and validated on all examples in the data set (not counting the test set).
● K-fold cross-validation: we split the data
set into k parts, hold out one, combine the others and train on them, then validate against the held out
portion. We repeat that process k times (each fold), holding out a different portion each time. Then we
average the score measured for each fold to get a more accurate estimation of our model's performance.
We split the training data into 10 folds and cross-validate on them using scikit-learn. The number of folds k is arbitrary; it is usually set to 10, but this is not a rule. In fact, determining the best k is still an open problem: with a lower k, training is computationally cheaper, with less variance but more bias, while with a larger k it is computationally more expensive, with higher variance but lower bias. We can now train the Naive Bayes classifier on the training folds, validate it on the held-out fold (the validation set), repeat this 10 times and average the results to get the final accuracy, which is about 0.77, as shown in the results below.
Figure 7. Results of the Naive Bayes classifier: the score is the average over the 10 cross-validation folds, shown together with the overall confusion matrix.
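For reference, a minimal scikit-learn sketch of this baseline; the function below is illustrative, not the project's exact code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def evaluate_baseline(texts, labels, folds=10):
    # texts: list of preprocessed tweet strings, labels: 0/1 sentiment labels
    pipeline = make_pipeline(
        CountVectorizer(),          # bag-of-words counts
        MultinomialNB(alpha=1.0),   # Laplace smoothing
    )
    scores = cross_val_score(pipeline, texts, labels, cv=folds)
    return scores.mean()            # average accuracy over the folds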
Improvements
From the baseline, the goal is to improve the accuracy of the classifier, currently 0.77, in order to better determine which tweets are positive and which are negative. There are several ways of doing this, and we present only a few possible improvements. First, we could try to remove what are called stop words. Stop words usually refer to the most common words in a language (English in our case), such as "the", "of", "to" and so on. They do not carry any valuable information about the sentiment of a sentence, and it can be useful to remove them from the tweets in order to keep only the words in which we are interested. To do this we use a list of 635 stop words that we found; a short removal sketch is given after the table. In the table below, you can see the most frequent words in the data set with their counts.
Table 13. Most frequent words in the data set with their corresponding count.
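As a sketch of this stop-word removal step, with NLTK's English stop-word list standing in for the 635-word list mentioned above (an assumption):

from nltk.corpus import stopwords

# Requires a one-time nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

def remove_stopwords(tokens):
    # Drop every token that appears in the stop-word list
    return [token for token in tokens if token.lower() not in stop_words]

print(remove_stopwords(["the", "movie", "was", "not", "good"]))
# -> ['movie', 'good']   (NLTK's list also contains "the", "was" and "not")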
We could also try to stem the words in the data set. Stemming is the process by which endings are removed from words in order to remove things like tense or plurality. The stemmed form of a word may not exist in a dictionary (unlike with lemmatization). This technique allows us to unify words and reduce the dimensionality of the data set. It is not appropriate for all cases, but it can make it easier to connect different tenses of the same word and see whether they cover the same subject matter. It is also faster than lemmatization (which removes inflectional endings only and returns the base or dictionary form of a word, known as the lemma). Using NLTK, a Python library specialized in natural language processing, we stem the words in the data set and re-run the classifier. We actually lose 0.002 in accuracy score compared to the baseline results. We conclude that stemming the words does not improve the classifier's accuracy and in fact makes no noticeable difference.
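A small NLTK sketch contrasting stemming and lemmatization (the word list is made up for illustration; lemmatization additionally requires the WordNet data via nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "flies", "connected", "studies"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
# e.g. "flies" stems to "fli" (not a dictionary word) but lemmatizes to "fly"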
Let's introduce language models to see if we can get better results than our baseline. Language models are models that assign probabilities to sequences of words. They were initially used extensively in speech recognition and spelling correction, but it turns out that they also give good results in text classification.
An important note is that n-gram classifiers are in fact a generalization of Naive Bayes: a unigram classifier with Laplace smoothing corresponds exactly to the traditional Naive Bayes classifier. Since we use the bag-of-words model, meaning we translate the sentence "I don't like chocolate" into "I", "don't", "like", "chocolate", we could try to use a bigram model to handle negation, e.g. "don't like" in this example. Using bigrams as features in the classifier, we get the following results:
Figure: Results of the Naive Bayes classifier with bigram features.
Using only bigram features, we improved the accuracy score slightly, by about 0.01. Based on that, we can expect that combining unigram and bigram features could increase the accuracy further.
Figure: Results of the Naive Bayes classifier with unigram and bigram features.
Indeed, we increased the accuracy score slightly, by about 0.02 compared to the baseline.
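A sketch of switching to bigram or combined unigram + bigram features with scikit-learn's CountVectorizer (illustrative, not the project's exact code):

from sklearn.feature_extraction.text import CountVectorizer

bigram_vec = CountVectorizer(ngram_range=(2, 2))   # bigrams only
uni_bi_vec = CountVectorizer(ngram_range=(1, 2))   # unigrams + bigrams

uni_bi_vec.fit(["I don't like chocolate"])
print(uni_bi_vec.get_feature_names_out())          # scikit-learn >= 1.0
# -> ['chocolate' 'don' 'don like' 'like' 'like chocolate']
# (the default token pattern drops one-character tokens such as "I" and "t")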
Conclusion
Nowadays, sentiment analysis or opinion mining is a hot topic in machine learning. We are still far from detecting the sentiment of a corpus of texts very accurately because of the complexity of the English language, and even more so if we consider other languages such as Chinese.
In this project we tried to show a basic way of classifying tweets into a positive or negative category using Naive Bayes as a baseline, and how language models relate to Naive Bayes and can produce better results. We could further improve our classifier by trying to extract more features from the tweets, trying different kinds of features, tuning the parameters of the Naive Bayes classifier, or trying another classifier altogether.
Assignment 2
Problem Statement: The goal of the assignment is to write a tweet tokenizer. The input of
the code will be a set of tweets and the output will be the tokens in each tweet.
Manual tokenization:
Tweet:
1. I’m going to the park! NLP’s great :)
2. Can’t believe it’s already 2024! #TimeFlies @Friend
3. Jones’ car is faster than John’s. LOL!!!
4. R E T W E E T if you agree!
5. U.S.A. is the land of opportunities.
6. "NLP’s future looks bright!" – said @Professor_AI.
Tokenization:
1. ['I', 'am', 'going', 'to', 'the', 'park', '!', 'NLP', "'s", 'great', ':)']
2. ['Can', 'not', 'believe', 'it', "'s", 'already', '2024', '!', '#TimeFlies', '@Friend']
3. ['Jones', "'s", 'car', 'is', 'faster', 'than', 'John', "'s", '.', 'LOL', '!', '!', '!']
4. ['RETWEET', 'if', 'you', 'agree', '!']
5. ['U.S.A.', 'is', 'the', 'land', 'of', 'opportunities', '.']
6. ['"', 'NLP', "'s", 'future', 'looks', 'bright', '!', '"', '–', 'said', '@Professor_AI', '.']
Automated Tokenization:
Code:
def tokenize_tweet(tweet):
    # Expand common clitics and pad punctuation with spaces so that a plain
    # split() separates them into tokens; the exact replacement table used
    # originally is not shown, so these entries are an assumption.
    # Note: multi-character clitics must come before the bare apostrophe.
    clitics = {
        "I'm": "I am", "'": " ' ",
        ".": " . ", ",": " , ", "!": " ! ", "?": " ? ",
        ":": " : ", ";": " ; ", '"': ' " ', "(": " ( ", ")": " ) ",
    }
    # Replace clitics (and punctuation) by their padded forms
    for clitic, replacement in clitics.items():
        tweet = tweet.replace(clitic, replacement)
    tokens = tweet.split()
    return tokens

def tokenize_tweets(tweet_list):
    tokenized_tweets = [tokenize_tweet(tweet) for tweet in tweet_list]
    return tokenized_tweets

# "tweets.txt" is an assumed input file containing one tweet per line
with open("tweets.txt") as file:
    tweet_input = file.read().splitlines()
tokenized_tweets = tokenize_tweets(tweet_input)
Output:
Tweet 1: ['Camping', 'in', 'Maine', 'for', 'the', 'weekend', '.', 'Hey', 'Dad', ',', 'Mama', 'Loves', 'YOU', ':', 'http',
':', '//www', '.', 'mamapalooza', '.', 'com']
Tweet 2: ['Its', 'american', 'tradition', 'bitch']
Tweet 3: ['@ThroughTheVoid', 'They', 'love', 'it', '!', 'The', 'only', 'pleasure', 'they', 'get', 'in', 'life', '.', 'I',
'actually', 'do', 'that', '.', 'I', 'am', 'sure', 'I', 'hear', 'a', 'tiny', 'squeak', '.', '.', '.', 'Then', 'louder', 'ones']
Tweet 4: ['"', 'RT', '@latti', ':', '@AbsoHilare', 'stop', 'tweeting', 'in', 'church', '!', 'Lol', '<---', '"', '"', 'I', 'tweet',
'because', 'I', 'am', 'happy', ',', 'I', 'tweet', 'because', 'I', 'am', 'free', '"', '"', 'LOL', '!', '"']
Tweet 5: ['Samsung', 'Mini', 'S2', 'portable', 'HDD', 'graced', 'with', 'colors', 'that', 'perfectly', 'match', 'your',
'tacky', 'beach', 'gear', ':', 'Sammy', "'", 's', 'done', 'it', 'aga', '.', '.', 'http', ':', '//tinyurl', '.', 'com/lb5p6m']
Tweet 6: ['@dialloc', 'congrats', 'on', 'finding', 'your', 'way', 'over', '.', 'it', 'may', 'be', 'slow', 'going', 'at', 'first',
'.', 'hang', 'in', 'there', '.', 'it', 'is', 'kinda', 'cool', 'when', 'u', 'get', 'up', 'to', 'speed', '.']
Tweet 7: ['iPhone', 'activation', 'delays', 'continue', ',', 'Apple', 'offers', '$30', 'http', ':', '//twt', '.', 'gs/l3Ki']
Tweet 8: ['RT', '@GoogleAtWork', 'Gmail', 'maximum', 'attachment', 'size', 'now', '25MB', 'http', ':', '//bit', '.',
'ly/62mjw', 'Nice', '!', '!', '!']
Tweet 9: ['RT', '@acfou', 'The', 'Ads', 'Won', 'Awards', 'for', 'Crispin', ';', 'But', 'Did', 'Nothing', 'for', 'Client',
'BurgerKing', "'", 's', 'Sales/Marketshare', '-', 'Big', 'Surprise', '-', 'http', ':', '//ping', '.', 'fm/vw8TI']
Tweet 10: ['Hey', 'doll', '!', 'Great', 'I', 'missed', 'True', 'Blood', 'yday', 'boo', 'lol', 'Rt', '@FrankBanuat78',
'@jhillstephens', 'Hello', 'Sunshine', 'how', 'are', 'u', 'today', '?', ':', '-', ')']
Tweet 11: ['Australian', 'artist', 'Pogo', 'made', 'these', 'free', 'songs', 'primarily', 'from', 'sampled', 'audio',
'from', 'Alice', 'In', 'Wonderland', '.', 'http', ':', '//www', '.', 'last', '.', 'fm/music/Pogo/Wonderland']
Tweet 12: ['@mppritchard', 'they', 'wanted', 'to', 'sell', 'all', 'the', 'preorders', '&', 'then', 'sell', 'all', 'of', 'the',
'ones', 'they', 'had', 'in', 'stock', 'to', 'those', 'that', 'just', 'walked', 'in', '.', 'Can', "'", 't', 'do', 'both']
Tweet 13: ['Incoming', ':', 'Frightened', 'Rabbit', ',', 'Sept', '.', '22', '(', 'Tucson', ')', ':', 'If', 'Fat', 'Cat', 'Records',
'is', 'going', 'to', 'send', 'three', 'great', 'bands', 'from', 'Scot', '.', '.', 'http', ':', '//tinyurl', '.', 'com/nz6xcv']
Tweet 14: ['Hey', '@ginoandfran', 'please', 'greet', 'philip', '!', '(', 'GinoandFran', 'live', '>', 'http', ':', '//ustre', '.',
'am/2YyQ', ')']
Tweet 15: ['Ik', 'weet', 'niet', 'wie', 'er', 'achter', 'de', 'T-Mobile', 'iPhone', 'Twitter', 'zit', 'maar', 'ik', 'vind',
'het', 'niet', 'echt', "'", 'corporate', "'", 's', 'taalgebruik', '.', '.', '.', 'Best', 'vreemd', 'eigenlijk']
Tweet 16: ['Polizei-Sondereinsatz', 'mit', 'Hindernissen', 'http', ':', '//tinyurl', '.', 'com/kv7w7p']
Tweet 17: ['we', 'are', 'watching', 'dr', '.', 'phil', 'classics', '.', 'haha', ':', ')', '&', 'we', 'are', 'learning', 'how', 'to',
'not', 'give', 'mixed', 'signals', '.']
Tweet 18: ['Oh', 'yeah', '.', '.', '.', 'Washtenaw', 'Dairy', 'mint', 'chip', '.', 'Just', 'like', 'when', 'I', 'was', 'a', 'wee',
'lad', '.', 'http', ':', '//twitpic', '.', 'com/88pb1']
Tweet 19: ['RT', '@TheTrillYoungB', ':', 'Download', 'my', 'new', 'single', 'True', 'Religion', '!', '!', '!', 'http', ':',
'//tinyurl', '.', 'com/ynctruereligion']
Tweet 20: ['Show', 'support', 'for', 'democracy', 'in', 'Iran', 'add', 'green', 'ribbon', 'to', 'your', 'Twitter', 'avatar',
'with', '1-click', '-', 'http', ':', '//helpiranelection', '.', 'com/']
Tweet 21: ['"', '@shanti45', ':', '"', '"', 'Only', 'just', 'realised', 'I', 'like', 'these', '.', '.', '.', 'lol', '"', '"', '♫', 'http', ':',
'//blip', '.', 'fm/~8szom', '"']
Tweet 22: ['Listening', '@DannyAkin', 'speak', 'at', 'sebts', 'luncheon', '.', 'Exciting', 'times', 'in', '#sbc2009',
'http', ':', '//twitpic', '.', 'com/8amua']
Tweet 23: ['@careyd', 'try', 'OmniFocus', 'for', 'the', 'iPhone', '(', '&Mac', 'if', 'you', 'have', 'it', ')', '.', 'I', "'",
've', 'saved', 'so', 'much', 'time', 'with', 'it', 'I', 'have', 'time', 'to', 'recommend', 'it', 'on', 'Twitter', '.', 'Sam']
Tweet 24: ['"', 'RT', '@Shoq', ':', '"', '"', '.', '@DougCurran', 'See', 'Replies', 'discussed', 'here', ':', 'http', ':',
'//bit', '.', 'ly/shoqtips', '"', '"', '//', 'How', '2', 'get', 'around', 'the', 'replies', 'issue', 'on', 'twitter', '--', '"']
Tweet 25: ['i', 'love', 'mia', 'michaels', 'and', 'randi', '&', 'evan']
Tweet 26: ['MonksDen', 'berrybck', ':', 'only', 'trick', 'is', 'the', 'friggin', 'connectors', '.', 'he', 'may', ':',
'berrybck', 'http', ':', '//tinyurl', '.', 'com/mthfxj']
Tweet 27: ['DEU', 'MILEYYY', 'meigo/']
Tweet 28: ['RT', '@aminjavan', ':', '@rishaholmes', 'thank', 'you', 'for', 'your', 'support', 'thank', 'you', '(', 'no',
'problem', '!', ':', ')']
Tweet 29: ['@lululovesbombay', 'I', 'love', 'breakfast', 'foods', 'at', 'pretty', 'much', 'any', 'time', 'of', 'the', 'day',
'hence', 'all', 'the', 'ideas', ':', ')']