0% found this document useful (0 votes)

167 views5 pages

Assignment 4

This document provides instructions for an assignment on sentiment analysis using different natural language processing techniques. Students are asked to: 1. Preprocess movie review data and represent it as term-document matrices using unigram and bigram models with and without TF-IDF. 2. Train a stochastic gradient descent classifier on the preprocessed data. 3. Use the trained classifier to make predictions on test data and output the results to four files corresponding to the different data representations.

Uploaded by

Ahmed Haa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

167 views5 pages

Assignment 4

Uploaded by

Ahmed Haa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 5

Assignment 4 : NLP

INSTRUCTIONS
Congratulation on making it to the last programming project. By coming this far, we
assume that you have accumulated formidable knowledge in both traditional Artificial
Intelligence (AI) and modern Machine Learning (ML), and from now on we will treat
you as such. This assignment intends to give you a flavor of a real world AI/ML
application, which often require to gather the raw data, do preprocessing, design
suitable ML algorithms and implement the solution. Today, we touch on an active
research area in Natural Language Processing (NLP), sentiment analysis.

Given the exponentially growing of online review data (Amazon, IMDB and etc),
sentiment analysis becomes increasingly important. We are going to build a
sentiment classifier, i.e., evaluating a piece of text being either positive or negative.

The "Large Movie Review Dataset"(*) shall be used for this project. The dataset is
compiled from a collection of 50,000 reviews from IMDB on the condition there are no
more than 30 reviews each movie. Number of positive and negative reviews are equal.
Negative reviews have scores lesser or equal 4 out of 10 while a positive review greater
or equal 7 out of 10. Neutral reviews are not included on the other hand. Then, 50,000
reviews are divided evenly into the training and test set.

*Dataset is credited to Prof. Andrew Mass in the paper, Andrew L. Maas, Raymond E. Daly,
Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning
Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for
Computational Linguistics (ACL 2011).

I. Instruction
Up until now, most of the course projects have been requiring you to implement algorithms
discussed in lectures. This assignment is going to introduce a few advanced concepts of
which implementations demand a non-trivial programming expertise. As such, before
reinventing the wheel, we would advise you to first explore the incredibly powerful existing
Python libraries. The following two are highly recommended:

• http://scikit-learn.org/stable/

• http://pandas.pydata.org/

Stochastic Gradient Descent Classifier

In this project, we will train a Stochastic Gradient Descent Classifier. Recalled from the
Machine Learning project, you were asked to implement a gradient descend update
algorithm for linear regression. While gradient descend is powerful, it can be prohibitively
expensive when the dataset is extremely large because every single data point needs to be
processed.

However, it turns out when the data is large, rather than the entire dataset, SGD algorithm
performs just as good with a small random subset of the original data. This is the central idea
of Stochastic SGD and particarly handy for the text data since corpus are often humongous.

You should read sklearn document and learn how to use a SGD classifier. For adventurers,
you are welcome to manually implement SGD yourself. Wikipedia provides a good first
reference, https://en.wikipedia.org/wiki/Stochastic_gradient_descent.

Data Preprocessing
The training data is provided in the directory
"../resource/lib/publicdata/aclImdb/train/" of Vocareum. If you wish to download
the data to your local machine for inspections, use the following
link: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.

Your first task is explore this directory. There are two sub-directories pos/ for positive
texts and neg/ for negative ones. You do not need to worry about unsup/, and you do
not ned them.

Now combine the raw database into a single csv files, “imdb_tr.csv”. The csv file
should have three columns, "row_number" and “text” and “polarity”. The
column “text” contains review texts from the aclImdb database and the
column “polarity” consists of sentiment labels, 1 for positive and 0 for negative. An
example of "imdb.tr.csv" is provided in the workspace.

In addition, common English stopwords should be removed. An English stopwords

reference are provided in your Vocareum work space for your reference. Your driver.py (as
explained below) will have access to it during run time.

Unigram Data Representation

The very first step in solving any NLP problem is finding a way to represent the text
data so that machines can understand. A common approach is using a document-term
vector where each document is encoded as a discrete vector that counts occurrences of
each word in the vocabulary it contains. For example, consider two one-sentence
documents:
• d1: “I love Columbia Artificial Intelligence course.

• d2: “Artificial Intelligence is awesome”

The vocabulary V = {artificial, awesome, Columbia, course, I, intelligence, is, love} and
two documents can be encoded as v1 and v2 as follow:

Hint: When building our model you should assume no access to the test data. Then
what if there are words that appear only in test data but not in training data? The
features will mismatch if you include those. Therefore, when extracting features in the
test set, you should only use the vocabulary that was used in the training set.

If you wish to know more, start from here https://en.wikipedia.org/wiki/Document-term_

matrix. This data representation is also called a unigram model.

Now, write a python function to transform text column in imdb_tr.csv into a term-
document matrices using unigram model then train a Stochastic Gradient Descent
(SGD) classifier whose loss=“hinge” and penalty=“l1” on this data.

On the other hand, in the driver.py, you will also find the link
to "../resource/lib/publicdata/imdb_te.csv" which is our benchmark file for the performance
of the trained classifier. "imdb_te.csv" has two columns: "row_number" and "text". The
column "polarity" is excluded and your job is to use the trained SGD classifier to predict this
information. You should transform imdb_te.csv using unigram data model as well and use
the trained SGD to predict the converted test set. Predictions must be formatted line by line
and stored in "unigram.output.txt" in your Vocareum workspace. An example of the output
file is provided for your benefits.

If you wish to run the test in your local machine, download the following test file.

Bigram Representation
A more sophisticated data representation model is the bigram model where occurrences
depend on a sequence of two words rather than an individual one. Taking the same
example like before, v1 and v2 are now encoded as follow:
Instead of enumerating every individual words, bigram counts the number of instance a
word following after another one. In both d1 and d2 “intelligence” follows “artificial”
so v1(intelligence | artificial) = v2(intelligence | artificial) = 1. In contrast, “artificial”
does not follow “awesome” so v1(artificial | awesome) = v2(artificial | awesome) = 0.
Repeat the same exercise from Unigram for the Bigram Model Data Representation and
produce the test prediction file "bigram.output.txt" .

Tf-idf:
Sometimes, a very high word counting may not be meaningful. For example, a common
word like “say” may appear 10 times more frequent than a less-common word such as
“machine” but it does not mean “say” is 10 times more relevant to our sentiment
classifier. To alleviate this issue, we can instead use term frequency tf[t] = 1 + log(f[t,d]
) where f[t,d] is the count of term t in document d. The log function dampens the
unwanted influence of common English words.

Inverse document frequency (idf) is a similar concept. To take an example, it is likely

that all of our training documents belong to a same category which has specific jargons.
For example, Computer Science documents often have words such as computers, CPU,
programming and etc appearing over and over. While they are not common English
words, because of the document domain, their occurrences are very high. To rectify, we
can adjust usinginverse term frequency idf[t] = log( N / df[t] ) where df[t] is the
number of documents containing the term t and N is the total number of document in
the dataset.

Therefore, instead of just word frequency, tf-idf for each term t can be used, tf-idf[t] =
tf[t] ∗idf[t].

Repeat the same exercise as in the Unigram and Bigram data model but apply tf-idf this
time to produce test prediction
files, "unigramtfidf.output.txt" and "bigramtfidf.output.txt"

II. What you need to submit:

Your task in this assignment is to write driver.py to produce sentiment predictions over
the imdb_te.csv by various text data representation (unigram, unigram with tf-idf,
bigram and bigram with tf-idf). Please ensure your driver.py write the predictions to the
following files during the run time (one-time outputs are not accepted):

• unigram.output.txt

• unigramtfidf.output.txt

• bigram.output.txt

• bigramtfidf.output.txt

Be very precise with these file names because the auto-grader will rerun your driver.py and
look for them for evaluation. As usual, your program will be run as follows:

$python driver.py

If you want to use Python 3 then simply rename driver.py to driver_3.py and your program
will be executed as:

$python3 driver_3.py

It is highly recommended that before submission you should perform some sanity check so
you will not waste your time and opportunity to submit. Below are something you want to
keep in mind:

- The name of your program file correspond with the expected, exactly

- The name of the output file generated by your program

- The libraries that you are using in your program be allowed (only standards lib)

- The way you read the training and testing data is correct (Be aware of headers! Do not get
off-by-one error!)

- You have performed cross validation on your model

Note: Our grade will not call imdb_data_preprocess() ourselves. You will need to do data
processing under if __name__ == "__main__": by yourself in the driver.

Download Full Quantitative Methods for Business 13th Edition by David R. Anderson (eBook PDF) PDF All Chapters
100% (2)
Download Full Quantitative Methods for Business 13th Edition by David R. Anderson (eBook PDF) PDF All Chapters
41 pages
Glove
100% (1)
Glove
10 pages
Practical C++ Backend Programming
From Everand
Practical C++ Backend Programming
Justin Barbara
No ratings yet
Learn Multithreading with Modern C++
From Everand
Learn Multithreading with Modern C++
James Raynard
No ratings yet
Python for Mechanical and Aerospace Engineering
From Everand
Python for Mechanical and Aerospace Engineering
Alexander Kenan
No ratings yet
KNIME Essentials
From Everand
KNIME Essentials
Gábor Bakos
No ratings yet
Custom Keto Meal Plan Email Sequence
100% (2)
Custom Keto Meal Plan Email Sequence
19 pages
Machine Learning With Python - Unit-5
No ratings yet
Machine Learning With Python - Unit-5
26 pages
Maneesha Nidigonda Verzeo Major Project
No ratings yet
Maneesha Nidigonda Verzeo Major Project
11 pages
Python For Data Science
From Everand
Python For Data Science
Kevin Clark
No ratings yet
COMP 4650 6490 Assignment 3 2023-v1.1
No ratings yet
COMP 4650 6490 Assignment 3 2023-v1.1
6 pages
A Comprehensive Guide To Understand and Implement Text Classification in Python
No ratings yet
A Comprehensive Guide To Understand and Implement Text Classification in Python
34 pages
Maneesha Nidigonda Major Project
No ratings yet
Maneesha Nidigonda Major Project
11 pages
CS771: GROUP-19 Sentiment Analysis in Movie Reviews: Project Report
No ratings yet
CS771: GROUP-19 Sentiment Analysis in Movie Reviews: Project Report
28 pages
Learn Python in One Hour: Programming by Example
From Everand
Learn Python in One Hour: Programming by Example
Victor R. Volkman
3/5 (2)
NLP_DeepNLP
No ratings yet
NLP_DeepNLP
61 pages
Motivation Video: Mitsuku Vs Cleverbot - AI (Artificial Intelligence)
No ratings yet
Motivation Video: Mitsuku Vs Cleverbot - AI (Artificial Intelligence)
45 pages
Everyday Data Structures
From Everand
Everyday Data Structures
William Smith
No ratings yet
Feature extraction techniques in NLP
No ratings yet
Feature extraction techniques in NLP
10 pages
Sentiment Analysis Classification For Rotten Tomatoes Phrases On Kaggle
No ratings yet
Sentiment Analysis Classification For Rotten Tomatoes Phrases On Kaggle
4 pages
Combine PDF
No ratings yet
Combine PDF
124 pages
Programming And Coding in Intermidiate Level
From Everand
Programming And Coding in Intermidiate Level
Memo
No ratings yet
Group 4 MovieReview
No ratings yet
Group 4 MovieReview
10 pages
Programming Concepts in C++
From Everand
Programming Concepts in C++
Robert Burns
No ratings yet
Sentiment Analysis On Movie Reviews: Natural Language Processing UML602 Project Report
No ratings yet
Sentiment Analysis On Movie Reviews: Natural Language Processing UML602 Project Report
13 pages
Predicting The Reviews of The Restaurant Using Natural Language Processing Technique
No ratings yet
Predicting The Reviews of The Restaurant Using Natural Language Processing Technique
4 pages
Sentiment Analysis On Amazon Fine Food Reviews by Using Linear Machine Learning Models
No ratings yet
Sentiment Analysis On Amazon Fine Food Reviews by Using Linear Machine Learning Models
6 pages
TextFeatureEnginerring-NLP lec2
No ratings yet
TextFeatureEnginerring-NLP lec2
60 pages
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
From Everand
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
Marcus Richards
No ratings yet
ML7 - Text Classification
No ratings yet
ML7 - Text Classification
13 pages
Advanced C++ Interview Questions You'll Most Likely Be Asked
From Everand
Advanced C++ Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
Computer Programming The Doctrine
From Everand
Computer Programming The Doctrine
Adesh Silva
No ratings yet
Text Classification_movie Review_news Wires
No ratings yet
Text Classification_movie Review_news Wires
5 pages
HW 5
No ratings yet
HW 5
7 pages
Sentiment Analysis Using Machine Learning Classifiers
No ratings yet
Sentiment Analysis Using Machine Learning Classifiers
41 pages
Instant Heat Maps in R How-to
From Everand
Instant Heat Maps in R How-to
Sebastian Raschka
No ratings yet
Module5-Representing and Mining Text
No ratings yet
Module5-Representing and Mining Text
24 pages
UNIT-II
No ratings yet
UNIT-II
20 pages
Text Categorization and Classification
No ratings yet
Text Categorization and Classification
13 pages
Algorithms and Data Structures: An Easy Guide to Programming Skills
From Everand
Algorithms and Data Structures: An Easy Guide to Programming Skills
Rigdon Jonathan
No ratings yet
Composing Software: An Exploration of Functional Programming and Object Composition in JavaScript
From Everand
Composing Software: An Exploration of Functional Programming and Object Composition in JavaScript
Eric Elliott
No ratings yet
Mastering Data Structures and Algorithms in Python & Java
From Everand
Mastering Data Structures and Algorithms in Python & Java
Sachin Naha
No ratings yet
Unit iv
No ratings yet
Unit iv
58 pages
Learning Advanced Programming
From Everand
Learning Advanced Programming
IT Campus Academy
No ratings yet
Useful Python
From Everand
Useful Python
Stuart Langridge
No ratings yet
Word Embeddings in NLP
No ratings yet
Word Embeddings in NLP
42 pages
Just Enough R: Learn Data Analysis with R in a Day
From Everand
Just Enough R: Learn Data Analysis with R in a Day
Sivakumaran Raman
3.5/5 (2)
NLP tutorial1
No ratings yet
NLP tutorial1
7 pages
Practical C++ Backend Programming: Crafting Databases, APIs, and Web Servers for High-Performance Backend
From Everand
Practical C++ Backend Programming: Crafting Databases, APIs, and Web Servers for High-Performance Backend
Justin Barbara
No ratings yet
CSE4062S21_Group3_Project_Delivery7_FinalReport
No ratings yet
CSE4062S21_Group3_Project_Delivery7_FinalReport
9 pages
NLP Manual
No ratings yet
NLP Manual
21 pages
Python Programming: General-Purpose Libraries; NumPy,Pandas,Matplotlib,Seaborn,Requests,os & sys: Python, #2
From Everand
Python Programming: General-Purpose Libraries; NumPy,Pandas,Matplotlib,Seaborn,Requests,os & sys: Python, #2
e3
No ratings yet
Image Classification: Step-by-step Classifying Images with Python and Techniques of Computer Vision and Machine Learning
From Everand
Image Classification: Step-by-step Classifying Images with Python and Techniques of Computer Vision and Machine Learning
Mark Magic
No ratings yet
ARTIN1 Week 10 NLP, Part 2
No ratings yet
ARTIN1 Week 10 NLP, Part 2
8 pages
RES Presentation
No ratings yet
RES Presentation
21 pages
Language Detector: Bachelor of Engineering (Sem-VIII)
No ratings yet
Language Detector: Bachelor of Engineering (Sem-VIII)
10 pages
Rajeev Mishra 20 SCSE1180087
No ratings yet
Rajeev Mishra 20 SCSE1180087
29 pages
AI Lab Report BIM
No ratings yet
AI Lab Report BIM
34 pages
DOC-20250208-WA0002
No ratings yet
DOC-20250208-WA0002
21 pages
Machine Learning: Hands-On for Developers and Technical Professionals
From Everand
Machine Learning: Hands-On for Developers and Technical Professionals
Jason Bell
No ratings yet
R Object-oriented Programming
From Everand
R Object-oriented Programming
Kelly Black
3/5 (1)
Easy Programming for Everyone
From Everand
Easy Programming for Everyone
Umar Asghar
No ratings yet
Sheet1: Keyword CPC $ (Global)
No ratings yet
Sheet1: Keyword CPC $ (Global)
16 pages
Sheet1: Keyword (Global) (Global) (Global)
No ratings yet
Sheet1: Keyword (Global) (Global) (Global)
12 pages
WMB Status Report 1
No ratings yet
WMB Status Report 1
2 pages
Benefit #1: Weight Loss: Page 1 of 2
No ratings yet
Benefit #1: Weight Loss: Page 1 of 2
2 pages
Lessons Learned Report
No ratings yet
Lessons Learned Report
1 page
HERO - Flat Belly Fix FAQs
No ratings yet
HERO - Flat Belly Fix FAQs
1 page
Gis5008-V1-Db-Writ1 - 1 - 2
No ratings yet
Gis5008-V1-Db-Writ1 - 1 - 2
15 pages
Team Contract: Project Name: Rob's Drugstore
No ratings yet
Team Contract: Project Name: Rob's Drugstore
2 pages
Mathlab Functions
No ratings yet
Mathlab Functions
3 pages
Cpa Is Not Dead
No ratings yet
Cpa Is Not Dead
3 pages
Special Research Fund Announcement 2020 Doctoral Grants For Candidates From Developing Countries North-South "Sandwich" - Type Grants
No ratings yet
Special Research Fund Announcement 2020 Doctoral Grants For Candidates From Developing Countries North-South "Sandwich" - Type Grants
17 pages
1PMP Canvas PDF
No ratings yet
1PMP Canvas PDF
1 page
Bug Report Template
No ratings yet
Bug Report Template
1 page
Assignment 3
No ratings yet
Assignment 3
4 pages
Keysearch 1 Table
No ratings yet
Keysearch 1 Table
1 page
Tolerance
No ratings yet
Tolerance
3 pages
Coles and Davison - Statistical Modelling of Extreme Values - June 2008
100% (1)
Coles and Davison - Statistical Modelling of Extreme Values - June 2008
70 pages
Lecture 2
No ratings yet
Lecture 2
35 pages
Python Will Make You Rich in The Stock Market
No ratings yet
Python Will Make You Rich in The Stock Market
8 pages
Why Do We Need Process Control
100% (2)
Why Do We Need Process Control
6 pages
Shape Functions: Finite Element Analysis in Geotechnical Engineering
No ratings yet
Shape Functions: Finite Element Analysis in Geotechnical Engineering
36 pages
Generation of Elementary Discrete Time Sequence With Formula Generation of Elementary Discrete Time Sequence Without Formula
No ratings yet
Generation of Elementary Discrete Time Sequence With Formula Generation of Elementary Discrete Time Sequence Without Formula
11 pages
MTS-PST-311 Course Outline (BGEN3, BMEN3, and BMMP3) - 2024
No ratings yet
MTS-PST-311 Course Outline (BGEN3, BMEN3, and BMMP3) - 2024
8 pages
Econometricians Assignment
No ratings yet
Econometricians Assignment
4 pages
Introduction To PKI, Certificates & Public Key Cryptography: Erwan Lemonnier
No ratings yet
Introduction To PKI, Certificates & Public Key Cryptography: Erwan Lemonnier
12 pages
[2020 Arxiv]A Survey on The Expressive Power of Graph Neural Networks
No ratings yet
[2020 Arxiv]A Survey on The Expressive Power of Graph Neural Networks
42 pages
Lossless Data Compression Using Neural Networks
No ratings yet
Lossless Data Compression Using Neural Networks
5 pages
Probability (Merged)
No ratings yet
Probability (Merged)
39 pages
NLP 2
No ratings yet
NLP 2
8 pages
Hash & Digital Signature
No ratings yet
Hash & Digital Signature
26 pages
Download full Facilities Design Third Edition Heragu ebook all chapters
No ratings yet
Download full Facilities Design Third Edition Heragu ebook all chapters
41 pages
Practical: 1: Aim: Implementation and Time Analysis of Bubble, Selection and Insertion Bubblesort
No ratings yet
Practical: 1: Aim: Implementation and Time Analysis of Bubble, Selection and Insertion Bubblesort
48 pages
Dsp-(Bec-502)_module 2 & 3
No ratings yet
Dsp-(Bec-502)_module 2 & 3
8 pages
Math 10 Problem Solving
No ratings yet
Math 10 Problem Solving
3 pages
Regularization of Neural Networks Using Dropconnect: Hinton Et Al. 2012 Hinton Et Al. 2012
No ratings yet
Regularization of Neural Networks Using Dropconnect: Hinton Et Al. 2012 Hinton Et Al. 2012
9 pages
LCTM and Gru
No ratings yet
LCTM and Gru
62 pages
Integrating Factors
No ratings yet
Integrating Factors
2 pages
Accenture Coding Needs
No ratings yet
Accenture Coding Needs
2 pages
Inbound 75510600044434454
No ratings yet
Inbound 75510600044434454
12 pages
Ai Roadmap
No ratings yet
Ai Roadmap
3 pages
CE 223 Lesson - Newton-Raphson Method
No ratings yet
CE 223 Lesson - Newton-Raphson Method
24 pages
Engineering Mathematics And Artificial Intelligence Foundations Methods And Applications Herb Kunze download
No ratings yet
Engineering Mathematics And Artificial Intelligence Foundations Methods And Applications Herb Kunze download
79 pages
Probability and Statistics - Int 2 - Answer Paper - Part B
No ratings yet
Probability and Statistics - Int 2 - Answer Paper - Part B
26 pages
Waqas Ahmed
No ratings yet
Waqas Ahmed
33 pages

Assignment 4

Uploaded by

Assignment 4

Uploaded by

Assignment 4 : NLP

Stochastic Gradient Descent Classifier

In addition, common English stopwords should be removed. An English stopwords

Unigram Data Representation

• d2: “Artificial Intelligence is awesome”

If you wish to know more, start from here https://en.wikipedia.org/wiki/Document-term_

Inverse document frequency (idf) is a similar concept. To take an example, it is likely

II. What you need to submit:

- The name of the output file generated by your program

- You have performed cross validation on your model

You might also like