Assignment 4
INSTRUCTIONS
Congratulations on making it to the last programming project. By coming this far, we
assume that you have accumulated formidable knowledge of both traditional Artificial
Intelligence (AI) and modern Machine Learning (ML), and from now on we will treat
you as such. This assignment intends to give you a flavor of a real-world AI/ML
application, which often requires gathering raw data, preprocessing it, designing
suitable ML algorithms, and implementing the solution. Today, we touch on an active
research area in Natural Language Processing (NLP): sentiment analysis.
Given the exponential growth of online review data (Amazon, IMDB, etc.),
sentiment analysis is becoming increasingly important. We are going to build a
sentiment classifier, i.e., a model that evaluates whether a piece of text is positive or negative.
The "Large Movie Review Dataset"(*) will be used for this project. The dataset is
compiled from a collection of 50,000 IMDB reviews, with no more than 30 reviews
per movie. The numbers of positive and negative reviews are equal: negative reviews
have scores of 4 or less out of 10, while positive reviews have scores of 7 or more
out of 10; neutral reviews are not included. The 50,000 reviews are then divided
evenly into a training set and a test set.
* The dataset is credited to Prof. Andrew Maas and the paper: Andrew L. Maas, Raymond E. Daly,
Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts (2011). Learning
Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics (ACL 2011).
I. Instructions
Up until now, most of the course projects have required you to implement the algorithms
discussed in lectures. This assignment introduces a few advanced concepts whose
implementations demand non-trivial programming expertise. As such, before
reinventing the wheel, we advise you to first explore the incredibly powerful existing
Python libraries. The following two are highly recommended:
• http://scikit-learn.org/stable/
• http://pandas.pydata.org/
However, it turns out that when the data is large, gradient-based training performs just as
well when each update uses a small random subset of the data rather than the entire dataset.
This is the central idea of Stochastic Gradient Descent (SGD), and it is particularly handy
for text data, since corpora are often humongous. You should read the scikit-learn
documentation and learn how to use an SGD classifier. For adventurers, you are welcome
to implement SGD manually yourself. Wikipedia provides a good first
reference: https://en.wikipedia.org/wiki/Stochastic_gradient_descent.
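As a first taste, here is a minimal sketch of scikit-learn's SGDClassifier on toy data. The feature values below are made up for illustration only; later steps build real features from the reviews.

from sklearn.linear_model import SGDClassifier

# Toy stand-in for text features: 4 documents x 3 features, with 0/1 labels.
X_train = [[1, 0, 2], [0, 1, 0], [2, 0, 1], [0, 2, 0]]
y_train = [1, 0, 1, 0]

# loss="hinge" trains a linear SVM with stochastic updates;
# penalty="l1" adds sparsity-inducing regularization.
clf = SGDClassifier(loss="hinge", penalty="l1")
clf.fit(X_train, y_train)
print(clf.predict([[1, 0, 1]]))  # prints an array containing a 0/1 label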
Data Preprocessing
The training data is provided in the directory
"../resource/lib/publicdata/aclImdb/train/" of Vocareum. If you wish to download
the data to your local machine for inspections, use the following
link: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.
Your first task is to explore this directory. There are two sub-directories: pos/ for positive
texts and neg/ for negative ones. You do not need to worry about unsup/; its contents
are not needed.
Now combine the raw data into a single csv file, "imdb_tr.csv". The csv file
should have three columns: "row_number", "text" and "polarity". The
column "text" contains the review texts from the aclImdb database and the
column "polarity" contains the sentiment labels, 1 for positive and 0 for negative. An
example of "imdb_tr.csv" is provided in the workspace.
Unigram Representation
Consider the vocabulary V = {artificial, awesome, Columbia, course, I, intelligence, is, love}.
Under the unigram (bag-of-words) model, each document is represented by the counts of the
individual words it contains, so two documents d1 and d2 can be encoded as count vectors
v1 and v2 over V.
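The original example encodings are not reproduced here, so assume for concreteness that d1 = "I love artificial intelligence" and d2 = "artificial intelligence is awesome". With V ordered as listed above, the unigram count vectors would be:

v1 = [1, 0, 0, 0, 1, 1, 0, 1]
v2 = [1, 1, 0, 0, 0, 1, 1, 0]

For instance, v1 has a 1 in the positions of "artificial", "I", "intelligence" and "love", and 0 elsewhere.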
Hint: When building your model, you should assume no access to the test data. But
what if there are words that appear only in the test data and not in the training data? The
features would mismatch if you included them. Therefore, when extracting features from the
test set, you should only use the vocabulary that was built from the training set.
Now, write a Python function to transform the text column of imdb_tr.csv into a term-
document matrix using the unigram model, then train a Stochastic Gradient Descent
(SGD) classifier with loss="hinge" and penalty="l1" on this data.
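Here is a sketch of this training step, assuming imdb_tr.csv was produced as above. CountVectorizer is one way (not the only one) to build the unigram term-document matrix; note that it is fitted on the training text only, which is exactly what the hint above asks for.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

train = pd.read_csv("imdb_tr.csv")

# Fitting the vectorizer here builds the vocabulary from the
# training text only, as required by the hint above.
vectorizer = CountVectorizer()  # unigram counts by default
X_train = vectorizer.fit_transform(train["text"])

clf = SGDClassifier(loss="hinge", penalty="l1")
clf.fit(X_train, train["polarity"])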
In driver.py, you will also find the path
"../resource/lib/publicdata/imdb_te.csv", which points to our benchmark file for the
performance of the trained classifier. "imdb_te.csv" has two columns, "row_number" and
"text"; the column "polarity" is excluded, and your job is to use the trained SGD classifier
to predict it. You should transform imdb_te.csv using the unigram data model as well and
use the trained SGD classifier to predict on the converted test set. Predictions must be
written line by line to "unigram.output.txt" in your Vocareum workspace. An example of
the output file is provided for your benefit.
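Continuing the training sketch above, the prediction step might look like the following (vectorizer and clf are reused from the training snippet; whether the test csv needs an explicit encoding argument is something to verify yourself).

import pandas as pd

test = pd.read_csv("../resource/lib/publicdata/imdb_te.csv")

# transform (not fit_transform): reuse the training vocabulary on the test text
X_test = vectorizer.transform(test["text"])
predictions = clf.predict(X_test)

# one 0/1 prediction per line, in the same order as the test rows
with open("unigram.output.txt", "w") as out:
    for label in predictions:
        out.write("%d\n" % label)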
If you wish to run the test on your local machine, download the following test file.
Bigram Representation
A more sophisticated data representation is the bigram model, where occurrences
depend on a sequence of two words rather than an individual one. Using the same
example as before, v1 and v2 are now encoded over bigram counts.
Instead of enumerating every individual word, the bigram model counts the number of times
one word follows another. In both d1 and d2, "intelligence" follows "artificial",
so v1(intelligence | artificial) = v2(intelligence | artificial) = 1. In contrast, "artificial"
does not follow "awesome", so v1(artificial | awesome) = v2(artificial | awesome) = 0.
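In scikit-learn terms, the only change from the unigram sketch is the vectorizer's ngram_range; the documents below are the assumed examples from earlier.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love artificial intelligence",       # assumed d1 from earlier
        "artificial intelligence is awesome"]   # assumed d2 from earlier

# ngram_range=(2, 2) counts pairs of consecutive words instead of single words
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(docs)
print(X.toarray())  # one row per document, one column per observed bigram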
Repeat the same exercise from the unigram model for the bigram data representation and
produce the test prediction file "bigram.output.txt".
Tf-idf
Sometimes, a very high raw word count may not be meaningful. For example, a common
word like "say" may appear 10 times more frequently than a less common word such as
"machine", but that does not mean "say" is 10 times more relevant to our sentiment
classifier. To alleviate this issue, we can instead use the term frequency tf[t,d] = 1 + log(f[t,d]),
where f[t,d] is the count of term t in document d. The log function dampens the
unwanted influence of common English words.
Therefore, instead of the raw word frequency, tf-idf can be used for each term t:
tf-idf[t,d] = tf[t,d] * idf[t], where idf[t] is the inverse document frequency, commonly
defined as idf[t] = log(N / df[t]) with N the total number of documents and df[t] the
number of documents containing t.
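In scikit-learn, TfidfVectorizer combines counting and tf-idf weighting in one step. A minimal sketch follows; sublinear_tf=True applies the 1 + log(f[t,d]) scaling described above, though scikit-learn's default idf formula differs slightly from the one given here.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love artificial intelligence",
        "artificial intelligence is awesome"]

# sublinear_tf=True replaces raw counts with 1 + log(count);
# pass ngram_range=(2, 2) as well for the bigram tf-idf variant
vectorizer = TfidfVectorizer(sublinear_tf=True)
X = vectorizer.fit_transform(docs)  # fit on training text only, as before
print(X.toarray())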
Repeat the same exercise as in the unigram and bigram data models, but apply tf-idf this
time, to produce the test prediction files "unigramtfidf.output.txt" and
"bigramtfidf.output.txt". In total, your driver must produce four output files:
• unigram.output.txt
• unigramtfidf.output.txt
• bigram.output.txt
• bigramtfidf.output.txt
Be very precise with these file names, because the auto-grader will rerun your driver.py and
look for them during evaluation. As usual, your program will be run as follows:
$python driver.py
If you want to use Python 3, simply rename driver.py to driver_3.py, and your program
will be executed as:
$python3 driver_3.py
It is highly recommended that you perform some sanity checks before submission so
you do not waste your time and submission opportunities. Below are some things to
keep in mind:
- The name of your program file must match the expected name exactly.
- The libraries you use in your program must be allowed (only standard libraries).
- The way you read the training and test data must be correct (be aware of headers, and do
not make an off-by-one error!).
Note: Our grader will not call imdb_data_preprocess() itself. You will need to do the data
preprocessing yourself under if __name__ == "__main__": in the driver.
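A possible skeleton for the driver's entry point is sketched below; everything except imdb_data_preprocess (from the preprocessing sketch earlier) and the required output file names is illustrative.

if __name__ == "__main__":
    # build imdb_tr.csv from the raw aclImdb training data
    imdb_data_preprocess("../resource/lib/publicdata/aclImdb/train/")

    # then, for each of the four representations (unigram, bigram, and
    # their tf-idf variants), train on imdb_tr.csv, predict on imdb_te.csv,
    # and write one prediction per line to the matching output file:
    # unigram.output.txt, bigram.output.txt,
    # unigramtfidf.output.txt, bigramtfidf.output.txt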