
TOOLS AND TECHNIQUES

IN DATA SCIENCE
ASSIGNMENT: 05

SUBMITTED BY

FATIMA ZAMEER KHAN


01-249242-006
MS DS(1A)

SUBMITTED TO

DR FATIMA KHALIQUE

NOVEMBER 24, 2024


Text Classification
In this problem, you will be analyzing Twitter data extracted from 2016. The tweets were posted by the following six Twitter accounts: realDonaldTrump, mike_pence, GOP, HillaryClinton, timkaine, TheDemocrats.
For every tweet, two pieces of information are collected:

• screen_name: the Twitter handle of the user tweeting and


• text: the content of the tweet.

Tweets are divided into two parts - the train and test sets. The training set contains both
the screen_name and text of each tweet; the test set only contains the text.
The goal of the problem is to infer the political inclination (whether Republican or Democratic)
of the author from the tweet text. The ground truth (i.e., true class labels) are determined from
the screen_name of the tweet as follows:

• R: realDonaldTrump, mike_pence, GOP


• D: HillaryClinton, timkaine, TheDemocrats

We can treat this as a binary classification problem. We'll follow this common structure to tackle the problem:

1. preprocessing: clean up the raw tweet text using regular expressions, and produce class
labels
2. features: construct bag-of-words feature vectors
3. classification: learn a binary classification model using scikit-learn.

A. Text Processing

Q1 Preprocessing

Your first task is to fill in the following function, which processes and tokenizes raw text. You will need to preprocess the tokens by applying the following operations in order.
1. Convert the text to lower case.
2. Remove any URLs, which in this case will all be of the form http://t.co/<alphanumeric characters>.
3. Remove all trailing 's substrings, then remove any remaining apostrophes:
o remove trailing 's: Children's becomes children
o omit other apostrophes: don't becomes dont
4. Remove all non-alphanumeric (i.e., A-Z, a-z, 0-9) characters (replacing them with a
single space)
5. Split the remaining text by whitespace into an array of individual words
6. Discard empty strings (i.e., if the string after processing above is equal to ""), return an
empty array [] rather than ['']
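A minimal sketch of such a function is shown below; the name preprocess and the use of Python's re module are our choices, not requirements of the assignment:

```python
import re

def preprocess(text):
    """Tokenize raw tweet text following steps 1-6 above."""
    text = text.lower()                            # 1. convert to lower case
    text = re.sub(r"http://t\.co/\w*", "", text)   # 2. strip t.co URLs
    text = re.sub(r"'s", "", text)                 # 3a. remove trailing 's
    text = text.replace("'", "")                   # 3b. omit other apostrophes
    text = re.sub(r"[^a-z0-9]", " ", text)         # 4. non-alphanumerics -> space
    return text.split()                            # 5-6. split() discards empty strings
```

Note that split() with no arguments already returns [] for an all-whitespace string, which takes care of step 6.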

Q2 Loading Data
Using this preprocess function, load the data from the relevant csv files and return a list of the parsed tweets, plus a flag indicating whether or not the tweet is from a Republican account (i.e., one of the three usernames mentioned above); for the test data, where no screen_name is given, provide None as the flag. Note that this function should take less than a second if you've implemented the above preprocessing function efficiently.
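One possible shape for this loader, assuming the csv files have screen_name and text columns as described above (the function name read_data and the compact inline tokenizer are our assumptions):

```python
import csv
import re

REPUBLICANS = {"realdonaldtrump", "mike_pence", "gop"}

def preprocess(text):
    # Compact stand-in for the Q1 tokenizer.
    text = re.sub(r"http://t\.co/\w*", "", text.lower())
    text = re.sub(r"'s", "", text).replace("'", "")
    return re.sub(r"[^a-z0-9]", " ", text).split()

def read_data(filename):
    """Return a list of (tokens, flag) pairs: flag is True for Republican
    accounts, False for Democratic ones, None when screen_name is absent."""
    data = []
    with open(filename, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            flag = None
            if row.get("screen_name"):
                flag = row["screen_name"].lower() in REPUBLICANS
            data.append((preprocess(row["text"]), flag))
    return data
```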
B. Feature Construction
The next step is to derive feature vectors from the tokenized tweets. In this section, you
will be constructing a bag-of-words TF-IDF feature vector.

Q3 Word distributions
The number of possible words is prohibitively large, and not all words are useful for
our task. We will begin by filtering the vectors using a common heuristic: We calculate
a frequency distribution of words in the corpus and remove words at the head (most
frequent) and tail (least frequent) of the distribution. Most frequently used words (often
called stopwords) provide very little information about the similarity of two pieces of
text. Words with extremely low frequency tend to be typos.
We will now implement a function that counts the number of times that each token is
used in the training corpus. You should return a collections.Counter object with the
number of times that each word appears in the dataset.
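Since collections.Counter accepts iterables directly, this function can be very short (the name get_distribution and the (tokens, label) pair format are our assumptions):

```python
from collections import Counter

def get_distribution(data):
    """Count occurrences of every token across all (tokens, label) pairs."""
    counts = Counter()
    for tokens, _label in data:
        counts.update(tokens)  # add each tweet's tokens to the running counts
    return counts
```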
Q4 Vectorizing
Now we have each tweet as a list of words, excluding words with high and low
frequencies. We want to convert these into a sparse feature matrix, where each row
corresponds to a tweet and each column to a possible word. We can use scikit-
learn's TfidfVectorizer to do this quite easily.
Instructions:
• By default, the TfidfVectorizer does its own tokenization, but we've already done it above, so you need to pass preprocessor=lambda x: x, tokenizer=lambda x: x, and token_pattern=None as arguments to the class constructor.
• The vectorizer can filter words that are too uncommon or too common: to do this, set the min_df=5 argument (words must appear in at least 5 tweets) and the max_df=0.4 argument (filter out words contained in more than 40% of tweets).
• You should use only the training data to fit or fit_transform the vectorizer.
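Putting these instructions together, a sketch might look like this (the function name create_features is our assumption; the constructor arguments are the ones specified above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def create_features(train_tokens, test_tokens):
    """Fit a TF-IDF vectorizer on the training tweets only, then transform
    both the training and test sets into sparse feature matrices."""
    vectorizer = TfidfVectorizer(
        preprocessor=lambda x: x,   # tweets are already cleaned...
        tokenizer=lambda x: x,      # ...and tokenized into word lists
        token_pattern=None,
        min_df=5,                   # keep words appearing in at least 5 tweets
        max_df=0.4,                 # drop words in more than 40% of tweets
    )
    X_train = vectorizer.fit_transform(train_tokens)  # fit on training data only
    X_test = vectorizer.transform(test_tokens)        # transform, never fit
    return X_train, X_test
```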
C. Classification
We are now ready to put it all together and train the classification model. You will be using the Support Vector Machine sklearn.svm.LinearSVC. This class implements a linear SVM as described in class, though of course the details vary a little bit with this particular implementation.

Q5 Training a classifier
Let's begin by training a classifier. You should specifically train a LinearSVC with a
given set of features and labels, plus the regularization parameter specified by C. You
can additionally include as arguments to the LinearSVC class the loss =
"hinge" argument (so that this is a typical SVM), and the random_state=0 argument (to
avoid any randomness in the training). Additionally, you should use
the max_iter=10000 argument to make sure that you run for enough iterations to
avoid any failure to converge given the regularization parameters we use.
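With the arguments listed above, the training function reduces to a few lines (the name learn_classifier is our assumption):

```python
from sklearn.svm import LinearSVC

def learn_classifier(X_train, y_train, C):
    """Train a linear SVM with hinge loss and a fixed random state."""
    clf = LinearSVC(C=C, loss="hinge", random_state=0, max_iter=10000)
    clf.fit(X_train, y_train)
    return clf
```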
Q6 Cross validation
After building the function to train this classifier, let's now use a validation set to pick the optimal value of C out of the choices (0.01, 0.1, 1.0, 10.0). The basic approach is to use the first 10000 samples of the training data for training and the remainder for validation, allowing you to choose the best parameter. To evaluate the quality of the classifier, you will use the F1 score, a
common metric for text classification, which you can compute using
the sklearn.metrics.f1_score function.
Specifically, you should implement the function below, which will compute the training
and validation F1 score for different classifiers trained with different values of C.
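One way to organize this (the function name cross_validation and its return format, a dict mapping C to (train F1, validation F1), are our assumptions):

```python
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

def cross_validation(X, y, n_train=10000, C_values=(0.01, 0.1, 1.0, 10.0)):
    """Train on the first n_train rows, validate on the rest, and report
    training and validation F1 scores for each candidate C."""
    X_tr, y_tr = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:], y[n_train:]
    scores = {}
    for C in C_values:
        clf = LinearSVC(C=C, loss="hinge", random_state=0, max_iter=10000)
        clf.fit(X_tr, y_tr)
        scores[C] = (f1_score(y_tr, clf.predict(X_tr)),
                     f1_score(y_val, clf.predict(X_val)))
    return scores
```

The best C is then simply the key whose validation F1 (the second tuple element) is largest.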
Q7 Classifying new Tweets
Finally, let's put this all together. Using the best C value you found in the previous part (i.e., whichever value out of (0.01, 0.1, 1.0, 10.0, 100.0) gave the highest F1 score on the validation set; you can hardcode this value into the function below), train a classifier on the entire training set and make predictions for the test set.
You won't be able to evaluate how accurate these predictions are, of course, but you
can use this classifier to classify tweets as being from Republican or Democratic
sources (or perhaps more precisely, from being from one of the three aforementioned
Republicans or three Democrats during the 2016 election).
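The final pipeline can be sketched as follows; the function name classify is our assumption, and best_C=1.0 is only a placeholder for the value found in the previous part:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def classify(train_data, test_data, best_C=1.0):
    """Vectorize, train on the full training set with the chosen C,
    and predict a Republican/Democratic label for every test tweet."""
    train_tokens = [tokens for tokens, _ in train_data]
    y_train = [label for _, label in train_data]
    test_tokens = [tokens for tokens, _ in test_data]

    # Same vectorizer settings as Q4; fit on training data only.
    vectorizer = TfidfVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x,
                                 token_pattern=None, min_df=5, max_df=0.4)
    X_train = vectorizer.fit_transform(train_tokens)
    X_test = vectorizer.transform(test_tokens)

    # Same classifier settings as Q5, with the hardcoded best C.
    clf = LinearSVC(C=best_C, loss="hinge", random_state=0, max_iter=10000)
    clf.fit(X_train, y_train)
    return clf.predict(X_test)
```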
