
TOOLS AND TECHNIQUES

IN DATA SCIENCE
ASSIGNMENT: 05

SUBMITTED BY

FATIMA ZAMEER KHAN


01-249242-006
MS DS(1A)

SUBMITTED TO

DR FATIMA KHALIQUE

NOVEMBER 24, 2024


Text Classification
In this problem, you will be analyzing Twitter data extracted from 2016. The tweets were posted by the following six Twitter accounts: realDonaldTrump, mike_pence, GOP, HillaryClinton, timkaine, TheDemocrats.
For every tweet, two pieces of information are collected:

• screen_name: the Twitter handle of the user tweeting and


• text: the content of the tweet.

Tweets are divided into two parts - the train and test sets. The training set contains both
the screen_name and text of each tweet; the test set only contains the text.
The goal of the problem is to infer the political inclination (whether Republican or Democratic)
of the author from the tweet text. The ground truth (i.e., true class labels) are determined from
the screen_name of the tweet as follows:

• R: realDonaldTrump, mike_pence, GOP


• D: HillaryClinton, timkaine, TheDemocrats

We can treat this as a binary classification problem. We'll follow this common structure to tackle the problem:

1. preprocessing: clean up the raw tweet text using regular expressions, and produce class
labels
2. features: construct bag-of-words feature vectors
3. classification: learn a binary classification model using scikit-learn.

A. Text Processing

Q1 Preprocessing

Your first task is to fill in the following function, which processes and tokenizes raw text. You will need to preprocess the tokens by applying the following operations in order.
1. Convert the text to lower case.
2. Remove any URLs, which in this case will all be of the form http://t.co/<alphanumeric characters>.
3. Remove all trailing 's substrings, then remove any remaining apostrophes:
o remove trailing 's: Children's becomes children
o omit other apostrophes: don't becomes dont
4. Remove all non-alphanumeric (i.e., A-Z, a-z, 0-9) characters (replacing them with a
single space)
5. Split the remaining text by whitespace into an array of individual words
6. Discard empty strings (i.e., if the string after processing above is equal to ""), return an
empty array [] rather than ['']
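A minimal sketch of such a function is shown below; the name preprocess and the use of Python's re module are our choices, not requirements of the assignment:

```python
import re

def preprocess(text):
    """Tokenize raw tweet text following steps 1-6 above."""
    text = text.lower()                            # 1. convert to lower case
    text = re.sub(r"http://t\.co/\w*", "", text)   # 2. strip t.co URLs
    text = re.sub(r"'s", "", text)                 # 3a. remove trailing 's
    text = text.replace("'", "")                   # 3b. omit other apostrophes
    text = re.sub(r"[^a-z0-9]", " ", text)         # 4. non-alphanumerics -> space
    return text.split()                            # 5-6. split() discards empty strings
```

Note that split() with no arguments already returns [] for an all-whitespace string, which takes care of step 6.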

Q2 Loading Data
Using this preprocess function, load the data from the relevant csv files and return a list of the parsed tweets, plus a flag indicating whether or not the tweet is from a Republican account (i.e., one of the three usernames mentioned above); for the test data, where no screen_name is given, provide None as the flag. Note that this function should take less than a second if you've implemented the above preprocessing function efficiently.
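One possible shape for this loader, assuming the csv files have screen_name and text columns as described above (the function name read_data and the compact inline tokenizer are our assumptions):

```python
import csv
import re

REPUBLICANS = {"realdonaldtrump", "mike_pence", "gop"}

def preprocess(text):
    # Compact stand-in for the Q1 tokenizer.
    text = re.sub(r"http://t\.co/\w*", "", text.lower())
    text = re.sub(r"'s", "", text).replace("'", "")
    return re.sub(r"[^a-z0-9]", " ", text).split()

def read_data(filename):
    """Return a list of (tokens, flag) pairs: flag is True for Republican
    accounts, False for Democratic ones, None when screen_name is absent."""
    data = []
    with open(filename, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            flag = None
            if row.get("screen_name"):
                flag = row["screen_name"].lower() in REPUBLICANS
            data.append((preprocess(row["text"]), flag))
    return data
```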
B. Feature Construction
The next step is to derive feature vectors from the tokenized tweets. In this section, you
will be constructing a bag-of-words TF-IDF feature vector.

Q3 Word distributions
The number of possible words is prohibitively large, and not all words are useful for
our task. We will begin by filtering the vectors using a common heuristic: We calculate
a frequency distribution of words in the corpus and remove words at the head (most
frequent) and tail (least frequent) of the distribution. Most frequently used words (often
called stopwords) provide very little information about the similarity of two pieces of
text. Words with extremely low frequency tend to be typos.
We will now implement a function that counts the number of times that each token is
used in the training corpus. You should return a collections.Counter object with the
number of times that each word appears in the dataset.
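Since collections.Counter accepts iterables directly, this function can be very short (the name get_distribution and the (tokens, label) pair format are our assumptions):

```python
from collections import Counter

def get_distribution(data):
    """Count occurrences of every token across all (tokens, label) pairs."""
    counts = Counter()
    for tokens, _label in data:
        counts.update(tokens)  # add each tweet's tokens to the running counts
    return counts
```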
Q4 Vectorizing
Now we have each tweet as a list of words, excluding words with high and low
frequencies. We want to convert these into a sparse feature matrix, where each row
corresponds to a tweet and each column to a possible word. We can use scikit-
learn's TfidfVectorizer to do this quite easily.
Instructions:
• By default, the TfidfVectorizer does its own tokenization, but we've already done it above, so you need to pass preprocessor=lambda x: x, tokenizer=lambda x: x, and token_pattern=None as arguments to the class constructor.
• The vectorizer can filter words that are too uncommon or too common: to do this, set the min_df=5 argument (words must appear in at least 5 tweets) and the max_df=0.4 argument (filter out words contained in more than 40% of tweets).
• You should use only the training data to fit or fit_transform the vectorizer.
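Putting these instructions together, a sketch might look like this (the function name create_features is our assumption; the constructor arguments are the ones specified above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def create_features(train_tokens, test_tokens):
    """Fit a TF-IDF vectorizer on the training tweets only, then transform
    both the training and test sets into sparse feature matrices."""
    vectorizer = TfidfVectorizer(
        preprocessor=lambda x: x,   # tweets are already cleaned...
        tokenizer=lambda x: x,      # ...and tokenized into word lists
        token_pattern=None,
        min_df=5,                   # keep words appearing in at least 5 tweets
        max_df=0.4,                 # drop words in more than 40% of tweets
    )
    X_train = vectorizer.fit_transform(train_tokens)  # fit on training data only
    X_test = vectorizer.transform(test_tokens)        # transform, never fit
    return X_train, X_test
```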
C. Classification
We are now ready to put it all together and train the classification model. You will be using the Support Vector Machine sklearn.svm.LinearSVC. This class implements a linear SVM as described in class, though of course the details vary a little bit with this particular implementation.

Q5 Training a classifier
Let's begin by training a classifier. You should specifically train a LinearSVC with a
given set of features and labels, plus the regularization parameter specified by C. You
can additionally include as arguments to the LinearSVC class the loss =
"hinge" argument (so that this is a typical SVM), and the random_state=0 argument (to
avoid any randomness in the training). Additionally, you should use
the max_iter=10000 argument to make sure that you run for enough iterations to
avoid any failure to converge given the regularization parameters we use.
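With the arguments listed above, the training function reduces to a few lines (the name learn_classifier is our assumption):

```python
from sklearn.svm import LinearSVC

def learn_classifier(X_train, y_train, C):
    """Train a linear SVM with hinge loss and a fixed random state."""
    clf = LinearSVC(C=C, loss="hinge", random_state=0, max_iter=10000)
    clf.fit(X_train, y_train)
    return clf
```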
Q6 Cross validation
After building the function to train this classifier, let's now use a validation set to pick the optimal value of C out of the choices (0.01, 0.1, 1.0, 10.0). The basic approach is to use the first 10000 samples of the training data for training and the remainder for validation, allowing you to choose the best parameter. To evaluate the quality of the classifier, you will use the F1 score, a
common metric for text classification, which you can compute using
the sklearn.metrics.f1_score function.
Specifically, you should implement the function below, which will compute the training
and validation F1 score for different classifiers trained with different values of C.
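One way to organize this (the function name cross_validation and its return format, a dict mapping C to (train F1, validation F1), are our assumptions):

```python
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

def cross_validation(X, y, n_train=10000, C_values=(0.01, 0.1, 1.0, 10.0)):
    """Train on the first n_train rows, validate on the rest, and report
    training and validation F1 scores for each candidate C."""
    X_tr, y_tr = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:], y[n_train:]
    scores = {}
    for C in C_values:
        clf = LinearSVC(C=C, loss="hinge", random_state=0, max_iter=10000)
        clf.fit(X_tr, y_tr)
        scores[C] = (f1_score(y_tr, clf.predict(X_tr)),
                     f1_score(y_val, clf.predict(X_val)))
    return scores
```

The best C is then simply the key whose validation F1 (the second tuple element) is largest.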
Q7 Classifying new Tweets
Finally, let's put this all together. Using the best C value you found in the previous part (i.e., whichever value out of (0.01, 0.1, 1.0, 10.0, 100.0) gave the highest F1 score on the validation set; you can hardcode this value into the function below), train a classifier on the entire training set and make predictions for the test set.
You won't be able to evaluate how accurate these predictions are, of course, but you
can use this classifier to classify tweets as being from Republican or Democratic
sources (or perhaps more precisely, from being from one of the three aforementioned
Republicans or three Democrats during the 2016 election).
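The final pipeline can be sketched as follows; the function name classify is our assumption, and best_C=1.0 is only a placeholder for the value found in the previous part:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def classify(train_data, test_data, best_C=1.0):
    """Vectorize, train on the full training set with the chosen C,
    and predict a Republican/Democratic label for every test tweet."""
    train_tokens = [tokens for tokens, _ in train_data]
    y_train = [label for _, label in train_data]
    test_tokens = [tokens for tokens, _ in test_data]

    # Same vectorizer settings as Q4; fit on training data only.
    vectorizer = TfidfVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x,
                                 token_pattern=None, min_df=5, max_df=0.4)
    X_train = vectorizer.fit_transform(train_tokens)
    X_test = vectorizer.transform(test_tokens)

    # Same classifier settings as Q5, with the hardcoded best C.
    clf = LinearSVC(C=best_C, loss="hinge", random_state=0, max_iter=10000)
    clf.fit(X_train, y_train)
    return clf.predict(X_test)
```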
