Assignment 5 - MLDS Lab

The document discusses performing sentiment analysis on Twitter data using a k-nearest neighbors classifier. It describes preprocessing Twitter data, extracting features from tweets, and using a kNN algorithm for classification. The objective is to analyze Twitter data for sentiment and apply machine learning techniques.

Uploaded by

Amruta More

ASSIGNMENT NO.

Title: Text classification for Sentiment analysis using KNN

Objectives:
1. To process Twitter data for computational analysis.
2. To analyze data using R programming tools.
Theory:

Sentiment analysis refers to the use of natural language processing, text analysis, and
computational linguistics to systematically identify, extract, quantify, and study affective
states and subjective information. Sentiment analysis is widely applied to customer
materials such as reviews and survey responses. The most common type of sentiment
analysis is 'polarity detection', which involves classifying customer materials/reviews as
positive, negative or neutral.

Text Processing
With the increasing importance of computational text analysis in research, many
researchers face the challenge of learning how to use advanced software that enables this
text analysis. Text processing has a direct application to Natural Language Processing, also
known as NLP. NLP is aimed at processing the languages spoken or written by humans
when they communicate with one another. This is different from the communication
between a computer and a human where the communication is either a computer program
written by a human or some gesture by a human like clicking the mouse at some position.
NLP tries to understand the natural language spoken by humans, classify it, analyze it
and, if required, respond to it. Python has a rich set of libraries which cater to the needs
of NLP. The Natural Language Toolkit (NLTK) is a suite of such libraries which provides
the functionalities required for NLP.
Twitter Data
Twitter is an online microblogging tool that disseminates more than 400 million
messages per day, including vast amounts of information about almost all industries,
from entertainment to sports, health to business, etc. One of the best things about
Twitter, and perhaps its greatest appeal, is its accessibility. It is easy to use both
for sharing information and for collecting it. Twitter provides unprecedented access to
our lawmakers and to our celebrities, as well as to news as it is happening. Twitter
represents an important data source for the business models of huge companies as well.
All of the above characteristics make Twitter one of the best places to collect real-time, up-to-date data
to analyse and to do any sort of research on real-life situations.

DATASET DESCRIPTION
We are given the Twitter US Airline Sentiment dataset, which contains 14,601 tweets
about major U.S. airlines. The tweets are labelled as positive, negative, or neutral based
on the nature of the respective Twitter user's feedback regarding the airline. The dataset is
further split into training and test sets in a stratified fashion: the training set contains 11,680
tweets and the test set contains 2,921 tweets. Our task is to develop and train a k-nearest
neighbors classifier on the training set and use it to predict the sentiment classes of the tweets
in the test set. Here is a sneak peek into the training dataset that we have on our
hands:

Pre-Processing
Raw tweets scraped from Twitter generally result in a noisy dataset. This is due to the casual
nature of people's usage of social media. Tweets have certain special characteristics such
as retweets, emoticons, user mentions, etc. which have to be suitably extracted. Therefore,
raw Twitter data has to be normalized to create a dataset which can be easily learned by
various classifiers. We have applied a number of pre-processing steps to
standardize the dataset and reduce its size. We first apply the following general
pre-processing steps to each tweet:
• Convert the tweet to lower case.
• Replace 2 or more dots (.) with a space.
• Strip spaces and quotes (” and ’) from the ends of the tweet.
• Replace 2 or more spaces with a single space.
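The four general steps above can be sketched in Python as follows. This is a minimal illustration, not the assignment's actual code: the function name `preprocess_general` is hypothetical, and the set of quote characters stripped (straight and curly, single and double) is an assumption.

```python
import re

def preprocess_general(tweet: str) -> str:
    """Apply the four general clean-up steps in the order listed."""
    tweet = tweet.lower()                    # 1. convert to lower case
    tweet = re.sub(r"\.{2,}", " ", tweet)    # 2. two or more dots -> one space
    # 3. strip spaces and quotes from both ends
    #    (assumed quote set: " ' “ ” ‘ ’)
    tweet = tweet.strip(" \"'“”‘’")
    tweet = re.sub(r" {2,}", " ", tweet)     # 4. two or more spaces -> one space
    return tweet

print(preprocess_general('  "Flight DELAYED again... so   tired"  '))
# prints: flight delayed again so tired
```

Collapsing spaces last matters here, because replacing dots with a space in step 2 can itself create a run of spaces.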
Special Twitter features are handled as follows.
URL:
Users often share hyperlinks to other webpages in their tweets. Any particular URL
is not important for text classification as it would lead to very sparse features. Therefore,
we replace all the URLs in tweets with the word URL. The regular expression used to
match URLs is ((www\.[\S]+)|(https?://[\S]+)).
User Mention
Every twitter user has a handle associated with them. Users often mention other users
in their tweets by @handle. We replace all user mentions with the word USER_MENTION.
The regular expression used to match user mention is @[\S]+.
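The two substitutions described above can be sketched with Python's `re` module, using the regular expressions exactly as quoted in the text. The function name `replace_twitter_tokens` and the sample tweet are hypothetical and chosen only for illustration.

```python
import re

# Patterns copied from the text above
URL_PATTERN = re.compile(r"((www\.[\S]+)|(https?://[\S]+))")
MENTION_PATTERN = re.compile(r"@[\S]+")

def replace_twitter_tokens(tweet: str) -> str:
    """Replace every URL with 'URL' and every @handle with 'USER_MENTION'."""
    tweet = URL_PATTERN.sub("URL", tweet)
    tweet = MENTION_PATTERN.sub("USER_MENTION", tweet)
    return tweet

print(replace_twitter_tokens("@united thanks! see https://t.co/abc123 or www.united.com"))
# prints: USER_MENTION thanks! see URL or URL
```

Note that `[\S]+` matches every non-space character, so punctuation attached to a handle or link (for example the comma in `@united,`) is absorbed into the placeholder.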
K-Nearest Neighbours
K-Nearest Neighbours is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and is used extensively
in pattern recognition, data mining and intrusion detection. It is widely
applicable in real-life scenarios since it is non-parametric, meaning that it does not make any
underlying assumptions about the distribution of data (as opposed to other algorithms such
as GMM, which assume a Gaussian distribution of the given data).
The KNN algorithm classifies a new data point by finding its K nearest matches in the training data and
then using the labels of those closest matches to predict its class. Traditionally, a distance measure such as Euclidean
distance is used to find the closest matches. In the training phase, the KNN algorithm just stores the dataset;
when it gets new data, it classifies that data into the category that is most similar to
the new data.
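A minimal sketch of this procedure in plain Python is shown below, assuming Euclidean distance and a simple majority vote among the K nearest neighbours. The function names and the toy two-dimensional vectors are hypothetical; in the assignment the vectors would be the feature vectors extracted from tweets.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # "Training" is just storing the data; all the work happens at query time.
    neighbours = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    nearest_labels = [label for _, label in neighbours[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (made up for illustration only)
train_vectors = [[2, 0], [3, 1], [0, 2], [1, 3], [0, 3]]
train_labels = ["positive", "positive", "negative", "negative", "negative"]

print(knn_predict(train_vectors, train_labels, [2, 1], k=3))    # prints: positive
print(knn_predict(train_vectors, train_labels, [0, 2.5], k=3))  # prints: negative
```

With three classes, as in this dataset, a vote can end in a tie; this sketch then returns whichever tied label appears first among the nearest neighbours, which is one possible tie-breaking rule among several.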

Feature Extraction
In the feature extraction step, we will need to represent each tweet as a bag-of-words
(BoW), i.e. an unordered collection of words with their positions ignored and all of the emphasis
placed on the respective frequencies of each word. For example, consider these two tweets:
T1 = Welcome to machine learning, machine!
T2 = kNN is a powerful machine learning algorithm.
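The bag-of-words representation (ignoring case and punctuation) for the above two tweets
is:
T1: welcome = 1, to = 1, machine = 2, learning = 1
T2: knn = 1, is = 1, a = 1, powerful = 1, machine = 1, learning = 1, algorithm = 1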

In order to create this bag-of-words representation, we would first need to extract out the
unique words from all of our tweets in the training dataset.
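The vocabulary extraction and the resulting count vectors can be sketched as follows. This is an illustrative sketch: the helper names (`tokenize`, `build_vocabulary`, `to_bow_vector`), the tokenization rule (lower-case runs of letters, digits and underscores) and the alphabetical vocabulary order are assumptions, not details given in the assignment.

```python
import re
from collections import Counter

def tokenize(tweet):
    """Lower-case the tweet and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9_]+", tweet.lower())

def build_vocabulary(tweets):
    """Sorted list of the unique words across all training tweets."""
    return sorted({word for tweet in tweets for word in tokenize(tweet)})

def to_bow_vector(tweet, vocabulary):
    """Count how often each vocabulary word occurs in the tweet."""
    counts = Counter(tokenize(tweet))
    return [counts[word] for word in vocabulary]

tweets = [
    "Welcome to machine learning, machine!",
    "kNN is a powerful machine learning algorithm.",
]
vocabulary = build_vocabulary(tweets)
print(vocabulary)
# ['a', 'algorithm', 'is', 'knn', 'learning', 'machine', 'powerful', 'to', 'welcome']
print(to_bow_vector(tweets[0], vocabulary))  # [0, 0, 0, 0, 1, 2, 0, 1, 1]
print(to_bow_vector(tweets[1], vocabulary))  # [1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Because the vocabulary is built from the training tweets only, words that appear only in a test tweet have no slot in the vector and are simply ignored by `to_bow_vector`.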

Conclusion:
Hence, we studied how to perform computing on Twitter data effectively using Business Intelligence
analytical tools.
