
Assignment 5 - MLDS Lab

The document discusses performing sentiment analysis on Twitter data using a k-nearest neighbors classifier. It describes preprocessing Twitter data, extracting features from tweets, and using a kNN algorithm for classification. The objective is to analyze Twitter data for sentiment and apply machine learning techniques.

Uploaded by

Amruta More

ASSIGNMENT NO.

Title: Text classification for Sentiment analysis using KNN


Objectives:
1. To handle Twitter data for computational analysis.
2. To analyze data using R programming tools.
Theory:

Sentiment analysis refers to the use of natural language processing, text analysis, and
computational linguistics to systematically identify, extract, quantify, and study affective
states and subjective information. Sentiment analysis is widely applied to customer
materials such as reviews and survey responses. The most common type of sentiment
analysis is ‘polarity detection’, which involves classifying customer materials/reviews as
positive, negative or neutral.

Text Processing
With the increasing importance of computational text analysis in research, many
researchers face the challenge of learning how to use the advanced software that enables
it. Text processing has a direct application to Natural Language Processing, also
known as NLP. NLP is aimed at processing the languages spoken or written by humans
when they communicate with one another. This is different from the communication
between a computer and a human, where the communication is either a computer program
written by a human or some gesture by a human, such as clicking the mouse at some position.
NLP tries to understand the natural language spoken by humans, classify it, analyze it
and, if required, respond to it. Python has a rich set of libraries which cater to the needs
of NLP. The Natural Language ToolKit (NLTK) is a suite of such libraries which provides
the functionalities required for NLP.
Twitter Data
Twitter is an online microblogging tool that disseminates more than 400 million
messages per day, carrying vast amounts of information about almost every industry,
from entertainment to sports and from health to business. One of the best things about
Twitter, and perhaps its greatest appeal, is its accessibility: it is easy to use both
for sharing information and for collecting it. Twitter provides unprecedented access to
our lawmakers and to our celebrities, as well as to news as it is happening. Twitter
is also an important data source for the business models of large companies.
These characteristics make Twitter a good place to collect real-time, up-to-date data
to analyse and to carry out any sort of research on real-life situations.

DATASET DESCRIPTION
We are given the Twitter US Airline Sentiment dataset, which contains 14,601 tweets
about major U.S. airlines. The tweets are labelled as positive, negative, or neutral based
on the nature of the respective Twitter user’s feedback regarding the airline. The dataset is
further split into training and test sets in a stratified fashion: the training set contains
11,680 tweets and the test set contains 2,921 tweets. Our task is to develop and train a
k-nearest neighbors classifier on the training set and use it to predict the sentiment classes
of the tweets in the test set. Here is a sneak peek into the training dataset:

Pre-Processing
Raw tweets scraped from Twitter generally result in a noisy dataset. This is due to the casual
nature of people’s usage of social media. Tweets have certain special characteristics, such
as retweets, emoticons and user mentions, which have to be suitably extracted. Therefore,
raw Twitter data has to be normalized to create a dataset that can be easily learned by
various classifiers. We applied a number of pre-processing steps to
standardize the dataset and reduce its size. We first apply some general pre-processing to
each tweet, as follows.
• Convert the tweet to lower case.
• Replace 2 or more dots (.) with space.
• Strip spaces and quotes (" and ') from the ends of the tweet.
• Replace 2 or more spaces with a single space.
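The four general steps listed above can be sketched in Python with the standard `re` module. This is a minimal illustration and not necessarily the code used for the assignment: the function name `general_preprocess` and the sample tweet are assumptions made for this example.

```python
import re

def general_preprocess(tweet: str) -> str:
    """Apply the four general clean-up steps to a single tweet (hypothetical helper)."""
    tweet = tweet.lower()                    # convert the tweet to lower case
    tweet = re.sub(r"\.{2,}", " ", tweet)    # replace 2 or more dots with a space
    tweet = tweet.strip(" \"'")              # strip spaces and quotes from both ends
    tweet = re.sub(r" {2,}", " ", tweet)     # replace 2 or more spaces with one space
    return tweet

# Made-up sample tweet, used only to exercise the function.
print(general_preprocess('  "Flight DELAYED again...   so  tired"  '))
```

The steps are applied in the order listed; collapsing spaces last also removes the extra spaces introduced when runs of dots are replaced.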
Special Twitter features are handled as follows.
URL:
Users often share hyperlinks to other webpages in their tweets. Any particular URL
is not important for text classification as it would lead to very sparse features. Therefore,
we replace all the URLs in tweets with the word URL. The regular expression used to
match URLs is ((www\.[\S]+)|(https?://[\S]+)).
User Mention
Every twitter user has a handle associated with them. Users often mention other users
in their tweets by @handle. We replace all user mentions with the word USER_MENTION.
The regular expression used to match user mention is @[\S]+.
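The two substitutions above can be sketched with Python's `re` module, using the regular expressions exactly as quoted in the text. The function name `replace_twitter_features` and the sample tweet are assumptions made for this illustration.

```python
import re

# Patterns copied from the text above.
URL_PATTERN = re.compile(r"((www\.[\S]+)|(https?://[\S]+))")
MENTION_PATTERN = re.compile(r"@[\S]+")

def replace_twitter_features(tweet: str) -> str:
    """Replace URLs and user mentions with placeholder words (hypothetical helper)."""
    tweet = URL_PATTERN.sub("URL", tweet)               # any hyperlink -> URL
    tweet = MENTION_PATTERN.sub("USER_MENTION", tweet)  # any @handle -> USER_MENTION
    return tweet

# Made-up sample tweet, used only to exercise the function.
example = "@united thanks for the update, see https://t.co/abc123 and www.united.com"
print(replace_twitter_features(example))
```

Because `[\S]+` matches every non-whitespace character, punctuation attached directly to a handle or link (for example the comma in "@united,") is absorbed into the placeholder as well.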
K-Nearest Neighbours
K-Nearest Neighbours is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and finds extensive
application in pattern recognition, data mining and intrusion detection. It is widely
applicable in real-life scenarios since it is non-parametric, meaning that it does not make any
underlying assumptions about the distribution of the data (as opposed to other algorithms such
as GMM, which assumes a Gaussian distribution of the given data).
The KNN algorithm classifies a sample by finding its K nearest matches in the training data and
then using the labels of those closest matches to predict its class. Traditionally, a distance
such as the Euclidean distance is used to find the closest matches. In the training phase, the
KNN algorithm just stores the dataset; when it gets new data, it assigns that data to the
category of the stored examples it is most similar to.
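The procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names, the choice of k = 3, the majority vote among the k neighbours, and the toy feature vectors and labels are all made up for this example and are not taken from the assignment data.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # "Training" is just storing the data; all the work happens at prediction time.
    neighbours = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    nearest_labels = [label for _, label in neighbours[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy word-count vectors and labels (assumed, for illustration only).
train_vectors = [[2, 0, 0], [1, 0, 1], [0, 2, 0], [0, 1, 1], [0, 0, 2]]
train_labels = ["positive", "positive", "negative", "negative", "neutral"]

print(knn_predict(train_vectors, train_labels, [1, 0, 0], k=3))
```

Every prediction scans the whole training set, so the cost of classifying one tweet grows with the number of stored tweets. Ties in the vote are broken here in favour of the label encountered first among the neighbours.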

Feature Extraction
In the feature extraction step, we will need to represent each tweet as a bag-of-words
(BoW), i.e. an unordered set of words with their positions ignored and all of the emphasis
placed on the respective frequencies of each word. For example, consider these two tweets:
T1 = Welcome to machine learning, machine!
T2 = kNN is a powerful machine learning algorithm.
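The bag-of-words representations (ignoring case and punctuation) of the above two tweets
are:
T1: welcome = 1, to = 1, machine = 2, learning = 1
T2: knn = 1, is = 1, a = 1, powerful = 1, machine = 1, learning = 1, algorithm = 1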

In order to create this bag-of-words representation, we would first need to extract out the
unique words from all of our tweets in the training dataset.
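Building the vocabulary and turning each tweet into a vector of word counts can be sketched as follows, using tweets T1 and T2 from above as the "training set". The function names, the simple regular-expression tokenizer, and the alphabetical ordering of the vocabulary are assumptions made for this illustration.

```python
import re
from collections import Counter

def tokenize(tweet):
    """Lower-case a tweet and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9_]+", tweet.lower())

def build_vocabulary(tweets):
    """Collect the unique words across all tweets in a fixed (alphabetical) order."""
    return sorted({word for tweet in tweets for word in tokenize(tweet)})

def to_bow_vector(tweet, vocabulary):
    """Count how often each vocabulary word occurs in the tweet."""
    counts = Counter(tokenize(tweet))
    return [counts[word] for word in vocabulary]

tweets = [
    "Welcome to machine learning, machine!",
    "kNN is a powerful machine learning algorithm.",
]
vocabulary = build_vocabulary(tweets)
vectors = [to_bow_vector(tweet, vocabulary) for tweet in tweets]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each tweet becomes a vector with one entry per vocabulary word, so all tweets share the same dimensions and can be compared with a distance measure. Words that appear in a new tweet but not in the vocabulary are simply ignored by `to_bow_vector`.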

Conclusion:
Hence, we studied how to perform computing on Twitter data effectively using Business
Intelligence analytical tools.
