Ayushi Data Science Final File
Lab File
Degree (CSE-Redhat)
Section – f
3rd Year / 6th Semester
The Jupyter Notebook is an open-source web application that you can use to create and share documents
that contain live code, equations, visualizations, and text. Jupyter Notebook is maintained by the people
at Project Jupyter.
Jupyter Notebooks are a spin-off project from the IPython project, which used to have an IPython
Notebook project itself. The name Jupyter comes from the core programming languages it supports:
Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write your
programs in Python, but there are currently over 100 other kernels that you can also use.
Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for
scientific computing and data science.
Alternatively, you can create a new Jupyter notebook in Google Colab by clicking New Python 3 Notebook or New Python 2
Notebook at the bottom right corner.
Notebook’s Description:
When you create a new notebook, Colab creates a Jupyter notebook named Untitled0.ipynb and saves it to your
Google Drive in a folder named Colab Notebooks. Since it is essentially a Jupyter notebook, all
commands of Jupyter notebooks will work here.
Now, we are good to go with Google Colab.
Source Code-
OUTPUT-
Approaches:
Our data preprocessing step involved two approaches: Bag of Words and Term Frequency-Inverse
Document Frequency (TFIDF).
The bag-of-words model is a simplifying representation used in natural language processing and
information retrieval. In this approach, a text such as a sentence or a document is represented as the bag
(multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
TFIDF is a numerical statistic that is intended to reflect how important a word is to a document in a
collection. It is used as a weighting factor in searches of information retrieval, text mining, and user
modeling.
Before we input this data into various algorithms, we have to clean it, as the tweets contain many different kinds of noise such as user handles, links, and special characters.
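As an illustration, a minimal cleaning sketch along these lines is shown below. The file name, column names, and the exact regular expressions are assumptions for the example, not the original pipeline.

import re
import pandas as pd

def clean_tweet(text: str) -> str:
    # Lightweight tweet cleaning: drop user handles, links and non-letters, collapse whitespace.
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"http\S+|www\.\S+", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

# Hypothetical usage on the Kaggle training file; "tweet" is the assumed text column.
df = pd.read_csv("train.csv")
df["clean_tweet"] = df["tweet"].apply(clean_tweet)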
Data
(Source: https://www.kaggle.com/vkrahul/twitter-hate-speech)
Data Dictionary
A word cloud was created to get an idea of the most common words used in tweets. This was done for
both categories: hate and non-hate tweets. Next, we created a bar graph to compare the frequencies of
the most common words in the positive and the negative sentiment classes.
Positive tweets
Negative tweets
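The word cloud and bar graph above could be generated with a sketch along the following lines; the wordcloud and matplotlib calls are standard, but the dataframe name, column names, and the label encoding (0 = non-hate, 1 = hate) are assumptions.

from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Word clouds for each class (assumes df with "clean_tweet" and "label" columns).
for label, title in [(0, "Non-hate tweets"), (1, "Hate tweets")]:
    text = " ".join(df.loc[df["label"] == label, "clean_tweet"])
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

# Bar graph of the 20 most frequent words in hate tweets.
top_hate = Counter(" ".join(df.loc[df["label"] == 1, "clean_tweet"]).split()).most_common(20)
words, counts = zip(*top_hate)
plt.figure(figsize=(10, 4))
plt.bar(words, counts)
plt.xticks(rotation=45)
plt.title("Most common words in hate tweets")
plt.show()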
Data Architecture
We perform stratified sampling and separate the data into a temporary set and a test set. Note that since we
have performed stratified sampling, the ratio of good tweets to hate tweets is 93:7 for both the temporary
and test datasets. On the temporary data, we first tried to perform up-sampling of hate tweets using
SMOTE (Synthetic Minority Oversampling Technique). Since the SMOTE packages do not work directly
on textual data, we wrote our own code for it. The process is as follows:
We created a corpus of all the unique words present in hate tweets of the temporary dataset. Once we had
a matrix containing all possible words in hate tweets, we created a blank new dataset and started filling it
with new hate tweets. These new tweets were synthesized by selecting words at random from the corpus.
The lengths of these new tweets were determined on the basis of the lengths of the original hate tweets from which the corpus of words was built.
We then repeated this process multiple times until the number of hate tweets in this synthetic data was
equal to the number of non-hate tweets we had in our temporary data. However, when we employed the
Bag of Words approach for feature generation, the number of features went up to 100,000. Due to an
extremely high number of features, we faced hardware and processing-power limitations and hence had to abandon this approach.
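Although this approach was eventually abandoned, a rough sketch of the custom oversampling procedure described above might look like the following; the dataframe and column names, and the way tweet lengths are drawn, are assumptions rather than the original code.

import random
import pandas as pd

# Assumed: temp_df has "clean_tweet" and "label" columns (1 = hate, 0 = non-hate).
hate_tweets = temp_df.loc[temp_df["label"] == 1, "clean_tweet"]
corpus = sorted({word for tweet in hate_tweets for word in tweet.split()})
lengths = [len(tweet.split()) for tweet in hate_tweets]

n_needed = int((temp_df["label"] == 0).sum() - (temp_df["label"] == 1).sum())
synthetic = []
for _ in range(n_needed):
    k = random.choice(lengths)                                # length drawn from real hate-tweet lengths
    synthetic.append(" ".join(random.choices(corpus, k=k)))   # words drawn at random from the corpus

synthetic_df = pd.DataFrame({"clean_tweet": synthetic, "label": 1})
balanced_df = pd.concat([temp_df, synthetic_df], ignore_index=True)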
As it was not possible to up-sample hate tweets to balance the data, we decided to down-sample non-hate
tweets to make it even. We took a subset of only the non-hate tweets from the temporary dataset. From
this subset, we selected n random tweets, where n is the number of hate tweets in the temporary data. We
then joined this with the subset of hate tweets in the temporary data. This dataset is now our training data.
The test data is still in a 93:7 ratio of good tweets to hate tweets, as we did not perform any sampling on it;
sampling was not performed because real-world data arrives in this ratio.
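The sampling pipeline that was finally used (a stratified split followed by down-sampling of the majority class) could be sketched as follows; the split ratio and random seeds are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split

# Stratified split keeps the 93:7 class ratio in both the temporary and test sets.
temp_df, test_df = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=42)

# Down-sample non-hate tweets so the training data has as many of them as hate tweets.
hate = temp_df[temp_df["label"] == 1]
non_hate = temp_df[temp_df["label"] == 0].sample(n=len(hate), random_state=42)
train_df = pd.concat([hate, non_hate]).sample(frac=1, random_state=42)  # shuffle the rows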
Approaches
We have looked at two major approaches for feature generation: Bag of Words (BOW) and Term
Frequency-Inverse Document Frequency (TFIDF).
1. Bag of Words
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling,
such as with machine learning algorithms. It is called a “bag” of words because any information about
the order or structure of words in the document is discarded. The model is only concerned with whether
known words occur in the document, not where in the document. The intuition is that documents are
similar if they have similar content. Further, from the content alone we can learn something about the
meaning of the document. The objective is to turn each document of free text into a vector that we can
use as input or output for a machine learning model. As an example, consider a small corpus of four
documents taken from the opening lines of A Tale of Two Cities: “It was the best of times”, “it was the
worst of times”, “it was the age of wisdom”, “it was the age of foolishness”. Because the vocabulary of this
corpus has 10 unique words, we can use a fixed-length document representation of 10, with one position
in the vector to score each word.
The simplest scoring method is to mark the presence of words as a Boolean value, 0 for absent, 1 for
present.
Using an arbitrary ordering of the words in our vocabulary, we can step through the first
document (“It was the best of times”) and convert it into a binary vector:
● “it” = 1
● “was” = 1
● “the” = 1
● “best” = 1
● “of” = 1
● “times” = 1
● “worst” = 0
● “age” = 0
● “wisdom” = 0
● “foolishness” = 0
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
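The same kind of binary representation can be produced with scikit-learn's CountVectorizer. This is an illustrative sketch rather than the original lab code; note that CountVectorizer orders the vocabulary alphabetically, so the positions differ from the listing above.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

vectorizer = CountVectorizer(binary=True)   # binary=True scores presence/absence, not counts
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # the 10-word vocabulary (alphabetical order)
print(X.toarray()[0])                       # binary vector for the first document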
If your dataset is small and the context is domain-specific, BoW may work better than word embeddings,
because a highly domain-specific context means you often cannot find corresponding vectors in pre-trained
word embedding models.
TFIDF
TF*IDF is an information retrieval technique that weighs a term’s frequency (TF) and its inverse
document frequency (IDF). Each word or term has its respective TF and IDF score. The product of the
TF and IDF scores of a term is called the TF*IDF weight of that term.
Put simply, the higher the TF*IDF score (weight), the rarer the term is across the collection and the more important it is to that particular document, and vice versa.
The TF*IDF algorithm is used to weigh a keyword in any content and assign an importance to that
keyword based on the number of times it appears in the document. More importantly, it checks how
relevant the keyword is across the whole collection of documents (the corpus).
For a term t in a document d, the weight W(t, d) of term t in document d is given by:
W(t, d) = TF(t, d) × log(N / DF(t))
Where:
● TF(t, d) is the term frequency of t in d (how often t appears in d),
● DF(t) is the number of documents in the collection that contain the term t,
● N is the total number of documents in the collection.
How is TF*IDF calculated? The TF (term frequency) of a word is the frequency of that word in a document
(the number of times it appears, usually normalized by the document's length). When you know it, you're
able to see if you're using a term too much or too little.
For example, when a 100-word document contains the term “cat” 12 times, the TF for the word “cat” is
12/100 = 0.12.
The IDF (inverse document frequency) of a word is the measure of how significant that term is in the
whole corpus.
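As an illustration of how these weights are computed in practice, here is a small sketch using scikit-learn's TfidfVectorizer. It is only an example: scikit-learn uses a smoothed IDF (ln((1 + N) / (1 + DF(t))) + 1) and L2-normalizes each row, so the numbers differ slightly from the plain formula above.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray()[0].round(3))  # TF-IDF weights for the first document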
Key Insights and Learnings: Politics, race, and sexual tweets form a major chunk of hate tweets. Results
obtained when we experimented with an imbalanced dataset were inaccurate: the model predicts new
data to belong to the majority class in the training set due to the skewed nature of the training data. The
weighted accuracy of a classifier is not the only metric to be looked at while evaluating the performance
of a model on imbalanced data.
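For instance, an evaluation sketch that looks beyond plain accuracy might be as follows; y_test and y_pred are hypothetical names for the true and predicted labels of the untouched 93:7 test set.

from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))
# Precision, recall, and F1 per class expose how well the minority (hate) class is handled.
print(classification_report(y_test, y_pred, target_names=["non-hate", "hate"]))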