
Shri Vaishnav Vidyapeeth Vishwavidyalaya

Shri Vaishnav Institute of Information Technology


Department of Computer Science Engineering

Lab File
Degree (CSE-Redhat)
Section – F
3rd Year / 6th Semester

Submitted to: Mrs. Archana Choubey


Submitted by: Ayushi Upadhyay
Enrollment Number: 19100BTCSES05363
Course Name: Data Science (BTCS 608)
INDEX

Sr. No.  EXPERIMENT                                                             DATE

1        Study and installation of Jupyter Notebook (Anaconda).                 25/03/22

2        Study of Google Colab.                                                 01/04/22

3        Write a program in Python to predict the class of the flower
         based on available attributes.                                         08/04/22

4        Write a program in Python to predict if a loan will get
         approved or not.                                                       22/04/22

5        Write a program in Python to identify the tweets which are
         hate tweets and which are not.                                         28/04/22

6        Case study on hate tweets.                                             29/04/22

EXPERIMENT – 1
OBJECTIVE – Study and installation of Jupyter Notebook using Anaconda.
Introduction of Jupyter Notebook:

The Jupyter Notebook is an open-source web application that you can use to create and share documents
that contain live code, equations, visualizations, and text. Jupyter Notebook is maintained by the people
at Project Jupyter.

Jupyter Notebooks are a spin-off from the IPython project, which used to have an IPython
Notebook project of its own. The name Jupyter comes from the three core programming languages
it supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which lets you write your
programs in Python, but there are currently over 100 other kernels that you can also use.

Installing Jupyter using Anaconda and conda on Windows:

Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for
scientific computing and data science.

Use the following installation steps:

1. Download Anaconda's latest Python 3 version from
   https://www.anaconda.com/download/#windows (currently Python 3.7).
2. Double-click the downloaded installer and follow the prompts to install Anaconda.

3. Launch Anaconda Navigator and click the Install button under Jupyter Notebook.

4. Jupyter Notebook has now been installed successfully.


5. To start the notebook server, run the following command in a terminal:
   jupyter notebook

EXPERIMENT – 2
OBJECTIVE – Study of Google Colab.
Introduction of Google Colab:
Google Colaboratory, or "Colab" for short, is a product from Google Research. Colab allows
anybody to write and execute arbitrary Python code through the browser, and it is especially well
suited to machine learning, data analysis, and education. More technically, Colab is a hosted
Jupyter notebook service that requires no setup to use, while providing free access to computing
resources, including GPUs.
How to Use Google Colab:
To start working with Colab, first log in to your Google account, then go to
https://colab.research.google.com.

Opening Jupyter Notebook:


On opening the website, you will see a pop-up containing tabs for examples, recent notebooks,
Google Drive, GitHub, and uploads.

Alternatively, you can create a new Jupyter notebook by clicking New Python 3 Notebook or
New Python 2 Notebook at the bottom right corner.

Notebook’s Description:

On creating a new notebook, Colab creates a Jupyter notebook named Untitled0.ipynb and saves
it to your Google Drive in a folder named Colab Notebooks. As it is essentially a Jupyter
notebook, all Jupyter notebook commands work here.

Change Runtime Environment:
Click the "Runtime" dropdown menu, select "Change runtime type", and choose Python 2 or
Python 3 from the "Runtime type" dropdown menu.

Now we are good to go with Google Colab.

EXPERIMENT – 3
OBJECTIVE – Write a program in Python to predict the class of a flower based on its
available attributes.
SOURCE CODE-
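A minimal sketch of one common way to carry out this experiment, assuming the classic Iris
dataset and a k-nearest-neighbours classifier from scikit-learn; the dataset, model choice, and
split are illustrative assumptions:

# Minimal sketch (assumed approach): classify iris flowers from their
# four attributes with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset: 150 samples, 4 attributes (sepal/petal length and width).
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Fit a k-nearest-neighbours classifier and predict the flower class.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))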

OUTPUT-

EXPERIMENT – 4
OBJECTIVE – Write a program in Python to predict if a loan will get approved or not.
SOURCE CODE-
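A minimal sketch of one possible approach, assuming a CSV of past loan applications with a
Loan_Status target column; the file name and column names are illustrative assumptions (based
on the common loan-prediction practice dataset), not the original code:

# Minimal sketch (assumed data layout): predict loan approval with
# logistic regression on one-hot-encoded applicant features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("loan_data.csv")  # hypothetical file with a 'Loan_Status' column

# Fill missing values with each column's most frequent value.
df = df.fillna(df.mode().iloc[0])

# One-hot encode categorical features; encode the target as 0/1.
X = pd.get_dummies(df.drop(columns=["Loan_Status"]))
y = (df["Loan_Status"] == "Y").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))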

OUTPUT-

EXPERIMENT – 5
OBJECTIVE – Write a program in Python to identify which tweets are hate tweets and
which are not.

SOURCE CODE-
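A minimal sketch of one possible pipeline, assuming a CSV with 'tweet' and 'label' columns
(label 1 = hate tweet); the file and column names are illustrative assumptions based on the
dataset used in Experiment 6:

# Minimal sketch (assumed pipeline): clean tweets, build TF-IDF features,
# and classify with logistic regression.
import re
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("train.csv")  # hypothetical file with 'tweet' and 'label' columns

# Basic cleaning: drop user handles, URLs, and non-letter characters.
def clean(text):
    text = re.sub(r"@\w+|https?://\S+", " ", text)
    return re.sub(r"[^a-zA-Z ]", " ", text).lower()

df["tweet"] = df["tweet"].apply(clean)

X_train, X_test, y_train, y_test = train_test_split(
    df["tweet"], df["label"], test_size=0.2, random_state=0, stratify=df["label"])

vec = TfidfVectorizer(stop_words="english", max_features=5000)
model = LogisticRegression(max_iter=1000)
model.fit(vec.fit_transform(X_train), y_train)
print(classification_report(y_test, model.predict(vec.transform(X_test))))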


OUTPUT-

EXPERIMENT – 6
OBJECTIVE – Case study on hate tweets based on two different approaches.
Business Problem:
Toxic online content has become a major issue in today's world due to an exponential increase in
internet use by people of different cultures and educational backgrounds. Differentiating hate
speech from offensive language is a key challenge in the automatic detection of toxic text content.
Using the Twitter dataset, we perform experiments by feeding bag-of-words and term
frequency-inverse document frequency (TFIDF) features to multiple machine learning models, and
we carry out a comparative analysis of the models under both approaches. After tuning the
best-performing model, Logistic Regression, we achieved an accuracy of 89% and a recall of 84%.
We also created a module using Flask that serves as a real-time application of our model.
Problem Statement:
Differentiating hate speech from offensive language on Twitter. In this report, we propose an
approach to automatically classify tweets into two classes: hate speech and non-hate speech.
Using the Twitter dataset, we perform experiments by feeding bag-of-words and TFIDF features
to multiple machine learning models.

Approaches:
Our data preprocessing step involved two approaches: Bag of Words and Term Frequency
Inverse Document Frequency (TFIDF).

The bag-of-words approach is a simplified representation used in natural language processing and
information retrieval. In this approach, a text such as a sentence or a document is represented as
the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

TFIDF is a numerical statistic that is intended to reflect how important a word is to a document in
a collection. It is used as a weighting factor in information retrieval searches, text mining, and
user modeling.

Before we input this data into the various algorithms, we have to clean it, as the tweets contain
many different tenses, grammatical errors, unknown symbols, hashtags, and Greek characters.

Data

(Source: https://www.kaggle.com/vkrahul/twitter-hate-speech)


Data Dictionary

Visualizations and Word Clouds

A word cloud was created to get an idea of the most common words used in tweets; this was done
for both categories, hate and non-hate. Next, we created a bar graph to compare the frequencies
of the most common words in the positive and negative sentiment classes.
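A minimal sketch of the word-cloud step, assuming cleaned tweets in a pandas Series and the
third-party wordcloud package; all names here are illustrative assumptions:

# Minimal sketch: render a word cloud for one category of tweets.
# Requires: pip install wordcloud matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def show_wordcloud(tweets, title):
    # Join all tweets into one string and let WordCloud count word frequencies.
    wc = WordCloud(width=800, height=400,
                   background_color="white").generate(" ".join(tweets))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

# e.g. show_wordcloud(df[df["label"] == 0]["tweet"], "Non-hate tweets")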

Positive tweets


Negative tweets


Data Architecture

We perform strategic sampling and separate the data into a temporary set and a test set. Note that
since we have performed strategic sampling, the ratio of good tweets to hate tweets is 93:7 for
both the temporary and test datasets. On the temporary data, we first tried to up-sample hate
tweets using SMOTE (Synthetic Minority Oversampling Technique). Since SMOTE packages
don't work directly on textual data, we wrote our own code for it. The process is as follows:

We created a corpus of all the unique words present in the hate tweets of the temporary dataset.
Once we had a matrix containing all possible words in hate tweets, we created a blank new
dataset and started filling it with new hate tweets. These new tweets were synthesized by
selecting words at random from the corpus, and their lengths were determined on the basis of the
lengths of the tweets from which the corpus was formed (see the sketch below).
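A minimal sketch of this synthesis step as described above; it assumes hate tweets are given as
lists of tokens, and the function name is hypothetical:

# Minimal sketch (assumed representation): synthesize new hate tweets by
# drawing random words from the hate-tweet corpus, with lengths sampled
# from the lengths of the original tweets.
import random

def synthesize_hate_tweets(hate_tweets, n_new, seed=0):
    rng = random.Random(seed)
    # Corpus of all unique words present in hate tweets.
    corpus = sorted({word for tweet in hate_tweets for word in tweet})
    # Candidate lengths come from the original tweets.
    lengths = [len(tweet) for tweet in hate_tweets]
    new_tweets = []
    for _ in range(n_new):
        k = rng.choice(lengths)  # pick a realistic tweet length
        new_tweets.append([rng.choice(corpus) for _ in range(k)])
    return new_tweets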

We then repeated this process until the number of hate tweets in the synthetic data equalled the
number of non-hate tweets in our temporary data. However, when we employed the Bag of
Words approach for feature generation, the number of features went up to 100,000. Due to this
extremely high number of features, we faced hardware and processing-power limitations and
hence had to discard the SMOTE oversampling method.

As it was not possible to up-sample hate tweets to balance the data, we decided to down-sample
non-hate tweets instead. We took a subset of only the non-hate tweets from the temporary
dataset, selected n random tweets from it, where n is the number of hate tweets in the temporary
data, and joined these with the subset of hate tweets in the temporary data. This dataset is now
the training data that we use for feature generation and modelling; a sketch of this
down-sampling step appears below.

The test data is still in a 93:7 ratio of good tweets to hate tweets, as we did not perform any
sampling on it: real-world data comes in this ratio.
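A minimal sketch of the down-sampling step, assuming the temporary data is a pandas
DataFrame with a binary 'label' column (1 = hate tweet); names are illustrative assumptions:

# Minimal sketch: balance the training data by down-sampling non-hate tweets.
import pandas as pd

def downsample(temp, seed=0):
    hate = temp[temp["label"] == 1]
    non_hate = temp[temp["label"] == 0]
    # Select n random non-hate tweets, where n is the number of hate tweets.
    non_hate_sample = non_hate.sample(n=len(hate), random_state=seed)
    # Join with the hate tweets and shuffle to form the balanced training data.
    return pd.concat([hate, non_hate_sample]).sample(frac=1, random_state=seed)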


Approaches

We have looked at two major approaches for feature generation:

1. Bag of Words (BoW)

2. Term Frequency Inverse Document Frequency (TFIDF)

Bag of words

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in
modeling, such as with machine learning algorithms. It is called a "bag" of words because any
information about the order or structure of words in the document is discarded. The model is
only concerned with whether known words occur in the document, not where in the document
they occur. The intuition is that documents are similar if they have similar content, and that from
the content alone we can learn something about the meaning of the document. The objective is to
turn each document of free text into a vector that we can use as input or output for a machine
learning model.

Consider, for example, a corpus of four short documents:

"It was the best of times"
"it was the worst of times"
"it was the age of wisdom"
"it was the age of foolishness"

The vocabulary of this corpus contains 10 unique words: "it", "was", "the", "best", "of",
"times", "worst", "age", "wisdom", and "foolishness". Because we know the vocabulary has 10
words, we can use a fixed-length document representation of 10, with one position in the vector
to score each word. The simplest scoring method is to mark the presence of words as a Boolean
value: 0 for absent, 1 for present.

Using this ordering of the vocabulary, we can step through the first document ("It was the best
of times") and convert it into a binary vector. The scoring of the document would look as follows:

● "it" = 1

● "was" = 1

● "the" = 1

● "best" = 1

● "of" = 1

● "times" = 1

● "worst" = 0

● "age" = 0

● "wisdom" = 0

● "foolishness" = 0

As a binary vector, this would look as follows:

[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

The other three documents would look as follows:

1. "it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

2. "it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

3. "it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

If your dataset is small and the context is domain-specific, BoW may work better than word
embeddings, since domain-specific terms may have no corresponding vectors in pre-trained
word embedding models (GloVe, fastText, etc.).
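The binary scoring above can be reproduced with scikit-learn's CountVectorizer with
binary=True; note that CountVectorizer orders its vocabulary alphabetically, so the columns
differ from the ordering listed above:

# Reproduce the binary bag-of-words vectors for the four example documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["it was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]

vec = CountVectorizer(binary=True)  # mark presence/absence instead of counts
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # the 10-word vocabulary (alphabetical)
print(X.toarray())                  # one binary vector per document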

TFIDF

TF*IDF is an information retrieval technique that weighs a term's frequency (TF) and its inverse
document frequency (IDF). Each word or term has its respective TF and IDF score, and the
product of the two is called the TF*IDF weight of that term. Put simply, the higher the TF*IDF
weight, the rarer the term, and vice versa.

The TF*IDF algorithm is used to weigh a keyword in any content and assign importance to that
keyword based on the number of times it appears in the document. More importantly, it checks
how relevant the keyword is throughout the whole collection of documents, which is referred to
as the corpus.

For a term t in a document d, the weight W(t,d) of term t in document d is given by:

W(t,d) = TF(t,d) × log(N / DF(t))

where:

● TF(t,d) is the number of occurrences of t in document d,

● DF(t) is the number of documents containing the term t,

● N is the total number of documents in the corpus.

How is TF*IDF calculated? The TF (term frequency) of a word is the frequency of the word (i.e.
the number of times it appears) in a document. Knowing it lets you see whether you are using a
term too much or too little. For example, when a 100-word document contains the term "cat" 12
times, the TF for the word "cat" is TF(cat) = 12/100 = 0.12.

The IDF (inverse document frequency) of a word is a measure of how significant that term is in
the whole corpus.
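A worked toy example of the weighting formula above, on a small hypothetical corpus of three
tokenized documents:

# Compute W(t,d) = TF(t,d) * log(N / DF(t)) by hand on a toy corpus.
import math

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "barks"]]
N = len(docs)  # total number of documents in the corpus

def tfidf(term, doc):
    df = sum(term in d for d in docs)  # number of documents containing the term
    if df == 0:
        return 0.0  # term never appears in the corpus
    tf = doc.count(term)               # occurrences of the term in this document
    return tf * math.log(N / df)

print(tfidf("cat", docs[1]))  # 2 * log(3/2) ≈ 0.81: frequent here, rarer elsewhere
print(tfidf("cat", docs[2]))  # 0.0: term absent from this document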

Key Insights and Learnings: Politics, race, and sexuality form a major chunk of hate tweets.
Results obtained when we experimented with an imbalanced dataset were inaccurate: because of
the skew in the training data, the model predicts new data to belong to the majority class of the
training set. The weighted accuracy of a classifier is not the only metric to consider when
evaluating the performance of the model; business context plays a vital role as well.
