This document analyzes hate speech on Twitter using natural language processing techniques. It loads a Twitter dataset labeled for hate speech; cleans the text by removing URLs, user handles, and hashtags; builds a word cloud and term-frequency counts; vectorizes the tweets with TF-IDF; and trains a logistic regression classifier, evaluating it with cross validation. Hyperparameters are tuned with grid search and stratified k-fold cross validation to identify the best model.


TwitterHate_NLP.ipynb - Colaboratory

Importing Packages

import pandas as pd
import regex as re
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS
from nltk.tokenize.treebank import TreebankWordDetokenizer
from sklearn.model_selection import train_test_split
from nltk.tokenize import TweetTokenizer
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...


[nltk_data] Package stopwords is already up-to-date!
True

Loading Twitter Dataset

sentiment_data = pd.read_csv('/content/TwitterHate.csv')
print(len(sentiment_data))
sentiment_data.head()

31962

   id  label  tweet
0   1      0  @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0  bihday your majesty
3   4      0  #model i love u take with u all the time in ...
4   5      0  factsguide: society now #motivation


sentiment_data['label'].value_counts()
#Imbalanced Dataset

0 29720
1 2242
Name: label, dtype: int64

# from imblearn.over_sampling import RandomOverSampler

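The commented-out import above hints at one common remedy for the imbalance. As a minimal sketch, assuming the imbalanced-learn package is installed: RandomOverSampler duplicates minority-class rows until the classes are balanced. It should be applied only to the training portion of the TF-IDF features built later (X_train, y_train from the split below), never to the test rows:

# Hedged sketch: requires imbalanced-learn; run after the train/test split below
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1)
X_res, y_res = ros.fit_resample(X_train, y_train)
print(pd.Series(y_res).value_counts())  # both classes now equally represented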

Cleaning Text using Regex

def textcleanup(data):
    tk = TweetTokenizer()
    stop_words = set(stopwords.words('english'))
    tweet_list = []
    word_list = []
    for tweet in list(data['tweet']):
        tweet = tweet.encode('ascii', 'ignore').decode('ascii')   # drop non-ASCII characters
        tweet = re.sub(r'[^ ]+\.[^ ]+', '', tweet)                # remove URLs (any token containing a dot)
        tweet = re.sub(r"[#'']", '', tweet)                       # remove # and stray quotes
        tweet = re.sub(r'@\w+', '', tweet)                        # remove user handles
        tweet = re.sub(r'^RT\s+', '', tweet)                      # remove RT-tags (the original '^[RT]' only stripped one leading letter)
        tweet = re.sub(r"\W+\\+[A-Za-z0-9]+\d+\D|\\+[A-Za-z0-9]+\d+\D+\w", '', tweet)  # remove redundant character sequences
        tweet = re.sub(r'\b[a]+[m]+[p]\b', '', tweet)             # remove 'amp' left over from &amp; (raw string so \b is a word boundary, not backspace)
        tweet = tweet.lower().strip()
        tweet = tk.tokenize(tweet)
        tweet = [word for word in tweet if word not in stop_words]
        tweet = list(filter(lambda token: len(token) > 1, tweet)) # drop single-character tokens
        tweet_list.append(tweet)
        word_list.extend(tweet)

    return tweet_list, word_list, stop_words

cleantext,wordlist,stop_words = textcleanup(sentiment_data)
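As a quick sanity check of the cleanup, here is a hypothetical tweet (made up for illustration, not taken from the dataset) run through the function:

sample = pd.DataFrame({'tweet': ['RT @user I love this #sunny day!! http://t.co/abc']})
tokens, _, _ = textcleanup(sample)
print(tokens[0])  # expected: ['love', 'sunny', 'day']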

Getting the 10 most common terms after cleaning the text


word_count = Counter(wordlist)
word_count.most_common(10)

[('love', 2725),
('day', 2247),
('happy', 1673),
('im', 1155),
('time', 1115),
('life', 1114),
('like', 1089),
('today', 993),
('new', 989),
('positive', 934)]
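These counts can also be visualized with seaborn, which is already imported above; a short sketch:

top_terms, top_counts = zip(*word_count.most_common(10))
plt.figure(figsize=(8, 4))
sns.barplot(x=list(top_counts), y=list(top_terms))  # horizontal bars, most frequent term on top
plt.xlabel('frequency')
plt.show()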

wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stop_words,
                      min_font_size=10).generate(' '.join(wordlist))  # join tokens; str(wordlist) would draw brackets and quotes into the cloud

# plot the WordCloud image


plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()


Joining the tokens back into strings.

clean_sentiments = []

for sent in cleantext:
    detokenized_sent = TreebankWordDetokenizer().detokenize(sent)
    clean_sentiments.append(detokenized_sent)

clean_sentiments[0]

'father dysfunctional selfish drags kids dysfunction run'

newframe = {'labels' : sentiment_data['label'], 'clean_sentiments' : clean_sentiments }


sentiments_frame = pd.DataFrame(newframe)
sentiments_frame.head()

   labels  clean_sentiments
0       0  father dysfunctional selfish drags kids dysfun...
1       0  thanks lyft credit cant use cause dont offer w...
2       0  bihday majesty
3       0  model love take time ur
4       0  factsguide society motivation

Using TF-IDF values for the terms as features to get into a vector space model

tfidf_vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=10,
    strip_accents='unicode',
    max_features=5000
)

tfidf_data = tfidf_vectorizer.fit_transform(sentiments_frame['clean_sentiments'])
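A quick way to sanity-check the resulting vector space (a sketch; get_feature_names is the accessor in the scikit-learn version used here, renamed get_feature_names_out in releases 1.0 and later):

print(tfidf_data.shape)                           # (31962, n_features) with n_features capped at 5000
print(tfidf_vectorizer.get_feature_names()[:10])  # first ten vocabulary terms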

Splitting data into train and test sets and creating the model

#Splitting Data
X_train, X_test, y_train, y_test = train_test_split(tfidf_data, sentiments_frame['labels'],
                                                    test_size=0.2)  # 0.2 matches the test support of 6393 reported below

#Creating Model
model = LogisticRegression()
model.fit(X_train, y_train)

train_score = model.score(X_train,y_train)
test_score = model.score(X_test,y_test)

print(train_score)
print(test_score)

0.9557276389377762
0.9510402002189895
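Putting the pieces together, a sketch for scoring a new tweet with the fitted pipeline (the tweet text below is made up for illustration):

new_frame = pd.DataFrame({'tweet': ['@user you people are #awful and should leave']})
new_tokens, _, _ = textcleanup(new_frame)
new_text = [TreebankWordDetokenizer().detokenize(new_tokens[0])]
new_vec = tfidf_vectorizer.transform(new_text)  # reuse the fitted vocabulary; do not refit
print(model.predict(new_vec))                   # 1 = hate speech, 0 = not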

#Generating and Plotting Confusion Matrix
cf_matrix = confusion_matrix(y_test, model.predict(X_test))
plt.figure(figsize=(7, 5))
sns.heatmap(cf_matrix, annot=True, fmt='d')  # fmt='d' shows raw counts instead of scientific notation

<matplotlib.axes._subplots.AxesSubplot at 0x7f2f86c36320>

#Classification Report for Test Data


print(classification_report(y_test, model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      5937
           1       0.90      0.35      0.51       456

    accuracy                           0.95      6393
   macro avg       0.93      0.68      0.74      6393
weighted avg       0.95      0.95      0.94      6393

#Classification Report for Train Data


print(classification_report(y_train, model.predict(X_train)))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     23783
           1       0.94      0.39      0.55      1786

    accuracy                           0.96     25569
   macro avg       0.95      0.69      0.76     25569
weighted avg       0.96      0.96      0.95     25569

Using Grid Search and Stratified KFold for Hyperparameter Tuning

parameters = [{'penalty': ['l1', 'l2'],
               'C': [1, 10, 100, 1000],
               'class_weight': [None, 'balanced']}]  # 'auto' is no longer a valid option in scikit-learn

grid_sr = GridSearchCV(
    LogisticRegression(class_weight='balanced', solver='liblinear'),  # liblinear supports both l1 and l2 penalties
    parameters, scoring='recall'
)
grid_sr.fit(X_train, y_train)

grid_sr.best_params_

{'C': 1, 'class_weight': 'balanced', 'penalty': 'l2'}
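cross_val_score was imported at the top but never used; as a sketch, it can report recall for the tuned parameters across the same stratified folds used below:

cv_model = LogisticRegression(C=1, class_weight='balanced', penalty='l2')
recalls = cross_val_score(cv_model, tfidf_data, sentiments_frame['labels'],
                          cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=1),
                          scoring='recall')
print(recalls.mean())  # average minority-class recall over the four folds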

kfold = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)

# enumerate the splits and summarize the distributions
for train_ix, test_ix in kfold.split(tfidf_data, sentiments_frame['labels']):
    # select rows
    train_X, test_X = tfidf_data[train_ix], tfidf_data[test_ix]
    train_y, test_y = sentiments_frame['labels'][train_ix], sentiments_frame['labels'][test_ix]
    model_test = LogisticRegression(C=1, class_weight='balanced', penalty='l2')
    model_test.fit(train_X, train_y)  # fit on this fold's rows (the original refit the earlier X_train split)

    train_score1 = model_test.score(train_X, train_y)
    test_score1 = model_test.score(test_X, test_y)

    print('Train Score', train_score1)
    print('Test Score', test_score1)

    # stop early once the model scores at least as well on held-out data
    if test_score1 > train_score1:
        break

Train Score 0.9289141045429894


Test Score 0.9300463020898511

print(classification_report(test_y,model_test.predict(test_X)))

              precision    recall  f1-score   support

           0       0.99      0.93      0.96      7430
           1       0.50      0.91      0.65       561

    accuracy                           0.93      7991
   macro avg       0.75      0.92      0.80      7991
weighted avg       0.96      0.93      0.94      7991

print(classification_report(train_y,model_test.predict(train_X)))

              precision    recall  f1-score   support

           0       0.99      0.93      0.96     22290
           1       0.50      0.93      0.65      1681

    accuracy                           0.93     23971
   macro avg       0.75      0.93      0.80     23971
weighted avg       0.96      0.93      0.94     23971

Best parameters: (C=1, class_weight='balanced', penalty='l2')

Recall on the test set for the toxic comments: 0.91
f1-score on the test set for the toxic comments: 0.65
