
TWITTER SENTIMENT ANALYSIS

Introduction
Detecting hate speech in tweets involves classifying each tweet as either containing racist or sexist sentiment or not. Using Python libraries, we can employ NLP techniques and machine learning algorithms: by analyzing the text content and applying sentiment analysis models, we can train a classifier to distinguish between tweets with hate speech and those without.

Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a branch of natural language processing (NLP) that involves the use of computational techniques to determine and extract subjective information from text data. It aims to analyze and understand the sentiment, emotions, attitudes, and opinions expressed within a given piece of text.

The primary goal of sentiment analysis is to automatically classify the sentiment of a text document, such as a tweet, review, or customer feedback, into different categories, typically positive, negative, or neutral. However, sentiment analysis can also include more fine-grained sentiment classifications, such as very positive, positive, neutral, negative, and very negative.

Sentiment analysis techniques leverage various approaches, including machine learning algorithms, lexicon-based methods, and rule-based systems. These techniques process text data by examining patterns, semantic structures, linguistic features, and context to determine the sentiment orientation.
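As a concrete illustration of the lexicon-based family, here is a minimal, hypothetical scorer. The tiny lexicon below is invented purely for illustration; production systems use curated lexicons such as VADER or SentiWordNet:

# a deliberately tiny, made-up polarity lexicon (illustrative only)
lexicon = {"love": 1.0, "happy": 1.0, "thanks": 0.5,
           "hate": -1.0, "selfish": -0.8, "disappointed": -0.6}

def lexicon_sentiment(text):
    # sum the polarity of every known word; unknown words score 0
    score = sum(lexicon.get(w, 0.0) for w in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("thanks i love this"))   # positive
print(lexicon_sentiment("he is so selfish"))     # negative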

IMPORT NECESSARY LIBRARIES


In [135]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [136]: from sklearn import model_selection, preprocessing, linear_model, metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn import ensemble
from sklearn.metrics import roc_auc_score, roc_curve
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import TextBlob, Word

import re
from termcolor import colored
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

from warnings import filterwarnings
filterwarnings('ignore')

from sklearn import set_config
set_config(print_changed_only = False)

In [137]: test_set = pd.read_csv(r"C:\Desktop\Data Analyst Project\Sentiment Analysis\test.csv",
                                 encoding = "utf-8", engine = "python", header = 0)
train_set = pd.read_csv(r"C:\Desktop\Data Analyst Project\Sentiment Analysis\train.csv",
                        encoding = "utf-8", engine = "python", header = 0)

In [138]: # "orange" is not a color termcolor supports; "yellow" is the closest
print(colored("\nDATASETS WERE SUCCESSFULLY LOADED...", color = "yellow", attrs = ["dark", "bold"]))

DATASETS WERE SUCCESSFULLY LOADED...

First five rows from the train dataset

In [139]: train_set.head(n = 5).style.background_gradient(cmap = "PiYG")

Out[139]:
  id label tweet

0 1 0 @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run

1 2 0 @user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked

2 3 0 bihday your majesty

3 4 0 #model i love u take with u all the time in ur📱!!! 😙😎👄👅💦💦💦

4 5 0 factsguide: society now #motivation

First five rows from the test dataset


In [140]: test_set.head(n=5).style.background_gradient(cmap='PiYG')

Out[140]:
  id tweet

0 31963 #studiolife #aislife #requires #passion #dedication #willpower to find #newmaterials…

1 31964 @user #white #supremacists want everyone to see the new ‘ #birds’ #movie — and here’s why

2 31965 safe ways to heal your #acne!! #altwaystoheal #healthy #healing!!

3 31966 is the hp and the cursed child book up for reservations already? if yes, where? if no, when? 😍😍😍 #harrypotter #pottermore #favorite

4 31967 3rd #bihday to my amazing, hilarious #nephew eli ahmir! uncle dave loves you and misses…

In [141]: #shape of both datasets

In [142]: format(train_set.shape)

Out[142]: '(31962, 3)'

In [143]: format(test_set.shape)

Out[143]: '(17197, 2)'

In [144]: train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 31962 non-null int64
1 label 31962 non-null int64
2 tweet 31962 non-null object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB

In [145]: test_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17197 entries, 0 to 17196
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 17197 non-null int64
1 tweet 17197 non-null object
dtypes: int64(1), object(1)
memory usage: 268.8+ KB

In [146]: train_set.groupby("label").count().style.background_gradient(cmap="autumn")

Out[146]:
  id tweet

label    

0 29720 29720

1 2242 2242

In [147]: train_set_len = train_set['tweet'].str.len()
test_set_len = test_set['tweet'].str.len()
print("train data length :", train_set_len)
print("test data length :", test_set_len)

train data length : 0    102
1    122
2     21
3     86
4     39
...
31957     68
31958    131
31959     63
31960     67
31961     32
Name: tweet, Length: 31962, dtype: int64
test data length : 0     90
1    101
2     71
3    142
4     93
...
17192    108
17193     96
17194    145
17195    104
17196     64
Name: tweet, Length: 17197, dtype: int64

In [148]: pos = 100*len(train_set.loc[train_set['label']==0, 'label'])/len(train_set['label'])
neg = 100*len(train_set.loc[train_set['label']==1, 'label'])/len(train_set['label'])

In [149]: print(f'Percentage of non-racist/sexist (label 0) tweets is {pos}')
print(f'Percentage of racist/sexist (label 1) tweets is {neg}')
print('\nClearly, the dataset is heavily imbalanced towards label 0.')

Percentage of non-racist/sexist (label 0) tweets is 92.98542018647143

Percentage of racist/sexist (label 1) tweets is 7.014579813528565

Clearly, the dataset is heavily imbalanced towards label 0.

Data Exploration

In [150]: plt.hist(train_set_len, bins=22, label='train_tweet')
plt.hist(test_set_len, bins=22, label='test_tweet')
plt.legend()
plt.show()

In [151]: sns.countplot(data=train_set, x='label', hue='label')
plt.title('Types of tweets: 0 -> Non Racist/Sexist, 1 -> Racist/Sexist')
plt.xlabel('Label')
plt.show()

In [152]: length_train = train_set['tweet'].str.len().plot.hist(color = 'blue', figsize = (6, 4))
length_test = test_set['tweet'].str.len().plot.hist(color = 'pink', figsize = (6, 4))

In [153]: c = CountVectorizer(stop_words='english')
word = c.fit_transform(train_set.tweet)
summation = word.sum(axis=0)   # total count of each vocabulary word across all tweets
print(summation)

[[ 51 28 2 ... 272 1 2]]

In [154]: freq = [(word, summation[0, i]) for word, i in c.vocabulary_.items()]
freq = sorted(freq, key=lambda x: x[1], reverse=True)
frequency = pd.DataFrame(freq, columns=['word', 'freq'])
print(frequency)

word freq
0 user 17577
1 love 2749
2 day 2311
3 amp 1776
4 happy 1686
... ... ...
41099 isz 1
41100 airwaves 1
41101 mantle 1
41102 shirley 1
41103 chisolm 1

[41104 rows x 2 columns]


In [155]: #most frequently used words
ax = frequency.head(20).plot(x='word', y='freq', kind='bar', figsize=(15, 7), color = 'green')
plt.title("20 most frequently used words on Twitter")
plt.show()

In [156]: #Count number of words

def num_of_words(df):
    df['word_count'] = df['tweet'].apply(lambda x: len(str(x).split(" ")))
    print(df[['tweet', 'word_count']].head())

In [157]: num_of_words(train_set)

tweet word_count
0 @user when a father is dysfunctional and is s... 21
1 @user @user thanks for #lyft credit i can't us... 22
2 bihday your majesty 5
3 #model i love u take with u all the time in ... 17
4 factsguide: society now #motivation 8

In [158]: num_of_words(test_set)

tweet word_count
0 #studiolife #aislife #requires #passion #dedic... 12
1 @user #white #supremacists want everyone to s... 20
2 safe ways to heal your #acne!! #altwaystohe... 15
3 is the hp and the cursed child book up for res... 24
4 3rd #bihday to my amazing, hilarious #nephew... 18

In [159]: #Count number of characters

def num_of_chars(df):
    df['char_count'] = df['tweet'].str.len()  # this also includes spaces
    print(df[['tweet', 'char_count']].head())

In [160]: num_of_chars(train_set)

tweet char_count
0 @user when a father is dysfunctional and is s... 102
1 @user @user thanks for #lyft credit i can't us... 122
2 bihday your majesty 21
3 #model i love u take with u all the time in ... 86
4 factsguide: society now #motivation 39

In [161]: num_of_chars(test_set)

tweet char_count
0 #studiolife #aislife #requires #passion #dedic... 90
1 @user #white #supremacists want everyone to s... 101
2 safe ways to heal your #acne!! #altwaystohe... 71
3 is the hp and the cursed child book up for res... 142
4 3rd #bihday to my amazing, hilarious #nephew... 93
In [162]: #Number of stopwords
set(stopwords.words('english'))

Out[162]: {..., 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves',
'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's",
'should', "should've", 'shouldn', "shouldn't", ...}   (output truncated)
In [163]: stop = stopwords.words('english')

In [164]: def stop_words(df):
    df['stopwords'] = df['tweet'].apply(lambda x: len([w for w in x.split() if w in stop]))
    print(df[['tweet', 'stopwords']].head())

In [165]: stop_words(train_set)

tweet stopwords
0 @user when a father is dysfunctional and is s... 10
1 @user @user thanks for #lyft credit i can't us... 5
2 bihday your majesty 1
3 #model i love u take with u all the time in ... 5
4 factsguide: society now #motivation 1

In [166]: stop_words(test_set)

tweet stopwords
0 #studiolife #aislife #requires #passion #dedic... 1
1 @user #white #supremacists want everyone to s... 4
2 safe ways to heal your #acne!! #altwaystohe... 2
3 is the hp and the cursed child book up for res... 8
4 3rd #bihday to my amazing, hilarious #nephew... 4

Number of hashtags

In [167]: def hash_tags(df):
    df['hashtags'] = df['tweet'].apply(lambda x: len([w for w in x.split() if w.startswith('#')]))
    print(df[['tweet', 'hashtags']].head())

In [168]: hash_tags(train_set)

tweet hashtags
0 @user when a father is dysfunctional and is s... 1
1 @user @user thanks for #lyft credit i can't us... 3
2 bihday your majesty 0
3 #model i love u take with u all the time in ... 1
4 factsguide: society now #motivation 1

In [169]: hash_tags(test_set)

tweet hashtags
0 #studiolife #aislife #requires #passion #dedic... 7
1 @user #white #supremacists want everyone to s... 4
2 safe ways to heal your #acne!! #altwaystohe... 4
3 is the hp and the cursed child book up for res... 3
4 3rd #bihday to my amazing, hilarious #nephew... 2

Number of numerics

In [170]: def num_numerics(df):
    df['numerics'] = df['tweet'].apply(lambda x: len([w for w in x.split() if w.isdigit()]))
    print(df[['tweet', 'numerics']].head())

In [171]: num_numerics(train_set)

tweet numerics
0 @user when a father is dysfunctional and is s... 0
1 @user @user thanks for #lyft credit i can't us... 0
2 bihday your majesty 0
3 #model i love u take with u all the time in ... 0
4 factsguide: society now #motivation 0
In [172]: num_numerics(test_set)

tweet numerics
0 #studiolife #aislife #requires #passion #dedic... 0
1 @user #white #supremacists want everyone to s... 0
2 safe ways to heal your #acne!! #altwaystohe... 0
3 is the hp and the cursed child book up for res... 0
4 3rd #bihday to my amazing, hilarious #nephew... 0


Clean and Process Dataset

In [173]: #convert upper case to lower case

In [174]: train_set["tweet"] = train_set["tweet"].apply(lambda x: " ".join(x.lower() for x in x.split()))
test_set["tweet"] = test_set["tweet"].apply(lambda x: " ".join(x.lower() for x in x.split()))

In [175]: print(colored("\nCONVERTED TO LOWER CASE SUCCESSFULLY...", color = "green", attrs = ["dark", "bold"]))

CONVERTED TO LOWER CASE SUCCESSFULLY...

In [176]: #delete punctuation and digits

In [177]: # raw strings + regex=True keep these patterns working on current pandas
train_set["tweet"] = train_set["tweet"].str.replace(r'[^\w\s]', '', regex=True)
test_set["tweet"] = test_set["tweet"].str.replace(r'[^\w\s]', '', regex=True)
train_set['tweet'] = train_set['tweet'].str.replace(r'\d', '', regex=True)
test_set['tweet'] = test_set['tweet'].str.replace(r'\d', '', regex=True)

In [178]: #delete stopwords from tweets

In [179]: sw = stopwords.words("english")
train_set['tweet'] = train_set['tweet'].apply(lambda x: " ".join(w for w in x.split() if w not in sw))
test_set['tweet'] = test_set['tweet'].apply(lambda x: " ".join(w for w in x.split() if w not in sw))
print(colored("\nSTOPWORDS DELETED SUCCESSFULLY...", color = "green", attrs = ["dark", "bold"]))

STOPWORDS DELETED SUCCESSFULLY...

In [180]: train_set = train_set.drop("id", axis=1)
test_set = test_set.drop("id", axis=1)
print(colored("\n 'ID' Columns Dropped Successfully", color="green", attrs=["dark", "bold"]))

'ID' Columns Dropped Successfully

CountVectorization

CountVectorizer is a scikit-learn class that takes any collection of text documents and returns each unique word as a feature, together with the number of times that word occurs.

In [181]: corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())   # on scikit-learn >= 1.0, use get_feature_names_out()

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [182]: ​
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]

In [183]: vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names())

['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']

In [184]: print(X2.toarray())

[[0 0 1 1 0 0 1 0 0 0 0 1 0]
[0 1 0 1 0 1 0 1 0 0 1 0 0]
[1 0 0 1 0 0 0 0 1 1 0 1 0]
[0 0 1 0 1 0 1 0 0 0 0 0 1]]
Hashing Vectorizer

HashingVectorizer converts a collection of text documents to a matrix of token occurrences, using the hashing trick to map words to feature indices instead of storing a vocabulary.

In [185]: corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = HashingVectorizer(n_features=2**4)
X = vectorizer.fit_transform(corpus)
print(X.shape)

(4, 16)

Lower Casing

Another pre-processing step is to transform our tweets into lower case. This avoids having multiple copies of the same word: while calculating word counts, 'Lower' and 'lower' would otherwise be counted as different words.

In [186]: def lower_case(df):
    df['tweet'] = df['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
    print(df['tweet'].head())

In [187]: lower_case(train_set)

0 user father dysfunctional selfish drags kids d...


1 user user thanks lyft credit cant use cause do...
2 bihday majesty
3 model love u take u time urð ðððð ððð
4 factsguide society motivation
Name: tweet, dtype: object

In [188]: lower_case(test_set)

0 studiolife aislife requires passion dedication...


1 user white supremacists want everyone see new ...
2 safe ways heal acne altwaystoheal healthy healing
3 hp cursed child book reservations already yes ...
4 rd bihday amazing hilarious nephew eli ahmir u...
Name: tweet, dtype: object

Frequent Words Removal

In [189]: freq = pd.Series(' '.join(train_set['tweet']).split()).value_counts()[:10]
freq

Out[189]: user 17473


love 2648
ð 2516
day 2230
â 1867
happy 1663
amp 1588
u 1141
im 1139
time 1110
dtype: int64

In [190]: freq=list(freq.index)

In [191]: def frequent_words_removal(df):
    df['tweet'] = df['tweet'].apply(lambda x: " ".join(w for w in x.split() if w not in freq))
    print(df['tweet'].head())

In [192]: frequent_words_removal(train_set)

0 father dysfunctional selfish drags kids dysfun...


1 thanks lyft credit cant use cause dont offer w...
2 bihday majesty
3 model take urð ðððð ððð
4 factsguide society motivation
Name: tweet, dtype: object

In [193]: frequent_words_removal(test_set)

0 studiolife aislife requires passion dedication...


1 white supremacists want everyone see new birds...
2 safe ways heal acne altwaystoheal healthy healing
3 hp cursed child book reservations already yes ...
4 rd bihday amazing hilarious nephew eli ahmir u...
Name: tweet, dtype: object

In [194]: #Rare words removal

In [195]: freq = pd.Series(' '.join(train_set['tweet']).split()).value_counts()[-10:]
freq

Out[195]: socalled 1
haleððââðð 1
becauseyouturnedintoarat 1
cryingforever 1
anitgay 1
threads 1
destroyingpotential 1
onlyrelatives 1
myfamilysucks 1
chisolm 1
dtype: int64

In [196]: freq = list(freq.index)

In [197]: def rare_words_removal(df):
    df['tweet'] = df['tweet'].apply(lambda x: " ".join(w for w in x.split() if w not in freq))
    print(df['tweet'].head())

In [198]: rare_words_removal(train_set)

0 father dysfunctional selfish drags kids dysfun...


1 thanks lyft credit cant use cause dont offer w...
2 bihday majesty
3 model take urð ðððð ððð
4 factsguide society motivation
Name: tweet, dtype: object

In [199]: rare_words_removal(test_set)

0 studiolife aislife requires passion dedication...


1 white supremacists want everyone see new birds...
2 safe ways heal acne altwaystoheal healthy healing
3 hp cursed child book reservations already yes ...
4 rd bihday amazing hilarious nephew eli ahmir u...
Name: tweet, dtype: object

Spelling Correction

In [200]: def spell_correction(df):
    # TextBlob's correct() is slow, so preview the first five tweets only
    return df['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

In [201]: spell_correction(train_set)

Out[201]: 0 father dysfunctional selfish drags kiss dysfun...


1 thanks left credit can use cause dont offer wh...
2 midday majesty
3 model take or ðððð ððð
4 factsguide society motivation
Name: tweet, dtype: object

In [202]: spell_correction(test_set)

Out[202]: 0 studiolife dislike requires passion education ...


1 white supremacists want everyone see new birds...
2 safe ways heal acne altwaystoheal healthy healing
3 he cursed child book reservations already yes ...
4 rd midday amazing hilarious nephew epi their u...
Name: tweet, dtype: object

Tokenization

In [203]: def tokens(df):
    return TextBlob(df['tweet'][1]).words   # word tokens of the second tweet, as an example

In [204]: tokens(train_set)

Out[204]: WordList(['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'])

In [205]: tokens(test_set)

Out[205]: WordList(['white', 'supremacists', 'want', 'everyone', 'see', 'new', 'birdsâ', 'movie', 'hereâs'])

Stemming

In [206]: st = PorterStemmer()

In [207]: def stemming(df):
    return df['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

In [208]: stemming(train_set)

Out[208]: 0 father dysfunct selfish drag kid dysfunct run


1 thank lyft credit cant use caus dont offer whe...
2 bihday majesti
3 model take urð ðððð ððð
4 factsguid societi motiv
Name: tweet, dtype: object

In [209]: stemming(test_set)

Out[209]: 0 studiolif aislif requir passion dedic willpow ...


1 white supremacist want everyon see new birdsâ ...
2 safe way heal acn altwaystoh healthi heal
3 hp curs child book reserv alreadi ye ððð harry...
4 rd bihday amaz hilari nephew eli ahmir uncl da...
Name: tweet, dtype: object

In [210]: #Lemmatization
#Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.
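A quick side-by-side makes that difference concrete; a minimal sketch using the PorterStemmer and TextBlob Word classes already imported above:

st = PorterStemmer()
print(st.stem("studies"))            # 'studi'  -> suffix chopped off, not a real word
print(Word("studies").lemmatize())   # 'study'  -> valid dictionary base form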

In [211]: def lemmatization(df):
    df['tweet'] = df['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
    print(df['tweet'].head())

In [212]: lemmatization(train_set)

0 father dysfunctional selfish drag kid dysfunct...


1 thanks lyft credit cant use cause dont offer w...
2 bihday majesty
3 model take urð ðððð ððð
4 factsguide society motivation
Name: tweet, dtype: object

In [213]: lemmatization(test_set)

0 studiolife aislife requires passion dedication...


1 white supremacist want everyone see new birdsâ...
2 safe way heal acne altwaystoheal healthy healing
3 hp cursed child book reservation already yes ð...
4 rd bihday amazing hilarious nephew eli ahmir u...
Name: tweet, dtype: object

N-Grams

N-grams are combinations of multiple words used together. N-grams with N=1 are called unigrams; similarly, bigrams (N=2), trigrams (N=3), and so on. Unigrams do not usually contain as much information as bigrams and trigrams. The basic principle behind n-grams is that they capture the language structure, like what letter or word is likely to follow the given one. The longer the n-gram (the higher the N), the more context you have to work with. The optimum length really depends on the application: if your n-grams are too short, you may fail to capture important differences; if they are too long, you may fail to capture the "general knowledge" and only stick to particular cases.

In [214]: def combination_of_words(df):
    return TextBlob(df['tweet'][0]).ngrams(2)   # bigrams of the first tweet

In [215]: combination_of_words(train_set)

Out[215]: [WordList(['father', 'dysfunctional']),


WordList(['dysfunctional', 'selfish']),
WordList(['selfish', 'drag']),
WordList(['drag', 'kid']),
WordList(['kid', 'dysfunction']),
WordList(['dysfunction', 'run'])]

In [216]: combination_of_words(test_set)

Out[216]: [WordList(['studiolife', 'aislife']),


WordList(['aislife', 'requires']),
WordList(['requires', 'passion']),
WordList(['passion', 'dedication']),
WordList(['dedication', 'willpower']),
WordList(['willpower', 'find']),
WordList(['find', 'newmaterialsâ'])]

Term Frequency

Term frequency is simply the ratio of the count of a word present in a sentence, to the length of the sentence.
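As a minimal sketch of that ratio (note that the helper in the next cell tallies raw counts rather than dividing by the sentence length; pandas is already imported as pd above):

def tf_ratio(sentence):
    words = sentence.split()
    counts = pd.Series(words).value_counts()
    return counts / len(words)   # count(word) / len(sentence)

print(tf_ratio("bihday majesty bihday"))
# bihday     0.666667
# majesty    0.333333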

In [217]: def term_frequency(df):
    # tally raw word counts for the second tweet (see the ratio sketch above)
    tf1 = (df['tweet'][1:2]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()
    tf1.columns = ['words', 'tf']
    return tf1.head()

In [218]: term_frequency(train_set)

Out[218]:
words tf

0 thanks 1

1 lyft 1

2 credit 1

3 cant 1

4 use 1

In [219]: term_frequency(test_set)

Out[219]:
words tf

0 white 1

1 supremacist 1

2 want 1

3 everyone 1

4 see 1

Bag of Words

Bag of Words (BoW) refers to a representation of text that describes the presence of words within the text data. The intuition behind this is that two similar text fields will contain similar words, and will therefore have similar bags of words. Further, from the text alone we can learn something about the meaning of the document.

In [220]: bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1), analyzer = "word")
train_bow = bow.fit_transform(train_set['tweet'])
train_bow

Out[220]: <31962x1000 sparse matrix of type '<class 'numpy.int64'>'


with 128663 stored elements in Compressed Sparse Row format>

Sentiment Analysis

In [221]: def polarity_subjectivity(df):
    # TextBlob's .sentiment returns a (polarity, subjectivity) tuple
    return df['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)

In [222]: polarity_subjectivity(train_set)

Out[222]: 0 (-0.3, 0.5354166666666667)


1 (0.2, 0.2)
2 (0.0, 0.0)
3 (0.0, 0.0)
4 (0.0, 0.0)
Name: tweet, dtype: object

In [223]: polarity_subjectivity(test_set)

Out[223]: 0 (0.0, 0.0)


1 (0.06818181818181818, 0.22727272727272727)
2 (0.5, 0.5)
3 (0.5, 1.0)
4 (0.5333333333333333, 0.8333333333333334)
Name: tweet, dtype: object

In [224]: def sentiment_analysis(df):
    df['sentiment'] = df['tweet'].apply(lambda x: TextBlob(x).sentiment[0])   # polarity only
    return df[['tweet', 'sentiment']].head()

In [225]: sentiment_analysis(train_set)

Out[225]:
tweet sentiment

0 father dysfunctional selfish drag kid dysfunct... -0.3

1 thanks lyft credit cant use cause dont offer w... 0.2

2 bihday majesty 0.0

3 model take urð ðððð ððð 0.0

4 factsguide society motivation 0.0


In [226]: sentiment_analysis(test_set)

Out[226]:
tweet sentiment

0 studiolife aislife requires passion dedication... 0.000000

1 white supremacist want everyone see new birdsâ... 0.068182

2 safe way heal acne altwaystoheal healthy healing 0.500000

3 hp cursed child book reservation already yes ð... 0.500000

4 rd bihday amazing hilarious nephew eli ahmir u... 0.533333

In [227]: #latest state of the dataset

In [228]: train_set.head(n=10)

Out[228]:
label tweet word_count char_count stopwords hashtags numerics sentiment

0 0 father dysfunctional selfish drag kid dysfunct... 21 102 10 1 0 -0.3

1 0 thanks lyft credit cant use cause dont offer w... 22 122 5 3 0 0.2

2 0 bihday majesty 5 21 1 0 0 0.0

3 0 model take urð ðððð ððð 17 86 5 1 0 0.0

4 0 factsguide society motivation 8 39 1 1 0 0.0

5 0 huge fan fare big talking leave chaos pay disp... 21 116 6 1 0 0.2

6 0 camping tomorrow dannyâ 12 74 0 0 0 0.0

7 0 next school year year examsð cant think school... 23 143 6 7 0 -0.4

8 0 land allin cavs champion cleveland clevelandca... 13 87 2 5 0 0.0

9 0 welcome gr 15 50 3 1 0 0.8

In [229]: #Divide dataset into train and test splits
x = train_set['tweet']
y = train_set['label']
train_x, test_x, train_y, test_y = model_selection.train_test_split(x, y, test_size = 0.20, shuffle = True, random_state = 11)
print(colored("\nDIVIDED SUCCESSFULLY...", color = "green", attrs = ["dark", "bold"]))

DIVIDED SUCCESSFULLY...

VECTORIZE DATA

Word embeddings, or word vectorization, is a methodology in NLP for mapping words or phrases from a vocabulary to corresponding vectors of real numbers, which are used to find word predictions and word similarities/semantics.
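TfidfVectorizer was imported at the top of the notebook but never used; as a sketch of an alternative to the count-vectors method below, TF-IDF weights each count by how rare the word is across tweets. max_features=5000 is an illustrative cap, not a value from this notebook:

tfidf = TfidfVectorizer(max_features=5000)     # illustrative vocabulary cap
x_train_tfidf = tfidf.fit_transform(train_x)   # learn the vocabulary on training tweets only
x_test_tfidf = tfidf.transform(test_x)         # reuse it on the held-out tweets
print(x_train_tfidf.shape)                     # expected (25569, 5000)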

"Count Vectors" method

In [230]: vectorizer = CountVectorizer()
vectorizer.fit(train_x)

x_train_count = vectorizer.transform(train_x)
x_test_count = vectorizer.transform(test_x)

# Keep these matrices sparse: calling x_train_count.toarray() tries to allocate
# a dense (25569, 35355) int64 array (~6.74 GiB) and raises a MemoryError.
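A quick back-of-the-envelope check of that figure, using the matrix shape and dtype reported by the failed dense conversion:

rows, cols = 25569, 35355          # shape of the train count matrix
bytes_needed = rows * cols * 8     # 8 bytes per int64 entry
print(bytes_needed / 2**30)        # ~6.74 (GiB), matching the MemoryError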

Logistic Regression Model


In [ ]: log = linear_model.LogisticRegression()
log_model = log.fit(x_train_count, train_y)
accuracy = model_selection.cross_val_score(log_model,
                                           x_test_count,
                                           test_y,
                                           cv = 20).mean()

print(colored("\nLogistic regression model with 'count-vectors' method", color = "red", attrs = ["dark", "bold"]))
print(colored("Accuracy ratio: ", color = "black", attrs = ["dark", "bold"]), accuracy)

XGBoost model with "count-vectors" method

In [ ]: xgb = XGBClassifier()
xgb_model = xgb.fit(x_train_count, train_y)
accuracy = model_selection.cross_val_score(xgb_model,
                                           x_test_count,
                                           test_y,
                                           cv = 20).mean()

print(colored("\nXGBoost model with 'count-vectors' method", color = "red", attrs = ["dark", "bold"]))
print(colored("Accuracy ratio: ", color = "red", attrs = ["dark", "bold"]), accuracy)

Visualization with word cloud

In [ ]: #tw_mask = np.array(Image.open('../input/masksforwordclouds/twitter_mask3.jpg'))

text = " ".join(i for i in train_set.tweet)

wc = WordCloud(background_color = "white",
width = 600,
height = 600,
contour_width = 0,
contour_color = "red",
max_words = 1000,
scale = 1,
collocations = False,
repeat = True,
min_font_size = 1)

wc.generate(text)

plt.figure(figsize = [15, 15])
plt.imshow(wc)
plt.axis("off")
plt.show()

Thank You
