0% found this document useful (0 votes)

54 views13 pages

Twitter Sentiment Analysis

The document discusses sentiment analysis techniques for classifying tweets as containing racist or sexist sentiment. It introduces sentiment analysis as a branch of natural language processing used to determine subjective information like sentiment, emotions and attitudes from text. The primary goal of sentiment analysis is to automatically classify the sentiment of a document like a tweet into categories like positive, negative or neutral. Sentiment analysis techniques leverage machine learning algorithms, lexicon-based methods and rule-based systems to examine patterns in text and determine sentiment orientation.

Uploaded by

amnwq

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

54 views13 pages

Twitter Sentiment Analysis

Uploaded by

amnwq

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

TWITTER SENTIMENT ANALYSIS

Introduction
Detecting hate speech in tweets involves classifying tweets as either containing racist or sexist sentiment or not. To accomplish this using Python libraries, we can employ NLP techniques and machine learning algorithms. By analyzing the text content and applying sentiment analysis models, we can train a classifier to distinguish between tweets with hate speech
and those without.

Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a branch of natural language processing (NLP) that involves the use of computational techniques to determine and extract subjective information from text data. It aims to analyze and understand the sentiment, emotions, attitudes, and opinions expressed within a given piece of text.

The primary goal of sentiment analysis is to automatically classify the sentiment of a text document, such as a tweet, review, or customer feedback, into different categories, typically positive, negative, or neutral. However, sentiment analysis can also include more fine-grained sentiment classifications, such as very positive, positive, neutral, negative, and very
negative.

Sentiment analysis techniques leverage various approaches, including machine learning algorithms, lexicon-based methods, and rule-based systems. These techniques process text data by examining patterns, semantic structures, linguistic features, and context to determine the sentiment orientation.

IMPORT NECESSARRY LIBRARIES

In [135]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [136]: from sklearn import model_selection, preprocessing, linear_model, metrics

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn import ensemble
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from nltk.corpus import stopwords
from textblob import TextBlob
from textblob import Word
import nltk
from nltk.stem import PorterStemmer

from textblob import TextBlob
from termcolor import colored
from warnings import filterwarnings
filterwarnings('ignore')
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn import set_config
set_config(print_changed_only = False)

In [137]: test_set = pd.read_csv(r"C:\Desktop\Data Analyst Project\Sentiment Analysis\test.csv", encoding = "utf-8",

engine = "python",
header = 0)
train_set = pd.read_csv(r"C:\Desktop\Data Analyst Project\Sentiment Analysis\train.csv", encoding = "utf-8",
engine = "python",
header = 0)

In [138]:
print(colored("\nDATASETS WERE SUCCESFULLY LOADED...", color = "orange", attrs = ["dark", "bold"]))

DATASETS WERE SUCCESFULLY LOADED...

first five rows from train data set

In [139]: train_set.head(n = 5).style.background_gradient(cmap = "PiYG")

Out[139]:
id label tweet

0 1 0 @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run

1 2 0 @user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked

2 3 0 bihday your majesty

3 4 0 #model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦

4 5 0 factsguide: society now #motivation

First Five rows from test dataset

In [140]: test_set.head(n=5).style.background_gradient(cmap='PiYG')

Out[140]:
id tweet

0 31963 #studiolife #aislife #requires #passion #dedication #willpower to find #newmaterialsâ¦

1 31964 @user #white #supremacists want everyone to see the new â #birdsâ #movie â and hereâs why

2 31965 safe ways to heal your #acne!! #altwaystoheal #healthy #healing!!

3 31966 is the hp and the cursed child book up for reservations already? if yes, where? if no, when? ððð #harrypotter #pottermore #favorite

4 31967 3rd #bihday to my amazing, hilarious #nephew eli ahmir! uncle dave loves you and missesâ¦

In [141]: #shape

In [142]: format(train_set.shape)

Out[142]: '(31962, 3)'

In [143]: #format(test_set.shape)

In [144]: train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 31962 non-null int64
1 label 31962 non-null int64
2 tweet 31962 non-null object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB

In [145]: test_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17197 entries, 0 to 17196
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 17197 non-null int64
1 tweet 17197 non-null object
dtypes: int64(1), object(1)
memory usage: 268.8+ KB

In [146]: train_set.groupby("label").count().style.background_gradient(cmap="autumn")

Out[146]:
id tweet

label

0 29720 29720

1 2242 2242

In [147]: train_set_len = train_set['tweet'].str.len()

test_set_len = test_set['tweet'].str.len()
print("train data length :" , train_set_len)
print("test data length :" , test_set_len)

train data length : 0 102

1 122
2 21
3 86
4 39
...
31957 68
31958 131
31959 63
31960 67
31961 32
Name: tweet, Length: 31962, dtype: int64
test data length : 0 90
1 101
2 71
3 142
4 93
...
17192 108
17193 96
17194 145
17195 104
17196 64
Name: tweet, Length: 17197, dtype: int64

In [148]: pos = 100*len(train_set.loc[train_set['label']==0, 'label'])/len(train_set['label'])

neg=100*len(train_set.loc[train_set['label']==1, 'label'])/len(train_set['label'])
In [149]: print(f'Percentage of Negative Sentiment tweets is {pos}')
print(f'Percentage of Postitive Sentiment tweets is {neg}')
print('\nClearly, herre we can see the data ')

Percentage of Negative Sentiment tweets is 92.98542018647143

Percentage of Postitive Sentiment tweets is 7.014579813528565

Clearly, herre we can see the data

data Exploration

In [150]: plt.hist(train_set_len, bins=22, label ='train_tweet')

plt.hist(test_set_len, bins=22, label = 'train_tweet')
plt.legend()
plt.show()

In [151]: sns.countplot(data=train_set, x='label', hue='label')

plt.title('Types of comments : 0 - > Non Rasict/Sexist , 1 - > Rasict/Sexist')
plt.xlabel('Tweets')
plt.show()
In [152]: clength_train = train_set['tweet'].str.len().plot.hist(color = 'blue', figsize = (6, 4))
length_test = test_set['tweet'].str.len().plot.hist(color = 'pink', figsize = (6, 4))

In [153]: c=CountVectorizer(stop_words='english')
word=c.fit_transform(train_set.tweet)
summation=word.sum(axis=0)
print(summation)

[[ 51 28 2 ... 272 1 2]]

In [154]: freq=[(word,summation[0,i]) for word,i in c.vocabulary_.items()]

freq=sorted(freq,key=lambda x:x[1],reverse=True)
frequency = pd.DataFrame(freq, columns=['word', 'freq'])
print(frequency)

word freq
0 user 17577
1 love 2749
2 day 2311
3 amp 1776
4 happy 1686
... ... ...
41099 isz 1
41100 airwaves 1
41101 mantle 1
41102 shirley 1
41103 chisolm 1

[41104 rows x 2 columns]

In [155]: #most frequentlyy used words
df=frequency.head(20).plot(x='word', y='freq', kind='bar', figsize=(15, 7), color = 'green')
plt.title("20 most frequently used words in twitter")
plt.show()

In [156]: #Count number of words

def num_of_words(df):
df['word_count'] = df['tweet'].apply(lambda x : len(str(x).split(" ")))
print(df[['tweet','word_count']].head())

In [157]: num_of_words(train_set)

tweet word_count
0 @user when a father is dysfunctional and is s... 21
1 @user @user thanks for #lyft credit i can't us... 22
2 bihday your majesty 5
3 #model i love u take with u all the time in ... 17
4 factsguide: society now #motivation 8

In [158]: num_of_words(test_set)

tweet word_count
0 #studiolife #aislife #requires #passion #dedic... 12
1 @user #white #supremacists want everyone to s... 20
2 safe ways to heal your #acne!! #altwaystohe... 15
3 is the hp and the cursed child book up for res... 24
4 3rd #bihday to my amazing, hilarious #nephew... 18

In [159]: #Count number of characters

def num_of_chars(train_set):
train_set['char_count'] = train_set['tweet'].str.len() ## this also includes spaces
print(train_set[['tweet','char_count']].head())

In [160]: num_of_chars(train_set)

tweet char_count
0 @user when a father is dysfunctional and is s... 102
1 @user @user thanks for #lyft credit i can't us... 122
2 bihday your majesty 21
3 #model i love u take with u all the time in ... 86
4 factsguide: society now #motivation 39

In [161]: num_of_chars(test_set)

tweet char_count
0 #studiolife #aislife #requires #passion #dedic... 90
1 @user #white #supremacists want everyone to s... 101
2 safe ways to heal your #acne!! #altwaystohe... 71
3 is the hp and the cursed child book up for res... 142
4 3rd #bihday to my amazing, hilarious #nephew... 93
In [162]: #Number of stopwords
set(stopwords.words('english'))
once ,
'only',
'or',
'other',
'our',
'ours',
'ourselves',
'out',
'over',
'own',
're',
's',
'same',
'shan',
"shan't",
'she',
"she's",
'should',
"should've",
'shouldn',
" h ld 't"
In [163]: stop = stopwords.words('english')

In [164]: def stop_words(df):

df['stopwords'] = df['tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))
print(df[['tweet','stopwords']].head())

In [165]: stop_words(train_set)

tweet stopwords
0 @user when a father is dysfunctional and is s... 10
1 @user @user thanks for #lyft credit i can't us... 5
2 bihday your majesty 1
3 #model i love u take with u all the time in ... 5
4 factsguide: society now #motivation 1

In [166]: stop_words(test_set)

tweet stopwords
0 #studiolife #aislife #requires #passion #dedic... 1
1 @user #white #supremacists want everyone to s... 4
2 safe ways to heal your #acne!! #altwaystohe... 2
3 is the hp and the cursed child book up for res... 8
4 3rd #bihday to my amazing, hilarious #nephew... 4

Number of special characters

In [167]: def hash_tags(df) :

df['hashtags'] = df['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
print(df[['tweet', 'hashtags']].head())

In [168]: hash_tags(train_set)

tweet hashtags
0 @user when a father is dysfunctional and is s... 1
1 @user @user thanks for #lyft credit i can't us... 3
2 bihday your majesty 0
3 #model i love u take with u all the time in ... 1
4 factsguide: society now #motivation 1

In [169]: hash_tags(test_set)

tweet hashtags
0 #studiolife #aislife #requires #passion #dedic... 7
1 @user #white #supremacists want everyone to s... 4
2 safe ways to heal your #acne!! #altwaystohe... 4
3 is the hp and the cursed child book up for res... 3
4 3rd #bihday to my amazing, hilarious #nephew... 2

number of numerics

In [170]: def num_numerics(df):

df['numerics'] = df['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
print(df[['tweet', 'numerics']].head())

In [171]: num_numerics(train_set)

tweet numerics
0 @user when a father is dysfunctional and is s... 0
1 @user @user thanks for #lyft credit i can't us... 0
2 bihday your majesty 0
3 #model i love u take with u all the time in ... 0
4 factsguide: society now #motivation 0
In [172]: num_numerics(test_set)

tweet numerics
0 #studiolife #aislife #requires #passion #dedic... 0
1 @user #white #supremacists want everyone to s... 0
2 safe ways to heal your #acne!! #altwaystohe... 0
3 is the hp and the cursed child book up for res... 0
4 3rd #bihday to my amazing, hilarious #nephew... 0

In [ ]:

Clean and Process Dataset

In [173]: #convert upper case to lower case

In [174]: train_set["tweet"]= train_set["tweet"].apply(lambda x: " ". join (x.lower() for x in x.split()))

test_set["tweet"] = test_set["tweet"].apply(lambda x: " ". join (x.lower() for x in x.split()))

In [175]: print(colored("\nDELETED SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))

DELETED SUCCESFULLY...

In [176]: #delete punctuations

In [177]: train_set["tweet"]=train_set["tweet"].str.replace('[^\w\s]', '')

test_set["tweet"]=test_set["tweet"].str.replace('[^\w\s]','')
train_set['tweet'] = train_set['tweet'].str.replace('\d','')
test_set['tweet'] = test_set['tweet'].str.replace('\d', '')

In [178]: #delete stopwords from tweet

In [179]:
sw = stopwords.words("english")
train_set['tweet'] = train_set['tweet'].apply(lambda x: " ".join (x for x in x.split() if x not in sw))
test_set['tweet'] = test_set['tweet'].apply(lambda x: " ".join (x for x in x.split() if x not in sw))
print(colored("\nSTOPWORDS DELETED SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))

STOPWORDS DELETED SUCCESFULLY...

In [180]: train_set = train_set.drop("id", axis=1)

test_set = test_set.drop("id", axis=1)
print(colored("\n 'ID' Columns Dropped Successfully", color="green", attrs=["dark", "bold"]))

'ID' Columns Dropped Successfully

CountVectorization

CounterVectorization is a SciKitLearn library takes any text document and returns each unique word as a feature with the count of number of times that word occurs

In [181]: corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [182]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]

In [183]: vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))

X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names())

['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']

In [184]: print(X2.toarray())

[[0 0 1 1 0 0 1 0 0 0 0 1 0]
[0 1 0 1 0 1 0 1 0 0 1 0 0]
[1 0 0 1 0 0 0 0 1 1 0 1 0]
[0 0 1 0 1 0 1 0 0 0 0 0 1]]
Hashing Vectorizer

Hashing Vectorizer converts text to a matrix of occurrences using the hashing trick it converts a collection of text documents to a matrix of token occurrences.

In [185]: corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = HashingVectorizer(n_features=2**4)
X = vectorizer.fit_transform(corpus)
print(X.shape)

(4, 16)

Lower Casing

Another pre-processing step which we will do is to transform our tweets into lower case. This avoids having multiple copies of the same words. For example, while calculating the word count, ‘Lower’ and ‘lower’ will be taken as different words.

In [186]: def lower_case(df):

df['tweet'] = df['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
print(df['tweet'].head())

In [187]: lower_case(train_set)

0 user father dysfunctional selfish drags kids d...

1 user user thanks lyft credit cant use cause do...
2 bihday majesty
3 model love u take u time urð ðððð ððð
4 factsguide society motivation
Name: tweet, dtype: object

In [188]: lower_case(test_set)

0 studiolife aislife requires passion dedication...

1 user white supremacists want everyone see new ...
2 safe ways heal acne altwaystoheal healthy healing
3 hp cursed child book reservations already yes ...
4 rd bihday amazing hilarious nephew eli ahmir u...
Name: tweet, dtype: object

frequent words removal

In [189]: freq = pd.Series(' '.join(train_set['tweet']).split()).value_counts()[:10]

freq

Out[189]: user 17473

love 2648
ð 2516
day 2230
â 1867
happy 1663
amp 1588
u 1141
im 1139
time 1110
dtype: int64

In [190]: freq=list(freq.index)

In [191]: def frequent_words_removal(df):

df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
print(df['tweet'].head())

In [192]: frequent_words_removal(train_set)

0 father dysfunctional selfish drags kids dysfun...

1 thanks lyft credit cant use cause dont offer w...
2 bihday majesty
3 model take urð ðððð ððð
4 factsguide society motivation
Name: tweet, dtype: object

In [193]: frequent_words_removal(test_set)

0 studiolife aislife requires passion dedication...

1 white supremacists want everyone see new birds...
2 safe ways heal acne altwaystoheal healthy healing
3 hp cursed child book reservations already yes ...
4 rd bihday amazing hilarious nephew eli ahmir u...
Name: tweet, dtype: object

In [194]: #Rare words removal

In [195]: freq= pd.Series(' '.join(train_set['tweet']).split()).value_counts()[-10:]
freq

Out[195]: socalled 1
haleððââðð 1
becauseyouturnedintoarat 1
cryingforever 1
anitgay 1
threads 1
destroyingpotential 1
onlyrelatives 1
myfamilysucks 1
chisolm 1
dtype: int64

In [196]: freq = list(freq.index)

In [197]: def rare_words_removal(df):

df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
print(df['tweet'].head())

In [198]: rare_words_removal(train_set)

0 father dysfunctional selfish drags kids dysfun...

1 thanks lyft credit cant use cause dont offer w...
2 bihday majesty
3 model take urð ðððð ððð
4 factsguide society motivation
Name: tweet, dtype: object

In [199]: rare_words_removal(test_set)

0 studiolife aislife requires passion dedication...

Spelling Correction

In [200]: def spell_correction(df):

return df['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

In [201]: spell_correction(train_set)

Out[201]: 0 father dysfunctional selfish drags kiss dysfun...

1 thanks left credit can use cause dont offer wh...
2 midday majesty
3 model take or ðððð ððð
4 factsguide society motivation
Name: tweet, dtype: object

In [202]: spell_correction(test_set)

Out[202]: 0 studiolife dislike requires passion education ...

1 white supremacists want everyone see new birds...
2 safe ways heal acne altwaystoheal healthy healing
3 he cursed child book reservations already yes ...
4 rd midday amazing hilarious nephew epi their u...
Name: tweet, dtype: object

Tokenization

In [203]: def tokens(df):

return TextBlob(df['tweet'][1]).words

In [204]: tokens(train_set)

Out[204]: WordList(['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'])

In [205]: tokens(test_set
)

Out[205]: WordList(['white', 'supremacists', 'want', 'everyone', 'see', 'new', 'birdsâ', 'movie', 'hereâs'])

Stemming

In [206]: st = PorterStemmer()

In [207]: def stemming(df):

return df['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
In [208]: stemming(train_set)

Out[208]: 0 father dysfunct selfish drag kid dysfunct run

1 thank lyft credit cant use caus dont offer whe...
2 bihday majesti
3 model take urð ðððð ððð
4 factsguid societi motiv
Name: tweet, dtype: object

In [209]: stemming(test_set)

Out[209]: 0 studiolif aislif requir passion dedic willpow ...

1 white supremacist want everyon see new birdsâ ...
2 safe way heal acn altwaystoh healthi heal
3 hp curs child book reserv alreadi ye ððð harry...
4 rd bihday amaz hilari nephew eli ahmir uncl da...
Name: tweet, dtype: object

In [210]: #Lemmatization
#Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is, lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect me

In [211]: def lemmatization(df):

df['tweet'] = df['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
print(df['tweet'].head())

In [212]: lemmatization(train_set)

0 father dysfunctional selfish drag kid dysfunct...

1 thanks lyft credit cant use cause dont offer w...
2 bihday majesty
3 model take urð ðððð ððð
4 factsguide society motivation
Name: tweet, dtype: object

In [213]: lemmatization(test_set)

0 studiolife aislife requires passion dedication...

1 white supremacist want everyone see new birdsâ...
2 safe way heal acne altwaystoheal healthy healing
3 hp cursed child book reservation already yes ð...
4 rd bihday amazing hilarious nephew eli ahmir u...
Name: tweet, dtype: object

N-Grams

N-grams are the combination of multiple words used together. Ngrams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on. Unigrams do not usually contain as much information as compared to bigrams and trigrams. The basic principle behind n-grams is that they capture the language structure, like what letter or word is likely to
follow the given one. The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.

In [214]: def combination_of_words(df):

return (TextBlob(df['tweet'][0]).ngrams(2))

In [215]: combination_of_words(train_set)

Out[215]: [WordList(['father', 'dysfunctional']),

WordList(['dysfunctional', 'selfish']),
WordList(['selfish', 'drag']),
WordList(['drag', 'kid']),
WordList(['kid', 'dysfunction']),
WordList(['dysfunction', 'run'])]

In [216]: combination_of_words(test_set)

Out[216]: [WordList(['studiolife', 'aislife']),

WordList(['aislife', 'requires']),
WordList(['requires', 'passion']),
WordList(['passion', 'dedication']),
WordList(['dedication', 'willpower']),
WordList(['willpower', 'find']),
WordList(['find', 'newmaterialsâ'])]

Term Frequent

Term frequency is simply the ratio of the count of a word present in a sentence, to the length of the sentence.

In [217]: def term_frequency(df):

tf1 = (df['tweet'][1:2]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
return tf1.head()
In [218]: term_frequency(train_set)

Out[218]:
words tf

0 thanks 1

1 lyft 1

2 credit 1

3 cant 1

4 use 1

In [219]: term_frequency(test_set)

Out[219]:
words tf

0 white 1

1 supremacist 1

2 want 1

3 everyone 1

4 see 1

Bag of words

In [ ]:

Bag of Words (BoW) refers to the representation of text which describes the presence of words within the text data. The intuition behind this is that two similar text fields will contain similar kind of words, and will therefore have a similar bag of words. Further, that from the text alone we can learn something about the meaning of the document.

In [220]: bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = "word")

train_bow = bow.fit_transform(train_set['tweet'])
train_bow

Out[220]: <31962x1000 sparse matrix of type '<class 'numpy.int64'>'

with 128663 stored elements in Compressed Sparse Row format>

Sentiment Analysis

In [221]: def polarity_subjectivity(df):

return df['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)

In [222]: polarity_subjectivity(train_set)

Out[222]: 0 (-0.3, 0.5354166666666667)

1 (0.2, 0.2)
2 (0.0, 0.0)
3 (0.0, 0.0)
4 (0.0, 0.0)
Name: tweet, dtype: object

In [223]: polarity_subjectivity(test_set)

Out[223]: 0 (0.0, 0.0)

1 (0.06818181818181818, 0.22727272727272727)
2 (0.5, 0.5)
3 (0.5, 1.0)
4 (0.5333333333333333, 0.8333333333333334)
Name: tweet, dtype: object

In [224]: def sentiment_analysis(df):

df['sentiment'] = df['tweet'].apply(lambda x: TextBlob(x).sentiment[0] )
return df[['tweet','sentiment']].head()

In [225]: sentiment_analysis(train_set)

Out[225]:
tweet sentiment

0 father dysfunctional selfish drag kid dysfunct... -0.3

1 thanks lyft credit cant use cause dont offer w... 0.2

2 bihday majesty 0.0

3 model take urð ðððð ððð 0.0

4 factsguide society motivation 0.0

In [226]: sentiment_analysis(test_set)

Out[226]:
tweet sentiment

0 studiolife aislife requires passion dedication... 0.000000

1 white supremacist want everyone see new birdsâ... 0.068182

2 safe way heal acne altwaystoheal healthy healing 0.500000

3 hp cursed child book reservation already yes ð... 0.500000

4 rd bihday amazing hilarious nephew eli ahmir u... 0.533333

In [227]: #latest condition of dataset

In [228]: train_set.head(n=10)

Out[228]:
label tweet word_count char_count stopwords hashtags numerics sentiment

0 0 father dysfunctional selfish drag kid dysfunct... 21 102 10 1 0 -0.3

1 0 thanks lyft credit cant use cause dont offer w... 22 122 5 3 0 0.2

2 0 bihday majesty 5 21 1 0 0 0.0

3 0 model take urð ðððð ððð 17 86 5 1 0 0.0

4 0 factsguide society motivation 8 39 1 1 0 0.0

5 0 huge fan fare big talking leave chaos pay disp... 21 116 6 1 0 0.2

6 0 camping tomorrow dannyâ 12 74 0 0 0 0.0

7 0 next school year year examsð cant think school... 23 143 6 7 0 -0.4

8 0 land allin cavs champion cleveland clevelandca... 13 87 2 5 0 0.0

9 0 welcome gr 15 50 3 1 0 0.8

In [229]: #Divide Dataset

x = train_set['tweet']
y = train_set['label']
train_x, test_x, train_y, test_y = model_selection.train_test_split(x, y, test_size = 0.20, shuffle = True, random_state = 11)
print(colored("\nDIVIDED SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))

DIVIDED SUCCESFULLY...

VECTORIZE DATA

Word Embeddings or Word vectorization is a methodology in NLP to map words or phrases from vocabulary to a corresponding vector of real numbers which used to find word predictions, word similarities/semantics.

"Count Vectors" method

In [230]: vectorizer = CountVectorizer()

vectorizer.fit(train_x)

x_train_count = vectorizer.transform(train_x)
x_test_count = vectorizer.transform(test_x)

x_train_count.toarray()

---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_1616\618821171.py in <module>
5 x_test_count = vectorizer.transform(test_x)
6
----> 7 x_train_count.toarray()

~\anaconda3\lib\site-packages\scipy\sparse\_compressed.py in toarray(self, order, out)

1049 if out is None and order is None:
1050 order = self._swap('cf')[0]
-> 1051 out = self._process_toarray_args(order, out)
1052 if not (out.flags.c_contiguous or out.flags.f_contiguous):
1053 raise ValueError('Output array must be C or F contiguous')

~\anaconda3\lib\site-packages\scipy\sparse\_base.py in _process_toarray_args(self, order, out)

1296 return out
1297 else:
-> 1298 return np.zeros(self.shape, dtype=self.dtype, order=order)
1299
1300

MemoryError: Unable to allocate 6.74 GiB for an array with shape (25569, 35355) and data type int64

Logistic Regression Model

In [ ]: log = linear_model.LogisticRegression()
log_model = log.fit(x_train_count, train_y)
accuracy = model_selection.cross_val_score(log_model,
x_test_count,
test_y,
cv = 20).mean()

print(colored("\nLogistic regression model with 'count-vectors' method", color = "red", attrs = ["dark", "bold"]))
print(colored("Accuracy ratio: ", color = "black", attrs = ["dark", "bold"]), accuracy)

XGBoost model with "count-vectors" method

In [ ]: xgb = XGBClassifier()

xgb_model = xgb.fit(x_train_count,train_y)
accuracy = model_selection.cross_val_score(xgb_model,
x_test_count,
test_y,
cv = 20).mean()

print(colored("\nXGBoost model with 'count-vectors' method", color = "red", attrs = ["dark", "bold"]))
print(colored("Accuracy ratio: ", color = "red", attrs = ["dark", "bold"]), accuracy)

Visualization with word cloud

In [ ]: #tw_mask = np.array(Image.open('../input/masksforwordclouds/twitter_mask3.jpg'))

text = " ".join(i for i in train_set.tweet)

wc = WordCloud(background_color = "white",
width = 600,
height = 600,
contour_width = 0,
contour_color = "red",
max_words = 1000,
scale = 1,
collocations = False,
repeat = True,
min_font_size = 1)

wc.generate(text)

plt.figure(figsize = [15, 15])
plt.imshow(wc)
plt.axis("off")
plt.show()

Thank You

David E. Leary (Ed.) - Metaphors in The History of Psychology-Cambridge University Press (1990) PDF
100% (2)
David E. Leary (Ed.) - Metaphors in The History of Psychology-Cambridge University Press (1990) PDF
396 pages
INDEXReport Ayush
No ratings yet
INDEXReport Ayush
38 pages
Data Science Project
No ratings yet
Data Science Project
34 pages
Prototype 1
No ratings yet
Prototype 1
10 pages
NLP Transformer-Based Models Used For Sentiment Analysis
No ratings yet
NLP Transformer-Based Models Used For Sentiment Analysis
45 pages
Miniproject NLP
No ratings yet
Miniproject NLP
22 pages
Sample 1
No ratings yet
Sample 1
22 pages
Sentiment Analysis Using Machine Learning Algorithms
No ratings yet
Sentiment Analysis Using Machine Learning Algorithms
23 pages
NLP Transformer-Based Models Used For Sentiment Analysis: 1. BERT
No ratings yet
NLP Transformer-Based Models Used For Sentiment Analysis: 1. BERT
98 pages
ChatGPT Twitter Sentiment Analyzer
No ratings yet
ChatGPT Twitter Sentiment Analyzer
50 pages
Emotion Classification With DistilBERT
No ratings yet
Emotion Classification With DistilBERT
25 pages
C1 W1 Assignment
No ratings yet
C1 W1 Assignment
14 pages
C1 W2 Assignment
No ratings yet
C1 W2 Assignment
18 pages
Sentiment Analysis On User-Generated Tweets
No ratings yet
Sentiment Analysis On User-Generated Tweets
15 pages
IICS Cloud & PC Scenario Real Time Interview Questions
67% (3)
IICS Cloud & PC Scenario Real Time Interview Questions
4 pages
Document Dsbda Codes For Mini Project
No ratings yet
Document Dsbda Codes For Mini Project
9 pages
IC-RTETM Final Sentiment Analysis
No ratings yet
IC-RTETM Final Sentiment Analysis
13 pages
C1 W1 Assignment
No ratings yet
C1 W1 Assignment
16 pages
TWITTER SENTIMENT NLP Projectt
No ratings yet
TWITTER SENTIMENT NLP Projectt
19 pages
Sma Exp 09 Code Print
No ratings yet
Sma Exp 09 Code Print
5 pages
Part C Assignment No 2 Mini Project On Twitter 1
No ratings yet
Part C Assignment No 2 Mini Project On Twitter 1
9 pages
Sentiment Analysis
No ratings yet
Sentiment Analysis
5 pages
NLP Labsheet-2 Sentiment Analysis Using Naive Bayes Classifier
No ratings yet
NLP Labsheet-2 Sentiment Analysis Using Naive Bayes Classifier
15 pages
Sentiment Analysis
No ratings yet
Sentiment Analysis
3 pages
Mlds5 Code
No ratings yet
Mlds5 Code
7 pages
03 The-Different-Methods-Deal-Text-Data-Predictive-Python
No ratings yet
03 The-Different-Methods-Deal-Text-Data-Predictive-Python
16 pages
ML Projrct Article 2
No ratings yet
ML Projrct Article 2
6 pages
Hate Speech Detection Documentation With Code
No ratings yet
Hate Speech Detection Documentation With Code
4 pages
Twitter Sentiment Analysis Dss
No ratings yet
Twitter Sentiment Analysis Dss
14 pages
Hate Speech Detection
No ratings yet
Hate Speech Detection
6 pages
Sentiment Analysis Presentationnotes
No ratings yet
Sentiment Analysis Presentationnotes
4 pages
Assign 5 TT
No ratings yet
Assign 5 TT
13 pages
Tweet-Sentiment-Extraction - Exploratory Data Analysis
No ratings yet
Tweet-Sentiment-Extraction - Exploratory Data Analysis
11 pages
Cyberbullying Tweet Recognition Project 1677256740
No ratings yet
Cyberbullying Tweet Recognition Project 1677256740
17 pages
Hate Speech
No ratings yet
Hate Speech
4 pages
Ai Casestudy
No ratings yet
Ai Casestudy
16 pages
Sentiment Analysis Final Documentation Report
50% (2)
Sentiment Analysis Final Documentation Report
21 pages
Introduction
No ratings yet
Introduction
27 pages
Twitter Sentiment Analysis System
No ratings yet
Twitter Sentiment Analysis System
5 pages
NLP - (1) (1) .Ipynb - Colab
No ratings yet
NLP - (1) (1) .Ipynb - Colab
10 pages
Part C - Assignment No. 2 Mini-Project On Twitter
No ratings yet
Part C - Assignment No. 2 Mini-Project On Twitter
7 pages
Sentiment Analysis On Twitter Data Using Machine Learning Algorithms in Python
No ratings yet
Sentiment Analysis On Twitter Data Using Machine Learning Algorithms in Python
14 pages
AminaRahmanK DL Lab5
No ratings yet
AminaRahmanK DL Lab5
11 pages
Malignant Comments Classifier Project
No ratings yet
Malignant Comments Classifier Project
30 pages
Twiiter Sentiment Analysis
No ratings yet
Twiiter Sentiment Analysis
15 pages
Analyzing Social Media Data in Python Chapter2
No ratings yet
Analyzing Social Media Data in Python Chapter2
30 pages
Part C - Assignment No. 2 Mini-Project On Twitter
No ratings yet
Part C - Assignment No. 2 Mini-Project On Twitter
7 pages
Authors:: Apoorv Agarwal Boyi Xie Ilia Vovsha Owen Rambow Rebecca Passonneau
No ratings yet
Authors:: Apoorv Agarwal Boyi Xie Ilia Vovsha Owen Rambow Rebecca Passonneau
9 pages
Anand Institute of Higher Technology Department of Computer Science and Engineering ACADEMIC YEAR: 2018-19 Mini Project Report
No ratings yet
Anand Institute of Higher Technology Department of Computer Science and Engineering ACADEMIC YEAR: 2018-19 Mini Project Report
9 pages
Importing Packages: Id Label Tweet 0 1 2 3 4
No ratings yet
Importing Packages: Id Label Tweet 0 1 2 3 4
8 pages
Minor 1
No ratings yet
Minor 1
20 pages
Template For The First Slide of PPT Presentation1
No ratings yet
Template For The First Slide of PPT Presentation1
18 pages
Sentiment Analysis On Twitter Data Using Machine Learning Algorithms in Python
No ratings yet
Sentiment Analysis On Twitter Data Using Machine Learning Algorithms in Python
15 pages
PPPT
No ratings yet
PPPT
20 pages
Dsbda
No ratings yet
Dsbda
12 pages
Operation Research
0% (3)
Operation Research
3 pages
CS5228 Project 2 Twitter Sentiment Analysis Group No.: 29: 1 Problem Statement
No ratings yet
CS5228 Project 2 Twitter Sentiment Analysis Group No.: 29: 1 Problem Statement
15 pages
Basic Communication Skills: Presented by
No ratings yet
Basic Communication Skills: Presented by
17 pages
Twitter Sentiment Analysis Using Python
No ratings yet
Twitter Sentiment Analysis Using Python
21 pages
Sentiment Analysis of Tweets Using Machine Learning
No ratings yet
Sentiment Analysis of Tweets Using Machine Learning
22 pages
Huawei ICT Competition 2023-2024 Exam Outline - Cloud Track
0% (1)
Huawei ICT Competition 2023-2024 Exam Outline - Cloud Track
1 page
Sms Spam Detection
No ratings yet
Sms Spam Detection
23 pages
Different Forms of Communication
50% (2)
Different Forms of Communication
13 pages
Assignment 2.1
No ratings yet
Assignment 2.1
3 pages
(Second Language Acquisition Research Series) Numa Markee-Conversation Analysis-Routledge (2000)
No ratings yet
(Second Language Acquisition Research Series) Numa Markee-Conversation Analysis-Routledge (2000)
287 pages
Chapter 3.3
No ratings yet
Chapter 3.3
61 pages
Recovery and Indexing
No ratings yet
Recovery and Indexing
60 pages
Analysis of Anti-Windup Techniques in PID Controle of Process With Measurement
No ratings yet
Analysis of Anti-Windup Techniques in PID Controle of Process With Measurement
6 pages
Semi-Detailed Lesson Plan
100% (3)
Semi-Detailed Lesson Plan
5 pages
Large Language Models: Dr. Asgari, Dr. Rohban, Soleymani Fall 2023
No ratings yet
Large Language Models: Dr. Asgari, Dr. Rohban, Soleymani Fall 2023
53 pages
ARALD Arabic Annotation Using Linked Data
No ratings yet
ARALD Arabic Annotation Using Linked Data
8 pages
IBM Watson Cognitive Solutions
No ratings yet
IBM Watson Cognitive Solutions
25 pages
Prediction of Compressive Strength of Research Paper
No ratings yet
Prediction of Compressive Strength of Research Paper
9 pages
Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows
No ratings yet
Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows
13 pages
Assignment No.1: Unit 1. Soft Computing Basics
No ratings yet
Assignment No.1: Unit 1. Soft Computing Basics
12 pages
An Approach For Logo Detection and Retrieval
No ratings yet
An Approach For Logo Detection and Retrieval
10 pages
ANN Assignment
No ratings yet
ANN Assignment
3 pages
A Survey Paper On Sign Language Recognition System Using OpenCV and Convolutional Neural Network
No ratings yet
A Survey Paper On Sign Language Recognition System Using OpenCV and Convolutional Neural Network
7 pages
6th Test 2nd Year Chap Five
No ratings yet
6th Test 2nd Year Chap Five
2 pages
ME 464 Exam Paper Final-Vetted
No ratings yet
ME 464 Exam Paper Final-Vetted
5 pages
Deep Learning Curve 1693642530
No ratings yet
Deep Learning Curve 1693642530
10 pages
Oral Communication Reviewer (Midterms)
No ratings yet
Oral Communication Reviewer (Midterms)
3 pages
LAB 2: Design of Lead-Compensator Controller For Ball and Beam System Objectives
No ratings yet
LAB 2: Design of Lead-Compensator Controller For Ball and Beam System Objectives
5 pages
IJERTV8IS100002
No ratings yet
IJERTV8IS100002
5 pages
Schedule 1
No ratings yet
Schedule 1
1 page
Systems
No ratings yet
Systems
4 pages