Twitter Sentiment Analysis
Twitter Sentiment Analysis
Introduction
Detecting hate speech in tweets involves classifying tweets as either containing racist or sexist sentiment or not. To accomplish this using Python libraries, we can employ NLP techniques and machine learning algorithms. By analyzing the text content and applying sentiment analysis models, we can train a classifier to distinguish between tweets with hate speech
and those without.
Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a branch of natural language processing (NLP) that involves the use of computational techniques to determine and extract subjective information from text data. It aims to analyze and understand the sentiment, emotions, attitudes, and opinions expressed within a given piece of text.
The primary goal of sentiment analysis is to automatically classify the sentiment of a text document, such as a tweet, review, or customer feedback, into different categories, typically positive, negative, or neutral. However, sentiment analysis can also include more fine-grained sentiment classifications, such as very positive, positive, neutral, negative, and very
negative.
Sentiment analysis techniques leverage various approaches, including machine learning algorithms, lexicon-based methods, and rule-based systems. These techniques process text data by examining patterns, semantic structures, linguistic features, and context to determine the sentiment orientation.
In [138]:
print(colored("\nDATASETS WERE SUCCESFULLY LOADED...", color = "orange", attrs = ["dark", "bold"]))
Out[139]:
id label tweet
0 1 0 @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1 2 0 @user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
Out[140]:
id tweet
1 31964 @user #white #supremacists want everyone to see the new â #birdsâ #movie â and hereâs why
3 31966 is the hp and the cursed child book up for reservations already? if yes, where? if no, when? ððð #harrypotter #pottermore #favorite
4 31967 3rd #bihday to my amazing, hilarious #nephew eli ahmir! uncle dave loves you and missesâ¦
In [141]: #shape
In [142]: format(train_set.shape)
In [143]: #format(test_set.shape)
In [144]: train_set.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 31962 non-null int64
1 label 31962 non-null int64
2 tweet 31962 non-null object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB
In [145]: test_set.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17197 entries, 0 to 17196
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 17197 non-null int64
1 tweet 17197 non-null object
dtypes: int64(1), object(1)
memory usage: 268.8+ KB
In [146]: train_set.groupby("label").count().style.background_gradient(cmap="autumn")
Out[146]:
id tweet
label
0 29720 29720
1 2242 2242
data Exploration
In [153]: c=CountVectorizer(stop_words='english')
word=c.fit_transform(train_set.tweet)
summation=word.sum(axis=0)
print(summation)
word freq
0 user 17577
1 love 2749
2 day 2311
3 amp 1776
4 happy 1686
... ... ...
41099 isz 1
41100 airwaves 1
41101 mantle 1
41102 shirley 1
41103 chisolm 1
In [157]: num_of_words(train_set)
tweet word_count
0 @user when a father is dysfunctional and is s... 21
1 @user @user thanks for #lyft credit i can't us... 22
2 bihday your majesty 5
3 #model i love u take with u all the time in ... 17
4 factsguide: society now #motivation 8
In [158]: num_of_words(test_set)
tweet word_count
0 #studiolife #aislife #requires #passion #dedic... 12
1 @user #white #supremacists want everyone to s... 20
2 safe ways to heal your #acne!! #altwaystohe... 15
3 is the hp and the cursed child book up for res... 24
4 3rd #bihday to my amazing, hilarious #nephew... 18
In [160]: num_of_chars(train_set)
tweet char_count
0 @user when a father is dysfunctional and is s... 102
1 @user @user thanks for #lyft credit i can't us... 122
2 bihday your majesty 21
3 #model i love u take with u all the time in ... 86
4 factsguide: society now #motivation 39
In [161]: num_of_chars(test_set)
tweet char_count
0 #studiolife #aislife #requires #passion #dedic... 90
1 @user #white #supremacists want everyone to s... 101
2 safe ways to heal your #acne!! #altwaystohe... 71
3 is the hp and the cursed child book up for res... 142
4 3rd #bihday to my amazing, hilarious #nephew... 93
In [162]: #Number of stopwords
set(stopwords.words('english'))
once ,
'only',
'or',
'other',
'our',
'ours',
'ourselves',
'out',
'over',
'own',
're',
's',
'same',
'shan',
"shan't",
'she',
"she's",
'should',
"should've",
'shouldn',
" h ld 't"
In [163]: stop = stopwords.words('english')
In [165]: stop_words(train_set)
tweet stopwords
0 @user when a father is dysfunctional and is s... 10
1 @user @user thanks for #lyft credit i can't us... 5
2 bihday your majesty 1
3 #model i love u take with u all the time in ... 5
4 factsguide: society now #motivation 1
In [166]: stop_words(test_set)
tweet stopwords
0 #studiolife #aislife #requires #passion #dedic... 1
1 @user #white #supremacists want everyone to s... 4
2 safe ways to heal your #acne!! #altwaystohe... 2
3 is the hp and the cursed child book up for res... 8
4 3rd #bihday to my amazing, hilarious #nephew... 4
In [168]: hash_tags(train_set)
tweet hashtags
0 @user when a father is dysfunctional and is s... 1
1 @user @user thanks for #lyft credit i can't us... 3
2 bihday your majesty 0
3 #model i love u take with u all the time in ... 1
4 factsguide: society now #motivation 1
In [169]: hash_tags(test_set)
tweet hashtags
0 #studiolife #aislife #requires #passion #dedic... 7
1 @user #white #supremacists want everyone to s... 4
2 safe ways to heal your #acne!! #altwaystohe... 4
3 is the hp and the cursed child book up for res... 3
4 3rd #bihday to my amazing, hilarious #nephew... 2
number of numerics
In [171]: num_numerics(train_set)
tweet numerics
0 @user when a father is dysfunctional and is s... 0
1 @user @user thanks for #lyft credit i can't us... 0
2 bihday your majesty 0
3 #model i love u take with u all the time in ... 0
4 factsguide: society now #motivation 0
In [172]: num_numerics(test_set)
tweet numerics
0 #studiolife #aislife #requires #passion #dedic... 0
1 @user #white #supremacists want everyone to s... 0
2 safe ways to heal your #acne!! #altwaystohe... 0
3 is the hp and the cursed child book up for res... 0
4 3rd #bihday to my amazing, hilarious #nephew... 0
In [ ]:
DELETED SUCCESFULLY...
In [179]:
sw = stopwords.words("english")
train_set['tweet'] = train_set['tweet'].apply(lambda x: " ".join (x for x in x.split() if x not in sw))
test_set['tweet'] = test_set['tweet'].apply(lambda x: " ".join (x for x in x.split() if x not in sw))
print(colored("\nSTOPWORDS DELETED SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))
CountVectorization
CounterVectorization is a SciKitLearn library takes any text document and returns each unique word as a feature with the count of number of times that word occurs
In [181]: corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
In [182]:
print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']
In [184]: print(X2.toarray())
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
[0 1 0 1 0 1 0 1 0 0 1 0 0]
[1 0 0 1 0 0 0 0 1 1 0 1 0]
[0 0 1 0 1 0 1 0 0 0 0 0 1]]
Hashing Vectorizer
Hashing Vectorizer converts text to a matrix of occurrences using the hashing trick it converts a collection of text documents to a matrix of token occurrences.
In [185]: corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = HashingVectorizer(n_features=2**4)
X = vectorizer.fit_transform(corpus)
print(X.shape)
(4, 16)
Lower Casing
Another pre-processing step which we will do is to transform our tweets into lower case. This avoids having multiple copies of the same words. For example, while calculating the word count, ‘Lower’ and ‘lower’ will be taken as different words.
In [187]: lower_case(train_set)
In [188]: lower_case(test_set)
In [190]: freq=list(freq.index)
In [192]: frequent_words_removal(train_set)
In [193]: frequent_words_removal(test_set)
Out[195]: socalled 1
haleððââðð 1
becauseyouturnedintoarat 1
cryingforever 1
anitgay 1
threads 1
destroyingpotential 1
onlyrelatives 1
myfamilysucks 1
chisolm 1
dtype: int64
In [198]: rare_words_removal(train_set)
In [199]: rare_words_removal(test_set)
Spelling Correction
In [201]: spell_correction(train_set)
In [202]: spell_correction(test_set)
Tokenization
In [204]: tokens(train_set)
Out[204]: WordList(['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'])
In [205]: tokens(test_set
)
Out[205]: WordList(['white', 'supremacists', 'want', 'everyone', 'see', 'new', 'birdsâ', 'movie', 'hereâs'])
Stemming
In [206]: st = PorterStemmer()
In [209]: stemming(test_set)
In [210]: #Lemmatization
#Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is, lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect me
In [212]: lemmatization(train_set)
In [213]: lemmatization(test_set)
N-Grams
N-grams are the combination of multiple words used together. Ngrams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on. Unigrams do not usually contain as much information as compared to bigrams and trigrams. The basic principle behind n-grams is that they capture the language structure, like what letter or word is likely to
follow the given one. The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.
In [215]: combination_of_words(train_set)
In [216]: combination_of_words(test_set)
Term Frequent
Term frequency is simply the ratio of the count of a word present in a sentence, to the length of the sentence.
Out[218]:
words tf
0 thanks 1
1 lyft 1
2 credit 1
3 cant 1
4 use 1
In [219]: term_frequency(test_set)
Out[219]:
words tf
0 white 1
1 supremacist 1
2 want 1
3 everyone 1
4 see 1
Bag of words
In [ ]:
Bag of Words (BoW) refers to the representation of text which describes the presence of words within the text data. The intuition behind this is that two similar text fields will contain similar kind of words, and will therefore have a similar bag of words. Further, that from the text alone we can learn something about the meaning of the document.
Sentiment Analysis
In [222]: polarity_subjectivity(train_set)
In [223]: polarity_subjectivity(test_set)
In [225]: sentiment_analysis(train_set)
Out[225]:
tweet sentiment
1 thanks lyft credit cant use cause dont offer w... 0.2
Out[226]:
tweet sentiment
In [228]: train_set.head(n=10)
Out[228]:
label tweet word_count char_count stopwords hashtags numerics sentiment
1 0 thanks lyft credit cant use cause dont offer w... 22 122 5 3 0 0.2
5 0 huge fan fare big talking leave chaos pay disp... 21 116 6 1 0 0.2
7 0 next school year year examsð cant think school... 23 143 6 7 0 -0.4
9 0 welcome gr 15 50 3 1 0 0.8
DIVIDED SUCCESFULLY...
VECTORIZE DATA
Word Embeddings or Word vectorization is a methodology in NLP to map words or phrases from vocabulary to a corresponding vector of real numbers which used to find word predictions, word similarities/semantics.
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_1616\618821171.py in <module>
5 x_test_count = vectorizer.transform(test_x)
6
----> 7 x_train_count.toarray()
MemoryError: Unable to allocate 6.74 GiB for an array with shape (25569, 35355) and data type int64
Thank You