Chapter 4

Uploaded by

Ramdhan Firdaus
Copyright
© All Rights Reserved
Let's predict the sentiment!
SENTIMENT ANALYSIS IN PYTHON

Violeta Misheva
Data Scientist
Classification problems
Product and movie reviews: positive or negative sentiment (binary classification)

Tweets about airline companies: positive, neutral and negative (multi-class classification)


Linear and logistic regressions


Logistic function
Linear regression: numeric outcome

Logistic regression: probability

Probability(sentiment = positive | review)
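To make the contrast concrete, here is a minimal sketch of the logistic (sigmoid) function, which maps any real-valued score into the interval (0, 1) so that it can be read as a probability. The function name `logistic` and the example scores are illustrative assumptions, not taken from the slides.

```python
import math

def logistic(score):
    """Map a real-valued linear score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical linear scores (for example, weighted word counts plus an intercept)
for score in (-4.0, 0.0, 4.0):
    print(score, round(logistic(score), 4))
```

A score of 0 sits exactly on the boundary (probability 0.5); large positive scores approach 1 and large negative scores approach 0.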


Logistic regression in Python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(X, y)
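The call above assumes that `X` and `y` already exist. As a minimal sketch, assuming a bag-of-words representation and a handful of made-up reviews (the `reviews` list and its labels are hypothetical), they could be built like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy reviews; 1 = positive, 0 = negative
reviews = [
    "great movie, loved it",
    "wonderful acting and a great story",
    "terrible plot, hated it",
    "boring and terrible acting",
]
y = [1, 1, 0, 0]

# Turn the text into a numeric bag-of-words matrix
X = CountVectorizer().fit_transform(reviews)

log_reg = LogisticRegression().fit(X, y)
print(log_reg.classes_)    # the labels the model learned to predict
print(log_reg.predict(X))  # predicted label for each training review
```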


Measuring model performance
Accuracy: Fraction of predictions our model got right.

The higher and closer the accuracy is to 1, the better

# Accuracy using score


score = log_reg.score(X, y)
print(score)

0.9009


Using accuracy score
# Accuracy using accuracy_score
from sklearn.metrics import accuracy_score

y_predicted = log_reg.predict(X)
accuracy = accuracy_score(y, y_predicted)
print(accuracy)

0.9009
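Since accuracy is just the fraction of correct predictions, it can also be computed by hand. The sketch below uses made-up label lists (`y_true` and `y_pred` are hypothetical) and compares the manual count with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# Accuracy by hand: count matching positions and divide by the total
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = n_correct / len(y_true)

print(manual_accuracy)
print(accuracy_score(y_true, y_pred))
```

Both lines print the same value: 8 of the 10 predictions match, so the accuracy is 0.8.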


Let's practice!

Did we really predict the sentiment well?
Train/test split

Training set: used to train the model (70-80% of the whole data)

Testing set: used to evaluate the performance of the model


Train/test in Python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)

X : features

y : labels

test_size: proportion of data used in testing

random_state: seed for the random number generator used to make the split

stratify: proportion of classes in the sample produced will be the same as the proportion of
values provided to this parameter
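The effect of `stratify` is easiest to see on imbalanced labels. The following sketch uses made-up data (80 negative and 20 positive examples, an assumption for illustration) and counts the classes on each side of the split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 negative (0) and 20 positive (1) examples
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# Both parts keep the original 80/20 class ratio
print(Counter(y_train))
print(Counter(y_test))
```

Without `stratify=y`, a random split could by chance leave the test set with very few positive examples, which would make the test accuracy less trustworthy.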


Logistic regression with train/test split
log_reg = LogisticRegression().fit(X_train, y_train)

print('Accuracy on training data: ', log_reg.score(X_train, y_train))

0.76

print('Accuracy on testing data: ', log_reg.score(X_test, y_test))

0.73


Accuracy score with train/test split
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression().fit(X_train, y_train)

y_predicted = log_reg.predict(X_test)
print('Accuracy score on test data: ', accuracy_score(y_test, y_predicted))

0.73


Confusion matrix


Confusion matrix in Python
from sklearn.metrics import confusion_matrix

log_reg = LogisticRegression().fit(X_train, y_train)


y_predicted = log_reg.predict(X_test)

print(confusion_matrix(y_test, y_predicted)/len(y_test))

[[0.3788 0.1224]
[0.1352 0.3636]]
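To read such a matrix, it helps to work through a small case by hand. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes. The label lists below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Dividing by the number of examples turns counts into proportions
cm = confusion_matrix(y_true, y_pred) / len(y_true)
print(cm)

# Rows are true classes, columns are predicted classes
true_neg, false_pos = cm[0]
false_neg, true_pos = cm[1]

# The diagonal holds the correct predictions, so it sums to the accuracy
print(true_neg + true_pos)
```

Here 3 negatives and 4 positives are classified correctly, so the diagonal sums to 0.7, which equals the accuracy on these 10 examples.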


Let's practice!

Logistic regression: revisited
Complex models and regularization
Complex models:
Complex model that captures the noise in the data (overfitting)

Having a large number of features or parameters

Regularization:
A way to simplify and ensure we have a less complex model


Regularization in a logistic regression
from sklearn.linear_model import LogisticRegression

# Regularization arguments
LogisticRegression(penalty='l2', C=1.0)

L2: shrinks all coefficients towards zero

High values of C: low penalization, model fits the training data well.

Low values of C: high penalization, model less flexible.
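A quick way to see the effect of `C` is to compare the size of the fitted coefficients at several values. The sketch below is an illustration only: it uses synthetic data from `make_classification` as a stand-in for a vectorized review matrix, relies on the default penalty (L2), and raises `max_iter` so the solver converges.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a vectorized review matrix (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    # The default penalty is L2; only the strength C changes here
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(C, round(norms[C], 3))
```

The coefficient norm grows with `C`: a small `C` penalizes heavily and pulls the coefficients towards zero, while a large `C` lets the model fit the training data more closely.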


Predicting a probability vs. predicting a class
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict labels
y_predicted = log_reg.predict(X_test)

# Predict probability
y_probab = log_reg.predict_proba(X_test)


Predicting a probability vs. predicting a class
y_probab
array([[0.5002245, 0.4997755],
[0.4900345, 0.5099655],
...,
[0.7040499, 0.2959501]])

# Select the probabilities of class 1


y_probab = log_reg.predict_proba(X_test)[:, 1]

array([0.4997755, 0.5099655, ..., 0.2959501])
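The shape of the `predict_proba` output can be checked directly. This sketch uses synthetic data (an assumption for illustration) and shows that there is one column per class, that each row sums to 1, and that the column order follows `classes_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for vectorized reviews (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
log_reg = LogisticRegression(max_iter=1000).fit(X, y)

y_probab = log_reg.predict_proba(X)
print(y_probab.shape)    # one row per example, one column per class
print(log_reg.classes_)  # the column order matches this array

# Each row is a probability distribution over the classes
print(np.allclose(y_probab.sum(axis=1), 1.0))

# Column 1 holds the probability of the second entry in classes_
prob_class_1 = y_probab[:, 1]
```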


Model metrics with predicted probabilities
Accuracy score and confusion matrix work with classes.

They raise a ValueError when applied to probabilities.

# Default probability encoding:
# If probability >= 0.5, then class 1, else class 0
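Both points can be checked with a short sketch on synthetic data (an assumption for illustration): passing probabilities to `accuracy_score` fails, while converting them to labels with the 0.5 cutoff first works. The variable names `prob_class_1` and `y_from_proba` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary data standing in for vectorized reviews (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
log_reg = LogisticRegression(max_iter=1000).fit(X, y)
prob_class_1 = log_reg.predict_proba(X)[:, 1]

# Passing raw probabilities to a class-based metric fails
try:
    accuracy_score(y, prob_class_1)
    raised = False
except ValueError:
    raised = True
print(raised)

# Convert probabilities to labels with the 0.5 cutoff, then score
y_from_proba = (prob_class_1 >= 0.5).astype(int)
print(accuracy_score(y, y_from_proba))
```

Choosing a cutoff other than 0.5 is possible with the same pattern, which is one reason to work with probabilities rather than the labels from `predict`.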


Let's practice!

Bringing it all together
The Sentiment Analysis problem
Sentiment analysis is the process of understanding the opinion of an author about a subject

Movie reviews

Amazon product reviews

Twitter airline sentiment

Various emotionally charged literary examples


Exploration of the reviews
Basic information about size of reviews

Word clouds

Features for the length of reviews: number of words, number of sentences

Feature detecting the language of a review
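As a rough illustration of the length features, the sketch below counts words and sentences with plain Python. The helper `length_features` and the sample reviews are hypothetical, and the splitting rules are deliberately naive; a proper tokenizer would handle abbreviations and punctuation more robustly. Language detection needs a third-party library and is not shown.

```python
import re

# Hypothetical reviews
reviews = [
    "Great phone. Battery lasts two days!",
    "Not worth the money.",
]

def length_features(text):
    """Return naive word and sentence counts for one review."""
    n_words = len(text.split())
    # Split on sentence-ending punctuation; a tokenizer would be more robust
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {"n_words": n_words, "n_sentences": len(sentences)}

features = [length_features(r) for r in reviews]
print(features)
```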


Numeric transformations of sentiment-carrying columns
Bag-of-words

TfIdf vectorization

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Vectorizer syntax
vect = CountVectorizer().fit(data.text_column)
X = vect.transform(data.text_column)
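To see what the vectorizer produces, here is the same syntax applied to three tiny texts. The `texts` list is a hypothetical stand-in for `data.text_column`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for data.text_column
texts = ["good movie", "bad movie", "good good plot"]

vect = CountVectorizer().fit(texts)
X = vect.transform(texts)

print(sorted(vect.vocabulary_))  # learned vocabulary, alphabetical
print(X.shape)                   # (number of texts, vocabulary size)
print(X.toarray())               # word counts per text
```

Each row of `X` is one text and each column is one vocabulary word, so the third row has a 2 in the "good" column.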


Arguments of the vectorizers
stop words: non-informative, frequently occurring words

n-gram range: use phrases not only single words

control size of vocabulary: max_features, max_df, min_df

capturing a pattern of tokens: remove digits or certain characters

Important, but NOT arguments to the vectorizers:

lemmas and stems
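The arguments listed above map onto real parameters of the scikit-learn vectorizers. The sketch below sets several of them on a `TfidfVectorizer`; the sample texts and the specific values (such as `max_features=20` and the letters-only pattern) are illustrative assumptions. Lemmatization and stemming would have to be applied to the text beforehand, since they are not vectorizer arguments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical reviews
texts = [
    "The movie was great, 10 out of 10",
    "The plot was not great and the ending was bad",
    "I watched the movie 2 times",
]

vect = TfidfVectorizer(
    stop_words='english',        # drop common, non-informative words
    ngram_range=(1, 2),          # keep single words and two-word phrases
    max_features=20,             # cap the vocabulary size
    token_pattern=r'[A-Za-z]+',  # tokens made of letters only, so digits are dropped
)
X = vect.fit_transform(texts)
print(sorted(vect.vocabulary_))
```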


Supervised learning model
Logistic regression classifier to predict the sentiment

Evaluated with accuracy and confusion matrix

Importance of train/test split
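Put together, the steps could look like the following sketch. The toy corpus is made up and contains repeated sentences, so the printed accuracy says nothing about real data. Splitting the raw text before fitting the vectorizer is a design choice made here so that the vocabulary is learned from the training set only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 1 = positive, 0 = negative
positive = ["great product", "loved this movie", "really good value", "excellent and fun"]
negative = ["terrible product", "hated this movie", "really bad value", "boring and awful"]
texts = (positive + negative) * 5
labels = ([1] * len(positive) + [0] * len(negative)) * 5

# Split the raw text first so the vocabulary is learned from training data only
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=123, stratify=labels
)

vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
X_test = vect.transform(text_test)

log_reg = LogisticRegression().fit(X_train, y_train)
y_predicted = log_reg.predict(X_test)

acc = accuracy_score(y_test, y_predicted)
cm = confusion_matrix(y_test, y_predicted) / len(y_test)
print('Accuracy on test data: ', acc)
print(cm)
```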


Let's practice!

Wrap up
The Sentiment Analysis world


Sentiment analysis types


The automated sentiment analysis system


Congratulations!