Chapter 4
sentiment!
SENTIMENT ANALYSIS IN PYTHON
Violeta Misheva
Data Scientist
Classification problems
Product and movie reviews: positive or negative sentiment (binary classification)
Tweets about airline companies: positive, neutral and negative (multi-class classification)
Probability(sentiment = positive | review)
log_reg = LogisticRegression().fit(X, y)
0.9009
y_predicted = log_reg.predict(X)
accuracy = accuracy_score(y, y_predicted)
0.9009
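A minimal, self-contained sketch of the fit/predict/score steps above. The toy features and labels are hypothetical stand-ins for the review data, so the printed accuracy will not match the 0.9009 shown on the slide. It also illustrates that `score()` on a classifier and `accuracy_score` report the same quantity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy data: two numeric features per "review", binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Fit on the full data, as on the slide
log_reg = LogisticRegression().fit(X, y)

# Accuracy computed from explicit predictions
y_predicted = log_reg.predict(X)
accuracy = accuracy_score(y, y_predicted)

# score() returns mean accuracy for classifiers, so the two values agree
builtin_score = log_reg.score(X, y)
print(accuracy, builtin_score)
```

Both numbers are measured on the same data the model was trained on, so they tend to be optimistic.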
Train/test split
Training set: used to train the model (70-80% of the whole data)
X : features
y : labels
stratify: proportion of classes in the sample produced will be the same as the proportion of
values provided to this parameter
0.76
0.73
y_predicted = log_reg.predict(X_test)
print('Accuracy score on test data: ', accuracy_score(y_test, y_predicted))
0.73
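The split and test-set evaluation can be sketched as follows. The data is synthetic and deliberately imbalanced, and `test_size=0.2` and `random_state=123` are assumed values chosen for illustration. With `stratify=y`, the share of positive labels should be nearly identical in the training and test sets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly a quarter of the labels are positive
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0.8).astype(int)

# Hold out 20% for testing; stratify keeps class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# Train only on the training set
log_reg = LogisticRegression().fit(X_train, y_train)

# Compare accuracy on seen and unseen data
train_acc = log_reg.score(X_train, y_train)
y_predicted = log_reg.predict(X_test)
test_acc = accuracy_score(y_test, y_predicted)

print('Share of positives (train, test):', y_train.mean(), y_test.mean())
print('Accuracy (train, test):', train_acc, test_acc)
```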
print(confusion_matrix(y_test, y_predicted)/len(y_test))
[[0.3788 0.1224]
[0.1352 0.3636]]
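To make the normalized confusion matrix easier to read, here is a small hand-checkable sketch with hypothetical labels. In scikit-learn, rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions and its sum equals the accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 5 negatives (0) and 5 positives (1)
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_predicted = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# Divide counts by the number of observations to get proportions
cm = confusion_matrix(y_test, y_predicted) / len(y_test)
print(cm)

# cm[0, 0]: true negatives, cm[0, 1]: false positives
# cm[1, 0]: false negatives, cm[1, 1]: true positives
accuracy = accuracy_score(y_test, y_predicted)
print(np.trace(cm), accuracy)
```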
Complex models and regularization
Complex models:
Complex model that captures the noise in the data (overfitting)
Regularization:
A way to simplify and ensure we have a less complex model
# Regularization arguments
LogisticRegression(penalty='l2', C=1.0)
# Predict labels
y_predicted = log_reg.predict(X_test)
# Predict probability
y_probab = log_reg.predict_proba(X_test)
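A rough sketch of how `C` controls regularization strength and how predicted probabilities relate to predicted labels. The data and the two `C` values are assumptions picked for illustration. The example relies on the default L2 penalty, where a smaller `C` means stronger regularization and therefore smaller coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: only the first two of five features carry signal
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Large C: weak regularization; small C: strong regularization
weak = LogisticRegression(C=100.0).fit(X, y)
strong = LogisticRegression(C=0.01).fit(X, y)

# Stronger regularization shrinks the coefficient vector
weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print(weak_norm, strong_norm)

# predict_proba returns one column per class; each row sums to 1
y_probab = strong.predict_proba(X)
y_predicted = strong.predict(X)

# The predicted label is the class with the highest probability
labels_from_proba = strong.classes_[y_probab.argmax(axis=1)]
```

Working with `predict_proba` instead of `predict` allows a threshold other than 0.5 to be applied when one type of error is more costly than the other.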
The Sentiment Analysis problem
Sentiment analysis is the process of understanding the opinion of an author about a
subject
Movie reviews
Word clouds
TfIdf vectorization
# Vectorizer syntax
vect = CountVectorizer().fit(data.text_column)
X = vect.transform(data.text_column)
The Sentiment Analysis world