Sentiment Analysis
● Python Libraries:
○ NLTK/TextBlob: For simple sentiment scoring.
○ VADER Sentiment Analysis: Rule-based scoring (tuned for short, informal text; a common baseline for financial texts).
○ HuggingFace Transformers: Advanced models like BERT for
sentiment classification.
● Third-Party Tools:
○ AWS Comprehend or IBM Watson: For automated sentiment
analysis with pre-built dashboards.
● Visualization:
○ Use tools like Tableau or Python’s Matplotlib/Seaborn to visualize
sentiment trends across calls or companies.
Building a Machine Learning (ML) pipeline for sentiment analysis involves multiple
stages, from data acquisition to deployment. Here's a detailed walkthrough of
how to design such a pipeline for concall sentiment analysis:
1. Data Collection
Tasks:
● Source Data: Collect concall transcripts from earnings call recordings or publicly
available datasets. Use web scraping or APIs to gather transcripts if they are not
pre-collected.
● Structure Data: Ensure the transcripts include metadata like speaker roles (e.g.,
management, analysts), timestamp, and sentiment labels (if available).
Tools:
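If transcripts must be scraped, a minimal sketch might look like this (the URL and page structure are hypothetical; real sources need their own parsers and permission checks):
python
import requests
from bs4 import BeautifulSoup

# Hypothetical transcript URL; replace with a real, permitted source
url = "https://example.com/earnings-call-transcript"
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
transcript_text = soup.get_text(separator=" ", strip=True)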
2. Data Preprocessing
Tasks:
1. Text Cleaning:
○ Remove irrelevant elements like timestamps, filler words, and HTML tags.
○ Normalize text (convert to lowercase, remove punctuation).
2. Tokenization:
○ Break sentences into words or phrases.
○ Example: "The revenue increased by 10%." → ["The", "revenue", "increased",
"by", "10%"]
3. Stopword Removal:
○ Remove common words like “the,” “is,” “and” that don’t add meaning.
4. Part-of-Speech (POS) Tagging:
○ Identify verbs, nouns, etc., to focus on meaningful terms.
5. Lemmatization/Stemming:
○ Convert words to their root forms (e.g., “running” → “run”).
Tools:
3. Feature Engineering
Tasks:
1. TF-IDF Vectorization:
○ Represent text numerically by calculating the importance of words in a
document relative to the entire corpus.
2. Word Embeddings:
○ Use pre-trained embeddings (e.g., GloVe, Word2Vec, BERT) to capture
contextual meaning.
3. Sentiment Scoring:
○ Use rule-based methods (like VADER or TextBlob) for an initial sentiment
score.
4. Metadata Inclusion:
○ Include non-textual features like speaker role (management vs. analyst),
duration of speech, and topic relevance.
Tools:
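A rough sketch of folding metadata into the feature set (the concall DataFrame and its SPEAKER_ROLE column are assumptions for illustration):
python
import pandas as pd

# Hypothetical concall DataFrame with a SPEAKER_ROLE column (management/analyst)
meta = pd.get_dummies(concall['SPEAKER_ROLE'], prefix='role')

# Append the one-hot speaker-role columns to the text features
features_df = pd.concat([features_df.reset_index(drop=True),
                         meta.reset_index(drop=True)], axis=1)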
4. Model Training
Tasks:
1. Choose Model:
○ Start with traditional models like Logistic Regression or SVM for baseline
results.
○ Advance to deep learning models like LSTMs, GRUs, or transformers (e.g.,
BERT, RoBERTa) for contextual sentiment analysis.
2. Data Splitting:
○ Split data into training, validation, and test sets (e.g., 70:20:10 ratio).
3. Hyperparameter Tuning:
○ Optimize parameters using Grid Search, Random Search, or tools like
Optuna.
4. Cross-Validation:
○ Use k-fold cross-validation to evaluate model robustness.
Tools:
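A sketch of the tuning and cross-validation steps above (X and y are hypothetical feature/label arrays; the parameter grid is illustrative):
python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression

# Hypothetical X (features) and y (labels); grid values are illustrative
param_grid = {'C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print("Best params:", search.best_params_)

# k-fold cross-validation (k=5) of the tuned model
scores = cross_val_score(search.best_estimator_, X, y, cv=5, scoring='f1_macro')
print("Mean CV F1:", scores.mean())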
5. Model Evaluation
Metrics to Use: Accuracy, precision, recall, F1-score, and the confusion matrix.
Tasks:
Tools:
6. Deployment Pipeline
Tasks:
1. Model Packaging:
○ Save the trained model (e.g., pickle, ONNX format).
2. API Development:
○ Wrap the model in an API using Flask, FastAPI, or Django.
○ Example: Send text data to the API and receive sentiment scores.
3. Monitoring and Logging:
○ Track performance in production using tools like Prometheus, Grafana, or
AWS CloudWatch.
Tools:
7. Visualization and Insights
Tasks:
1. Create Dashboards:
○ Visualize sentiment trends over time, e.g., positive vs. negative sentiment
distribution for different companies or sectors.
2. Provide Insights:
○ Highlight key phrases contributing to each sentiment.
○ Present findings on areas of risk, growth, or strategic focus.
Tools:
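A minimal plotting sketch, assuming a sentiment_df with per-call date and compound columns (the column names are illustrative):
python
import matplotlib.pyplot as plt

# Hypothetical sentiment_df with 'date' and 'compound' columns per call
sentiment_df.sort_values('date').plot(x='date', y='compound', marker='o')
plt.title('Sentiment trend across earnings calls')
plt.ylabel('VADER compound score')
plt.show()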
By following this pipeline, you can perform sentiment analysis on concalls effectively,
enabling deeper insights into company performance and market trends.
Deployment with Code
Step 1: Importing Libraries
python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import re
import string
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
import PyPDF2
from google.colab import drive
import io
from afinn import Afinn
● The notebook first reads the data from an Excel file (NEULAND.xlsx) into a Pandas
DataFrame concall.
● The file is uploaded in the Colab environment, and io.BytesIO is used to read it as
a byte stream.
● Each PDF file is then opened and read with PyPDF2.PdfReader.
● The number of pages in the PDF is stored in page_count.
● A single page (pageObj) is selected for text extraction.
● A loop iterates over all PDF files and extracts text from every page.
● The text is concatenated and saved into the concall DataFrame under the columns
PAGE_COUNT (number of pages) and CONTENT (extracted text), roughly as sketched below.
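The original notebook cells are not reproduced here; a minimal sketch of the described steps might look like this (the FILE column naming each PDF and the files.upload() flow are assumptions):
python
from google.colab import files

# Upload the Excel file and read it as a byte stream
uploaded = files.upload()
concall = pd.read_excel(io.BytesIO(uploaded['NEULAND.xlsx']))

page_counts, contents = [], []
for pdf_name in concall['FILE']:  # hypothetical column listing the PDF files
    reader = PyPDF2.PdfReader(pdf_name)
    page_counts.append(len(reader.pages))
    # Concatenate the text of every page in the PDF
    contents.append(" ".join(page.extract_text() or "" for page in reader.pages))

concall['PAGE_COUNT'] = page_counts
concall['CONTENT'] = contents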
def word_count(text):
    # Count words by splitting the text on whitespace
    return len(str(text).split())

concall['WORD_COUNT'] = concall['CONTENT'].apply(word_count)
concall
● This function counts the number of words in the extracted text by splitting the text into
words and counting them.
● It applies this function to the CONTENT column of the DataFrame and creates a new
column WORD_COUNT.
● Frequent words are identified by counting the occurrence of each word in the entire
dataset and selecting the top 20 most frequent words.
● These frequent words are then removed from the CONTENT column by filtering them
out.
● Similarly, rare words (those that appear very few times) are identified and removed
by selecting the least frequent words from the dataset, roughly as sketched below.
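A sketch of the frequent/rare-word removal described above (the top-20 cutoff comes from the text; treating words that appear only once as "rare" is an assumption):
python
all_words = pd.Series(" ".join(concall['CONTENT']).split())
freq = all_words.value_counts()

frequent_words = set(freq.head(20).index)   # top 20 most frequent words
rare_words = set(freq[freq == 1].index)     # assumed rare-word cutoff

# Filter both sets out of the CONTENT column
concall['CONTENT'] = concall['CONTENT'].apply(
    lambda x: " ".join(w for w in x.split()
                       if w not in frequent_words and w not in rare_words))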
Step 8: Stemming
python
from nltk.stem import PorterStemmer

st = PorterStemmer()
concall['CONTENT'] = concall['CONTENT'].apply(
    lambda x: " ".join(st.stem(word) for word in x.split()))
● Stemming is performed to reduce words to their root form (e.g., "running" -> "run").
● This helps in standardizing words to their base form and reducing the dimensionality
of the text.
● Stopwords are commonly used words (e.g., "the", "is", "in") that are typically
removed before text analysis.
● This block of code counts how many stopwords are present in the CONTENT column
for each entry.
● The stopwords are removed from the CONTENT column by filtering out words that are
in the stopwords list, as in the sketch below.
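A sketch of the stopword counting and removal described above (assumes the NLTK stopwords data has been downloaded):
python
stop_words = set(stopwords.words('english'))

# Count stopwords per entry, then strip them from CONTENT
concall['STOPWORD_COUNT'] = concall['CONTENT'].apply(
    lambda x: len([w for w in x.split() if w.lower() in stop_words]))
concall['CONTENT'] = concall['CONTENT'].apply(
    lambda x: " ".join(w for w in x.split() if w.lower() not in stop_words))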
analyzer = SentimentIntensityAnalyzer()

def sentiment_analyzer_scores(text):
    score = analyzer.polarity_scores(text)
    print(text)
    print(score)

text_pos = concall['CONTENT'][1]
sentiment_analyzer_scores(text_pos)

def get_neg_word(x):
    text = x['CONTENT']
    tokenized_text = nltk.word_tokenize(text)
    neg_word_list = []
    for word in tokenized_text:
        # A word counts as negative when its compound score is <= -0.5
        if analyzer.polarity_scores(word)['compound'] <= -0.5:
            neg_word_list.append(word)
    return set(neg_word_list)
● Positive and negative words are identified based on the sentiment score of each
word. Words with a positive score greater than or equal to 0.5 are considered
positive, and those with a negative score less than or equal to -0.5 are considered
negative.
● The AFINN scores are normalized by dividing the score by the word count and
multiplying by 100. This adjusts the sentiment score according to the length of the
text.
● The sentiment scores for each entry in the CONTENT column are computed and
stored in a new DataFrame sentiment_df.
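A sketch of the AFINN normalization and sentiment_df construction described above (the exact columns of sentiment_df are an assumption):
python
afinn = Afinn()

# Normalize the raw AFINN score by document length, as described above
def afinn_normalized(row):
    return afinn.score(row['CONTENT']) / row['WORD_COUNT'] * 100

sentiment_df = pd.DataFrame({
    'CONTENT': concall['CONTENT'],
    'AFINN_NORM': concall.apply(afinn_normalized, axis=1),
})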
Final Output
The final output contains a DataFrame that includes the content, word count, stopwords
count, sentiment scores, and additional sentiment-related metrics such as positive and
negative words.
Next Steps
● Save the cleaned and analyzed data into a new Excel file for further analysis or
reporting.
● Enhance sentiment analysis by combining both VADER and Afinn or using more
advanced models like transformers for better accuracy.
Modified Pipeline: Working with Local PDFs
1. Data Collection
Since you already have the PDFs downloaded locally, you can directly access them from
your local directory.
Tasks:
● Directory Setup: Ensure that the PDFs are organized in a specific directory on your
local machine.
● Path Handling: Use Python to iterate over the files in that directory.
Tools:
● Libraries: Use os to handle file paths and iterate through PDF files.
Example:
python
import os
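For example, to collect the PDF paths (the directory name is illustrative):
python
# Hypothetical local folder holding the concall PDFs
pdf_dir = "concall_pdfs"
pdf_paths = [os.path.join(pdf_dir, f)
             for f in os.listdir(pdf_dir) if f.lower().endswith(".pdf")]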
2. Data Preprocessing
Tasks:
● Text Extraction: Use PyPDF2 or pdfminer.six to extract text from each PDF.
● Text Cleaning: Clean the text by removing unwanted characters, converting to
lowercase, and tokenizing the text.
● Stopword Removal and Lemmatization: Process the text further to remove
stopwords and perform lemmatization.
Tools:
● Libraries: PyPDF2 for PDF text extraction, NLTK for tokenization and stopword
removal.
Example:
python
from PyPDF2 import PdfReader
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the NLTK 'stopwords' and 'punkt' data have been downloaded
stop_words = set(stopwords.words('english'))

def clean_pdf_text(pdf_path):
    # Read PDF content
    pdf_reader = PdfReader(pdf_path)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text() or ""
    # Lowercase, strip non-letter characters, tokenize, and drop stopwords
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    tokens = [w for w in word_tokenize(text) if w not in stop_words]
    return " ".join(tokens)
3. Feature Engineering
Now, convert the cleaned text into features that can be fed into your machine learning
model.
Tasks:
Tools:
Example:
python
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd

# TF-IDF vectorization over the cleaned texts from step 2
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned_texts)

# VADER scores as extra features (one dict of neg/neu/pos/compound per text;
# requires nltk.download('vader_lexicon'))
sia = SentimentIntensityAnalyzer()
sentiment_df = pd.DataFrame(sia.polarity_scores(t) for t in cleaned_texts)

# Optionally, combine the TF-IDF features and sentiment scores
features_df = pd.DataFrame(tfidf_matrix.toarray(),
                           columns=vectorizer.get_feature_names_out())
features_df = pd.concat([features_df, sentiment_df], axis=1)
4. Model Training
Now, train a machine learning model to classify sentiment based on the features.
Tasks:
● Train-Test Split: Split the data into training and testing sets.
● Model Training: Use a classification model like Logistic Regression, Support Vector
Machine, or a deep learning model if required.
Tools:
Example:
python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split features and labels (labels must come from annotated transcripts)
X_train, X_test, y_train, y_test = train_test_split(
    features_df, labels, test_size=0.2, random_state=42)

# Fit a baseline classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
5. Model Evaluation
Evaluate the model's performance using metrics like accuracy, precision, recall, and
F1-score.
Tasks:
Tools:
Example:
python
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Classification report
print(classification_report(y_test, y_pred))

# Confusion matrix as a heatmap
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
plt.show()
6. Deployment
Once the model is trained and evaluated, you can deploy it to analyze sentiment from new
PDFs.
Tasks:
● Deployment: Use a web framework like Flask or FastAPI to serve the model via
an API endpoint where users can upload PDFs and get sentiment predictions.
Tools:
python
from flask import Flask, request, jsonify
from PyPDF2 import PdfReader

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Get the uploaded file
    file = request.files['file']
    # Extract the text; production code should apply the same cleaning as training
    text = " ".join(page.extract_text() or "" for page in PdfReader(file).pages)
    # vectorizer and clf are the objects fitted in steps 3 and 4
    features = vectorizer.transform([text])
    return jsonify({'sentiment': str(clf.predict(features)[0])})

if __name__ == '__main__':
    app.run(debug=True)
Final Remarks:
This modified pipeline ensures that you're working directly with the PDF files stored locally.
The steps involve reading PDFs, preprocessing the data, vectorizing the text, training a
sentiment analysis model, and optionally deploying it through an API for real-time
predictions.
Key Questions and Lines for Sentiment Analysis
Management Commentary:
Analyst Questions:
● Concerns: “What steps are you taking to address rising input costs?”
○ Neutral/Negative: Suggests existing challenges or risks.
● Opportunities: “Can you elaborate on the impact of your new product launch?”
○ Neutral/Positive: Explores potential growth areas.
● Clarifications: “Could you provide more details on the revenue miss this quarter?”
○ Neutral/Negative: Points to gaps in performance or transparency.
Management Responses:
● Defensive or vague answers: “We are monitoring the situation closely and believe it
will stabilize soon.”
○ Negative: Indicates uncertainty or lack of clarity.
● Confident, detailed answers: “We’ve secured new suppliers to mitigate the issue, and
we anticipate resolving it by Q2.”
○ Positive: Shows proactive measures and control.
Example:
Imagine analyzing a tech company’s earnings call where the CEO states:
1. Positive Statements:
○ "We launched three new products this quarter, contributing to a 25% revenue
growth."
○ "Customer feedback has been overwhelmingly positive."
2. Negative Statements:
○ "We encountered delays in our supply chain due to unforeseen
circumstances."
○ "Our operating margins have been impacted by rising material costs."
By applying sentiment analysis:
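For instance, a rule-based scorer like VADER would separate these statements (a quick sketch; scores are indicative, not exact):
python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
statements = [
    "We launched three new products this quarter, contributing to a 25% revenue growth.",
    "Customer feedback has been overwhelmingly positive.",
    "We encountered delays in our supply chain due to unforeseen circumstances.",
    "Our operating margins have been impacted by rising material costs.",
]
for s in statements:
    # compound > 0 leans positive, < 0 leans negative
    print(analyzer.polarity_scores(s)['compound'], s)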
Outcome:
As an investor, you could focus on whether the company's positive growth potential
outweighs its operational risks.
1. Positive Sentiment
What It Means:
The company is confident and optimistic about its performance and future.
1. Invest More:
○ If the company is doing well and the numbers back it up, think about buying
more shares.
2. Check the Details:
○ Make sure the company’s claims match its financial performance (e.g., profits,
sales growth).
3. Compare with Others:
○ See if competitors are doing as well or if this company is leading the market.
4. Be Cautious of Overconfidence:
○ Watch out for management sounding too positive without real proof.
5. Look for Opportunities:
○ Focus on areas they highlight as growing, like new products or markets.
2. Neutral Sentiment
What It Means:
The company doesn’t sound very positive or negative. They might be cautious or uncertain.
1. Dig Deeper:
○ Look into the financial reports to figure out what’s going on.
2. Ask Questions:
○ If you’re an analyst, ask for more details about things that seem unclear.
3. Check the Trend:
○ Compare this call with previous ones. If they’re always neutral, the company
might not be growing much.
4. Watch for Hidden Risks:
○ Neutral sentiment can sometimes mean they’re hiding problems. Check
industry trends for clues.
5. Wait and Watch:
○ If you’re unsure, hold your investment for now and see how things develop.
3. Negative Sentiment
What It Means: