Reading Material - Module-5 - Introduction To Special Topics
FINANCIAL ANALYTICS
Curated by Kiran Kumar K V
Content Structure
Mathematical Formulation
The logistic regression model can be represented as follows:
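P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)}}
or, equivalently, in log-odds (logit) form,
\log\left(\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k
where Y is the binary outcome, X_1, \ldots, X_k are the predictor variables, and \beta_0, \beta_1, \ldots, \beta_k are the coefficients to be estimated.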
The parameters of the logistic regression model are estimated using maximum
likelihood estimation (MLE). The objective is to maximize the likelihood of observing the given
binary outcomes (0 or 1) given the input features and the model parameters.
Model Interpretation
Coefficients Interpretation - The coefficients (𝛽 values) of the logistic regression model
represent the change in the log-odds of the outcome variable for a one-unit change
in the corresponding predictor variable, holding other variables constant.
Odds Ratio - The exponentiated coefficients (e^β values) represent the odds ratio, which
quantifies the multiplicative change in the odds of the positive outcome for a one-unit
change in the predictor variable.
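For illustration, with a hypothetical coefficient of \beta = 0.4 on one predictor, the odds ratio is e^{0.4} \approx 1.49: a one-unit increase in that predictor multiplies the odds of the positive outcome by about 1.49 (roughly a 49% increase in the odds), holding the other variables constant.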
Model Evaluation
Metrics - Logistic regression models are evaluated using various metrics such as
accuracy, precision, recall, F1-score, ROC curve, and AUC-ROC (Area Under the ROC
Curve). These metrics assess the model's performance in correctly classifying instances
into the appropriate classes.
Cross-Validation - Cross-validation techniques such as k-fold cross-validation are used
to assess the generalization performance of the model and detect overfitting.
Case of Credit Default Prediction using Logistic Regression
In this project, we aim to build a credit default prediction model using logistic regression on
a dataset containing various features related to borrowers' credit behavior. The dataset
includes information such as credit utilization, age, income, past payment history, and number
of dependents. The target variable, 'SeriousDlqin2yrs', indicates whether a person
experienced a 90-day past due delinquency or worse.
Dataset Description
SeriousDlqin2yrs: Binary variable indicating whether the borrower experienced a 90-day
past due delinquency or worse (Y/N).
In this script (a sketch of which is given after this list):
Line 1-8 - Import necessary libraries/packages for data manipulation, visualization, model
building, and evaluation.
Line 11 - Read the loan dataset from a CSV file into a pandas DataFrame and display its
dimensions.
Line 13 - Group the data by the target variable ('SeriousDlqin2yrs') and display the count
of each class.
Line 16-17 - Define the independent variables (features) and dependent variable (target
variable) by splitting the dataset.
Line 20-21 - Split the data into training and test sets with 70% training data and 30% test
data.
Line 24 - Initialize the Logistic Regression model.
Line 27 - Fit the logistic regression model to the training data to estimate the coefficients.
Line 30 - Make predictions on the test data using the trained logistic regression model.
Line 33 - Generate and plot the confusion matrix to evaluate the model's performance.
Line 36-38 - Generate and print the classification report containing precision, recall, F1-
score, and support for each class.
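A minimal sketch of a script matching these steps is given below. The file name loan_data.csv is a placeholder for the actual dataset, rows with missing values are simply dropped for brevity, and the line numbers will not correspond exactly to those referenced above.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

# Read the loan dataset and display its dimensions
df = pd.read_csv("loan_data.csv").dropna()   # placeholder file name; drop missing values
print(df.shape)

# Count of each class of the target variable
print(df.groupby("SeriousDlqin2yrs").size())

# Independent variables (features) and dependent variable (target)
X = df.drop(columns=["SeriousDlqin2yrs"])
y = df["SeriousDlqin2yrs"]

# 70% training data, 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialise and fit the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predictions on the test data
y_pred = model.predict(X_test)

# Confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()

# Classification report: precision, recall, F1-score and support per class
print(classification_report(y_test, y_pred))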
Confusion Matrix
The confusion matrix provides a detailed breakdown of the model's predictions compared to
the actual class labels.
True Negative (TN): 41279 - The model correctly predicted 41279 instances as negative
(non-default) that are actually negative.
True Positive (TP): 53 - The model correctly predicted 53 instances as positive (default)
that are actually positive.
False Positive (FP): 35 - The model incorrectly predicted 35 instances as positive
(default) that are actually negative (non-default).
False Negative (FN): 3001 - The model incorrectly predicted 3001 instances as negative
(non-default) that are actually positive (default).
Accuracy measures the overall correctness of the model's predictions and is calculated as the
ratio of correctly predicted instances to the total number of instances.
The model achieves an accuracy of approximately 93.19%, indicating that it correctly predicts
the class labels for 93.19% of the instances in the dataset.
Precision measures the proportion of true positive predictions among all positive predictions
and is calculated as the ratio of true positives to the sum of true positives and false positives.
Here the precision for the positive (default) class is 53 / (53 + 35) ≈ 60%, meaning that among
all instances predicted as default, about 60% are actually defaults.
Recall (Sensitivity) measures the proportion of true positive predictions among all actual
positive instances and is calculated as the ratio of true positives to the sum of true positives
and false negatives.
Here the recall for the positive (default) class is 53 / (53 + 3001) ≈ 1.7%, indicating that the
model identifies only about 2% of all actual positive (default) instances.
We can also look at the Classification Report to evaluate the model:
The classification report presents the performance metrics of a binary classification model.
Here's the interpretation:
Precision - Precision measures the proportion of true positive predictions among all
positive predictions. For class 0 (non-default), the precision is 0.93, indicating that 93% of
the instances predicted as non-default are actually non-default. For class 1 (default), the
precision is 0.60, meaning that only 60% of the instances predicted as default are actually
default.
Recall (Sensitivity) - Recall measures the proportion of true positive predictions among all
actual positive instances. For class 0, the recall is 1.00, indicating that the model correctly
identifies all non-default instances. However, for class 1, the recall is very low at 0.02,
indicating that the model misses a significant number of actual default instances.
F1-score - The F1-score is the harmonic mean of precision and recall and provides a
balance between the two metrics. For class 0, the F1-score is 0.96, reflecting a high level
of accuracy in predicting non-default instances. However, for class 1, the F1-score is only
0.03, indicating poor performance in predicting default instances.
Accuracy - Accuracy measures the overall correctness of the model's predictions. In this
case, the accuracy is 0.93, indicating that the model correctly predicts the class labels for
93% of the instances in the test dataset.
Macro Average - The macro average calculates the average of precision, recall, and F1-
score across all classes. In this case, the macro average precision is 0.77, recall is 0.51, and
F1-score is 0.50.
Weighted Average - The weighted average calculates the average of precision, recall, and
F1-score weighted by the number of instances in each class. In this case, the weighted
average precision is 0.91, recall is 0.93, and F1-score is 0.90.
Overall, the model performs well in predicting non-default instances (class 0) with high
precision, recall, and F1-score. However, it struggles to correctly identify default instances
(class 1), leading to low recall and F1-score for this class. The high accuracy is mainly driven
by the large number of non-default instances in the dataset, but the model's performance on
default instances is unsatisfactory. Further improvement is needed to enhance the model's
ability to predict default cases accurately.
Monte Carlo Simulation
Monte Carlo simulation is extensively used in finance for pricing options, simulating asset
prices, and assessing portfolio risk. It helps in understanding the potential range of returns
and the likelihood of different financial scenarios.
Case of Stock Price Prediction using Monte Carlo Simulation
Problem Statement - Predicting the future stock price of a given company using historical
price data and Monte Carlo simulation.
Steps to Follow
Step-1. Data Collection - Gather historical stock price data for the company of interest. This
data typically includes the date and closing price of the stock over a specified period.
Step-2. Calculate Returns - Compute the daily returns of the stock using the historical price
data. Daily returns are calculated as the percentage change in stock price from one
day to the next.
Step-3. Calculate Mean and Standard Deviation - Calculate the mean and standard deviation
of the daily returns. These parameters will be used to model the behavior of the
stock price.
Step-4. Generate Random Price Paths - Use Monte Carlo simulation to generate multiple
random price paths based on the calculated mean, standard deviation, and the
current stock price. Each price path represents a possible future trajectory of the
stock price.
Step-5. Analyze Results - Analyze the distribution of simulated price paths to understand
the range of potential outcomes and assess the likelihood of different scenarios.
We start by collecting historical stock price data and calculating the daily returns. Using the
daily returns, we compute the mean (mu) and standard deviation (sigma) of returns, which
serve as parameters for the Monte Carlo simulation. We specify the number of simulations
and the number of days to simulate into the future. In the Monte Carlo simulation loop, we
generate random daily returns based on a normal distribution with mean mu and standard
deviation sigma. Using these daily returns, we calculate the simulated price paths for each
simulation. Finally, we plot the simulated price paths along with the historical prices to
visualize the range of potential outcomes.
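A minimal sketch of this simulation is given below; the ticker, look-back period, number of simulations and horizon are all assumptions, and the yfinance package is used here only as one convenient way to obtain the historical prices.

import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf

# Step 1: historical closing prices (ticker and period are assumptions)
prices = yf.download("AAPL", period="2y")["Close"].squeeze()

# Step 2: daily returns as the percentage change in price
returns = prices.pct_change().dropna()

# Step 3: mean and standard deviation of daily returns
mu, sigma = returns.mean(), returns.std()

# Step 4: generate random price paths from the last observed price
n_simulations, n_days = 1000, 252
last_price = prices.iloc[-1]
random_returns = np.random.normal(mu, sigma, size=(n_days, n_simulations))
paths = last_price * np.cumprod(1 + random_returns, axis=0)

# Step 5: plot the simulated paths to see the range of potential outcomes
plt.plot(paths, linewidth=0.5, alpha=0.3)
plt.title("Monte Carlo simulated price paths")
plt.xlabel("Days ahead")
plt.ylabel("Simulated price")
plt.show()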
VADER Sentiment Analysis
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based
sentiment analysis tool: its lexicon maps words and common emoticons to valence scores that
capture how positive or negative they are.
Rules - In addition to the lexicon, VADER incorporates a set of rules and heuristics to
handle sentiment in text data more accurately. These rules account for various linguistic
features such as capitalization, punctuation, degree modifiers, conjunctions, and
emoticons.
Sentiment Scores - VADER produces sentiment scores for each input text, including -
Positive Score - The proportion of words in the text that are classified as positive.
Negative Score - The proportion of words in the text that are classified as negative.
Neutral Score - The proportion of words in the text that are classified as neutral.
Compound Score - A single score that represents the overall sentiment of the text,
calculated by summing the valence scores of the words in the text, adjusted for
intensity and polarity, and normalized to lie between -1 (most negative) and +1
(most positive). A text is categorized as positive if its compound score is greater
than 0, negative if it is less than 0, and neutral if it equals 0, as illustrated in the
short example below.
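For example, using the vaderSentiment package (one common Python implementation of VADER):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The company reported record profits and the stock surged")
print(scores)  # dictionary with 'neg', 'neu', 'pos' and 'compound' scores

# Categorise the text using the compound score, as described above
if scores["compound"] > 0:
    label = "positive"
elif scores["compound"] < 0:
    label = "negative"
else:
    label = "neutral"
print(label)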
Case of VADER Sentiment Analysis on News Headlines of a Stock
Problem Statement – Use the NewsAPI and VADER sentiment analysis tool to streamline the
process of gathering, analyzing, and visualizing sentiment from news headlines related to a
specific stock symbol.
*NewsAPI is a tool for accessing a vast array of news articles and headlines from around the
world. It provides developers with a simple and intuitive interface to search and retrieve news
content based on various criteria such as keywords, language, sources, and publication dates.
With its extensive coverage of news sources, including major publications and local news
outlets, NewsAPI offers users the ability to stay updated on the latest developments across a
wide range of topics and industries. Whether for research, analysis, or staying informed,
NewsAPI facilitates seamless access to timely and relevant news content, making it an
invaluable resource for developers, researchers, journalists, and anyone seeking access to up-
to-date news information. (https://newsapi.org/)
Steps followed
User inputs a stock symbol of interest.
The Python script interacts with the NewsAPI to fetch top headlines related to the
specified stock symbol.
Using the VADER sentiment analysis tool, the script analyzes the sentiment of each
headline, categorizing it as positive, neutral, or negative.
The sentiment distribution among the collected headlines is visualized through a bar
chart, providing insights into market sentiment trends.
Additionally, the script generates a word cloud based on the collected headlines,
highlighting the most frequently occurring words and visualizing key themes.
Python Script
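A minimal sketch of such a script is given below. It assumes a valid NewsAPI key (the placeholder YOUR_API_KEY) and uses the vaderSentiment package; the endpoint, parameters and chart details are illustrative rather than a reproduction of the original script, and the word-cloud step is sketched separately in the following case.

import requests
import matplotlib.pyplot as plt
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

API_KEY = "YOUR_API_KEY"                      # placeholder NewsAPI key
symbol = input("Enter a stock symbol: ")

# Fetch top headlines mentioning the symbol from NewsAPI
response = requests.get(
    "https://newsapi.org/v2/top-headlines",
    params={"q": symbol, "pageSize": 100, "apiKey": API_KEY},
).json()
headlines = [a["title"] for a in response.get("articles", []) if a.get("title")]

# Score each headline with VADER and categorise it by its compound score
analyzer = SentimentIntensityAnalyzer()
labels = []
for headline in headlines:
    compound = analyzer.polarity_scores(headline)["compound"]
    labels.append("positive" if compound > 0 else "negative" if compound < 0 else "neutral")

# Bar chart of the sentiment distribution among the collected headlines
counts = {label: labels.count(label) for label in ["negative", "neutral", "positive"]}
print(counts)
plt.bar(list(counts.keys()), list(counts.values()))
plt.title(f"Headline sentiment for {symbol}")
plt.ylabel("Number of headlines")
plt.show()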
Output
These numbers represent the sentiment distribution among the analyzed headlines.
Specifically:
There were 17 headlines with a negative sentiment.
There were 40 headlines with a neutral sentiment.
There were 37 headlines with a positive sentiment.
In the histogram, the x-axis represents the range of compound scores, while the y-axis
represents the frequency or count of headlines falling within each score range. The histogram
is divided into bins, with each bin representing a range of compound scores.
When the majority of observations are around 0 on the histogram, it indicates that a
significant proportion of the analyzed headlines have a neutral sentiment. This means that
these headlines neither convey strongly positive nor strongly negative sentiment. Instead,
they are likely reporting factual information or presenting a balanced view of the topic.
In financial news analysis, it's common to observe a clustering of headlines around a neutral
sentiment, as news reporting often aims to provide objective and factual information to
investors. However, the presence of headlines with extreme positive or negative sentiment
scores can also be indicative of noteworthy developments or market sentiment shifts that
investors may find important to consider in their decision-making process.
WORDCLOUD as a Sentiment Analysis Tool
Word clouds are graphical representations of text data where the size of each word indicates
its frequency or importance within the text. Word clouds provide a visually appealing way to
identify and visualize the most frequently occurring words in a collection of text data. In
finance, this can include news headlines, financial reports, analyst opinions, or social media
chatter related to stocks, companies, or market trends.
By analyzing the words that appear most frequently in a word cloud, analysts can identify
terms that are strongly associated with positive, negative, or neutral sentiment. For example,
words like "profit," "growth," and "bullish" may indicate positive sentiment, while words like
"loss," "decline," and "bearish" may indicate negative sentiment.
The size or prominence of each word in the word cloud reflects its frequency or importance
within the text data. This allows analysts to gauge the intensity of sentiment associated with
certain terms. Larger words typically represent terms that are more prevalent or impactful in
conveying sentiment.
Word clouds provide context by displaying words in relation to one another, allowing analysts
to understand the broader context of sentiment within the text data. For example, positive
and negative terms may appear together, providing insights into nuanced sentiment or
conflicting viewpoints.
Case of WORDCLOUD generation for Sentiment Analysis on News Headlines of a Stock
Problem Statement - Generate a visual representation of the most frequently occurring words
in the collected headlines, thereby providing users with insights into prevalent themes and
sentiment trends.
Steps followed
Step-1. Prompt the user to enter a search query (e.g., stock name or topic of interest).
Step-2. Utilize the NewsAPI to fetch top headlines related to the user-entered search query.
Step-3. Extract the headlines and publication dates from the fetched news articles.
Step-4. Concatenate the extracted headlines into a single text string.
Step-5. Exclude the queried word from the text to avoid bias.
Step-6. Generate a word cloud from the concatenated text using the WordCloud library.
Step-7. Display the generated word cloud as a visual representation of the most frequently
occurring words in the news headlines.
Python Script
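A minimal sketch following these steps is given below, again with a placeholder NewsAPI key; the wordcloud package's built-in stop-word list is combined with the queried word so that the query itself does not dominate the image.

import requests
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

API_KEY = "YOUR_API_KEY"                      # placeholder NewsAPI key
query = input("Enter a search query (e.g. a stock name): ")

# Steps 2-3: fetch top headlines for the query and extract the titles
response = requests.get(
    "https://newsapi.org/v2/top-headlines",
    params={"q": query, "pageSize": 100, "apiKey": API_KEY},
).json()
headlines = [a["title"] for a in response.get("articles", []) if a.get("title")]
if not headlines:
    raise SystemExit("No headlines returned for this query")

# Step 4: concatenate the headlines into a single text string
text = " ".join(headlines)

# Step 5: exclude the queried word itself to avoid bias
stopwords = set(STOPWORDS) | {query.lower()}

# Steps 6-7: generate and display the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white",
                      stopwords=stopwords).generate(text)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()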
Output
Bootstrapping
Bootstrapping is a resampling technique in which repeated samples are drawn, with
replacement, from the observed data in order to estimate the sampling distribution of a
statistic such as a mean or a standard deviation. The resampling is typically repeated around
1,000 times, which gives a high level of confidence in the reliability of the resulting estimates,
although bootstrapping can also be carried out with far fewer resamples, such as 50. Because
each bootstrap sample is drawn with replacement, roughly one-third of the original
observations are left out of any given sample; these "out-of-bag" observations provide a
natural hold-out set, which is useful when thinking about how to form training and test sets.
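The one-third figure follows directly from sampling with replacement: when n observations are drawn with replacement from a dataset of size n, the probability that any particular observation is never selected is
\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368 \quad \text{for large } n,
so about 36.8% of the observations are expected to be out-of-bag in a single bootstrap sample.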
Case of Applying Bootstrapping in Average & Standard Deviation of Stock Returns
Problem Statement - Apply the bootstrapping technique to estimate the mean and standard
deviation of stock returns based on historical price data.
Steps Followed
Step-1. Use the Yahoo Finance API to collect historical stock price data for a specified
number of days.
Step-2. Compute the daily returns from the collected stock price data.
Step-3. Bootstrapping - Implement the bootstrapping technique to estimate the mean and
standard deviation of returns.
Step-4. Generate multiple bootstrapped samples by resampling with replacement.
Step-5. Compute the mean and standard deviation of returns for each bootstrapped
sample.
Step-6. Present the results in a tabular format, including the stock name, period of data
collection, mean of returns, standard deviation of returns, and confidence intervals.
Step-7. Optionally, visualize the data and results using plots or charts for better
understanding.
Python Code
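A minimal sketch of the core of this workflow is given below; the ticker and look-back period are assumptions, the yfinance package stands in for the Yahoo Finance API mentioned above, and the tabulation and plotting steps are omitted for brevity.

import numpy as np
import yfinance as yf

# Steps 1-2: historical prices and daily returns (ticker and period are assumptions)
prices = yf.download("AAPL", period="1y")["Close"].squeeze()
returns = prices.pct_change().dropna().values

# Steps 3-5: draw bootstrap samples with replacement and record the statistics
n_boot = 1000
boot_means, boot_stds = [], []
for _ in range(n_boot):
    sample = np.random.choice(returns, size=len(returns), replace=True)
    boot_means.append(sample.mean())
    boot_stds.append(sample.std())

# Step 6: point estimates and 95% confidence intervals
print("Mean of returns:", np.mean(boot_means),
      "95% CI:", np.percentile(boot_means, [2.5, 97.5]))
print("Std dev of returns:", np.mean(boot_stds),
      "95% CI:", np.percentile(boot_stds, [2.5, 97.5]))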
Output
Cross-Validation
Cross-validation is a widely used technique in machine learning and statistical modeling for
assessing the performance and generalization ability of predictive models. It is particularly
useful when dealing with a limited amount of data or when trying to avoid overfitting.
In cross-validation, the available data is split into multiple subsets or folds. The model is
trained on a subset of the data, called the training set, and then evaluated on the remaining
data, called the validation set or test set. This process is repeated multiple times, with each
fold serving as the validation set in one iteration and as part of the training set in the others.
The most common type of cross-validation is k-fold cross-validation, where the data is
divided into k equal-sized folds. The model is trained k times, each time using k-1 folds for
training and the remaining fold for validation. The performance metrics (e.g., accuracy, error)
obtained from each iteration are then averaged to provide a more reliable estimate of the
model's performance.
Cross-validation helps to:
Reduce Overfitting - By evaluating the model's performance on multiple subsets of data,
cross-validation provides a more accurate assessment of how well the model will
generalize to unseen data.
Utilize Data Efficiently - It allows for the maximum utilization of available data by using
each data point for both training and validation.
Tune Model Hyperparameters - Cross-validation is often used in hyperparameter tuning to
find the optimal set of hyperparameters that yield the best model performance.
There are variations of cross-validation techniques, such as stratified k-fold cross-validation
(ensuring that each fold preserves the proportion of class labels), leave-one-out cross-
validation (each data point serves as a separate validation set), and nested cross-validation
(used for model selection and hyperparameter tuning within each fold).
Steps followed
Step-1. Utilize the Yahoo Finance API to collect historical stock price data for a specified stock
symbol over a defined time period.
Step-2. Model Training and Evaluation - Implement a linear regression model to predict stock
prices based on historical data.
Step-3. Perform k-fold cross-validation to evaluate the model's predictive performance.
Step-4. Split the historical data into k subsets (folds) and train the model on k-1 subsets while
evaluating its performance on the remaining subset.
Step-5. Compute evaluation metrics (e.g., R-squared score) for each fold to assess the
model's accuracy and generalization ability.
Step-6. Results Analysis - Calculate the mean and standard deviation of evaluation metrics
(e.g., mean R-squared score) across all folds.
Step-7. Interpret the results to determine the model's predictive performance and reliability.
Python Script
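A minimal sketch of this workflow is given below. The ticker and period are assumptions, and because the original script's feature set is not shown here, the previous day's closing price is used as a single illustrative feature.

import yfinance as yf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Step 1: historical closing prices (ticker and period are assumptions)
prices = yf.download("AAPL", period="2y")["Close"].squeeze().values

# Illustrative feature: yesterday's price is used to predict today's price
X = prices[:-1].reshape(-1, 1)
y = prices[1:]

# Steps 2-5: linear regression evaluated with k-fold cross-validation (R-squared per fold)
model = LinearRegression()
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")

# Step 6: mean and standard deviation of the R-squared scores across the folds
print("R-squared per fold:", scores)
print("Mean R-squared:", scores.mean(), "Std dev:", scores.std())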
Output
The R-squared score measures the proportion of the variance in the dependent variable (stock
prices) that is explained by the independent variable(s) (features used in the model). A higher
mean R-squared score (closer to 1) indicates that the model explains a larger proportion of
the variance in the target variable and is better at predicting stock prices. In this case, the
mean R-squared score of 0.75 suggests that the linear regression model performs well in
explaining the variation in stock prices, capturing about 75% of the variance in the data.
The standard deviation of the R-squared score provides information about the variability
or consistency of the model's performance across different folds. A lower standard deviation
indicates less variability in performance across folds. In this case, the standard deviation of
approximately 0.02 suggests that the model's performance is relatively consistent across
different folds.
Case of Stock Price Forecasting using Linear Regression
Steps followed
Step-1. Extract the features (independent variables) and the target variable (stock price).
Step-2. Apply standard scaling to the features to normalize the data.
Step-3. Split the dataset into training and testing sets.
Step-4. Train a linear regression model using the training data.
Step-5. Prepare data for predicting the stock price for the next 5 quarters.
Step-6. Predict the stock prices for the next 5 quarters using the trained model.
Step-7. Visualize historical prices and predicted prices on a line graph.
Step-8. Display the graph to compare historical and predicted stock prices.
Python Code
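A minimal sketch of these steps is given below. As noted for the output, the data is a randomly generated sample; the two features used here (revenue and earnings per share) are purely hypothetical.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Randomly generated quarterly sample (hypothetical features: revenue and EPS)
rng = np.random.default_rng(0)
n_quarters = 20
data = pd.DataFrame({
    "revenue": rng.normal(100, 10, n_quarters),
    "eps": rng.normal(5, 1, n_quarters),
})
data["price"] = 10 + 0.5 * data["revenue"] + 4 * data["eps"] + rng.normal(0, 5, n_quarters)

# Features and target, with standard scaling applied to the features
X = data[["revenue", "eps"]].values
y = data["price"].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and testing sets, then train the linear regression model
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Test R-squared:", model.score(X_test, y_test))

# Hypothetical feature values for the next 5 quarters, scaled with the same scaler
future = pd.DataFrame({
    "revenue": rng.normal(105, 10, 5),
    "eps": rng.normal(5.2, 1, 5),
})
future_prices = model.predict(scaler.transform(future.values))

# Plot historical prices and the 5 predicted quarters on one line graph
plt.plot(range(n_quarters), y, label="Historical")
plt.plot(range(n_quarters, n_quarters + 5), future_prices, label="Predicted")
plt.xlabel("Quarter")
plt.ylabel("Price")
plt.legend()
plt.show()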
Output (as the data is a randomly generated sample, the output will not resemble realistic stock prices)
~~~~~~~~~~