Supervised Learning
• It helps to solve various types of real-world computation problems.
• It performs classification and regression tasks.
• It allows the learned mapping to be used to estimate the output for a new, unseen sample.
• We have complete control over choosing the number of classes we want in the training data.
• Linear Regression
Linear regression is a type of supervised machine learning algorithm that computes the linear
relationship between the dependent variable and one or more independent features by fitting a linear
equation to observed data. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such as sales, salary, age,
product price, etc.
When there is only one independent feature, it is known as Simple Linear Regression, and when
there is more than one feature, it is known as Multiple Linear Regression.
Similarly, when there is only one dependent variable, it is considered Univariate Linear Regression,
while when there is more than one dependent variable, it is known as Multivariate Regression.
Y = β0 + β1X + ϵ
Components:
• Y: Predicted value of the dependent variable.
• β0: Intercept, the predicted value of Y when X = 0.
• β1: Slope, indicating the rate of change in Y for a unit change in X.
• X: Independent variable.
• ϵ: Error term representing the difference between actual and predicted values.
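As a minimal illustration, the sketch below fits a simple linear regression with scikit-learn on synthetic data; the dataset, variable names, and library choice are assumptions for illustration, not part of the notes.

```python
# Minimal sketch: fit Y = b0 + b1*X + error on synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))               # independent variable
y = 3.0 + 2.5 * X[:, 0] + rng.normal(0, 1.0, 100)   # Y = b0 + b1*X + noise

model = LinearRegression().fit(X, y)
print("Intercept (b0):", model.intercept_)
print("Slope (b1):", model.coef_[0])
print("Prediction for X = 5:", model.predict([[5.0]])[0])
```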
Assumptions of Simple Regression
1. Linearity:
o Assumption: The relationship between the independent variable (X) and dependent
variable (Y) is linear.
o Example: Predicting house prices (Y) based on area size (X). A scatterplot should
show a straight-line trend.
2. Independence of Errors:
o Assumption: Residuals (differences between observed and predicted values) are
not correlated.
o Example: Predicting stock prices using previous day’s data. If residuals show a pattern,
this assumption is violated (e.g., trends in residuals over time).
3. Homoscedasticity:
o Assumption: The variance of residuals is constant across all values of X.
o Example: If you predict income (Y) based on education level (X), residuals should not
widen or narrow as X increases. If they do, heteroscedasticity is present.
4. Normality of Errors:
o Assumption: Residuals follow a normal distribution.
o Example: When predicting sales (Y) based on advertising spend (X), plot a histogram
of residuals. A bell-shaped curve indicates normality.
5. No Multicollinearity:
o Assumption: Not applicable in simple regression since there is only one independent
variable.
o Example: In predicting sales (Y) based on price (X), there’s no interaction with other
predictors because only one exists.
6. No Perfect Correlation:
o Assumption: The dependent variable (Y) should not perfectly correlate with the
independent variable (X).
o Example: If predicting temperature in Celsius (Y) from Fahrenheit (X), the correlation
is perfect (R² = 1), which may overfit.
7. Measurement Accuracy:
o Assumption: X values are measured accurately without significant errors.
o Example: If predicting weight (Y) based on height (X), imprecise height
measurements could violate this assumption.
Residual Analysis Example
When checking assumptions:
• Linearity: Scatterplot of X vs. Y shows a linear trend.
• Homoscedasticity: Residual plot shows no funnel shape (constant spread).
• Normality: Histogram or Q-Q plot of residuals shows a normal distribution.
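A hedged sketch of these checks, reusing the model, X, and y from the earlier simple-regression sketch (matplotlib and SciPy are assumed to be available):

```python
# Residual diagnostics: linearity, homoscedasticity, and normality checks.
import matplotlib.pyplot as plt
import scipy.stats as stats

y_pred = model.predict(X)
residuals = y - y_pred

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(X[:, 0], y, s=10)          # linearity: X vs Y should look roughly linear
axes[0].set_title("X vs Y")
axes[1].scatter(y_pred, residuals, s=10)   # homoscedasticity: no funnel shape expected
axes[1].axhline(0, color="red")
axes[1].set_title("Residuals vs fitted")
stats.probplot(residuals, plot=axes[2])    # normality: points should follow the line
axes[2].set_title("Q-Q plot of residuals")
plt.tight_layout()
plt.show()
```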
1. Scatter Plot
• Purpose: To visually represent the data points and observe the relationship between the
independent variable (X) and the dependent variable (Y).
• How to Create:
o Plot the values of the independent variable (X) on the horizontal axis (X-axis).
o Plot the values of the dependent variable (Y) on the vertical axis (Y-axis).
o Each data point is represented as a dot (X, Y).
• Insight:
o A linear relationship is indicated by points roughly forming a straight line.
o A non-linear relationship shows a curve or other patterns in the scatter plot.
o Outliers (points far from the general trend) may also be visible in the plot.
Example:
• In a dataset predicting salary based on years of experience, a scatter plot can show how salary
increases as experience grows. If the points form a straight line, it suggests a linear
relationship.
3. Residual Plot
• Purpose: To visualize the residuals, which are the differences between observed (actual)
values and predicted values from the regression model. A residual plot is useful to check
whether the model assumptions are met.
• How to Create:
o Calculate the residuals: Residual=Observed value−Predicted value.
o Plot the residuals on the Y-axis and either the independent variable (X) or the predicted
values on the X-axis.
o The residual plot will show whether the residuals are randomly scattered or if there
are patterns.
• Insight:
o Random Scatter: If the residuals are randomly scattered around zero, the linear model
is appropriate, and the assumptions of linearity, independence, and homoscedasticity
(constant variance of errors) are likely met.
o Patterns in Residuals: If you observe patterns (e.g., a curve), it indicates that a
linear model might not be the best fit for the data, suggesting non-linearity or model
misspecification.
Example:
• If you observe a "fan shape" in the residual plot (wider spread of residuals as X increases), it
indicates that the variance of the residuals is not constant, violating the assumption of
homoscedasticity.
4. Histogram of Residuals
• Purpose: To check if the residuals are normally distributed, which is one of the assumptions
of simple linear regression. Normally distributed residuals indicate that the regression model
is valid.
• How to Create:
o Calculate the residuals.
o Plot a histogram with the residual values on the X-axis and their frequency on the Y-axis.
• Insight:
o If the residuals follow a bell-shaped curve, it suggests that the model's assumptions
about normality are met.
o Non-normal distribution: If the residuals show skewness or other patterns, this
suggests that the model might not adequately represent the data, or that
transformations of the variables might be necessary.
Example:
• A normal distribution of residuals might indicate that the relationship between the independent
and dependent variables is well-modeled, whereas a skewed or bimodal histogram suggests
other underlying issues.
Summary
Visualization techniques are critical in diagnosing, evaluating, and interpreting simple regression
models. Scatter plots, regression lines, and residual plots help assess the linear relationship,
while histograms and QQ plots allow for checking the assumptions of normality and
homoscedasticity. These visual tools help ensure that the model fits the data well and meets the
underlying assumptions, making them an integral part of the regression analysis process.
Multiple Linear Regression
Definition:
Multiple linear regression extends simple regression by modeling the relationship between one
dependent variable (Y) and two or more independent variables (X1, X2, ..., Xn).
Objective:
To find a linear equation that predicts the dependent variable (Y) based on multiple independent
variables, minimizing errors between observed and predicted values.
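For illustration, the sketch below fits a multiple linear regression on synthetic data with two assumed predictors (area and number of rooms); none of the values come from the notes.

```python
# Multiple linear regression: price ~ b0 + b1*area + b2*rooms (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, 200)     # X1: area in square feet
rooms = rng.integers(1, 6, 200)        # X2: number of rooms
price = 50_000 + 80 * area + 10_000 * rooms + rng.normal(0, 5_000, 200)

X = np.column_stack([area, rooms])
model = LinearRegression().fit(X, price)
print("Intercept (b0):", model.intercept_)
print("Coefficients (b1, b2):", model.coef_)
```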
• Insights:
o VIF > 10 indicates severe multicollinearity.
o No visualization, but tabulated results help interpret correlations.
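A sketch of how VIF values can be tabulated with statsmodels' variance_inflation_factor; the synthetic columns below are an illustrative assumption.

```python
# Tabulate VIF per predictor; values above 10 flag severe multicollinearity.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
exog = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

vif = pd.Series(
    [variance_inflation_factor(exog.values, i) for i in range(exog.shape[1])],
    index=exog.columns,
)
print(vif)   # x1 and x2 should show VIF well above 10
```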
6. Predicted vs. Actual Values Plot
• Purpose: Assesses model accuracy and fit.
• How to Use: Plot actual values (y-axis) against predicted values (x-axis).
• Insights:
o A straight diagonal line indicates a good fit.
o Deviations suggest poor predictive performance.
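A minimal predicted-vs-actual plot, reusing model, X, and price from the multiple-regression sketch above:

```python
# Predicted vs. actual values; points near the red diagonal indicate a good fit.
import matplotlib.pyplot as plt

pred = model.predict(X)
plt.scatter(pred, price, s=10)
plt.plot([price.min(), price.max()], [price.min(), price.max()], color="red")
plt.xlabel("Predicted price")
plt.ylabel("Actual price")
plt.show()
```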
• Regression Trees
Definition of Regression Trees:
A regression tree is a decision tree specifically designed for predicting continuous numerical
values rather than categorical outcomes. It is a type of supervised learning algorithm that splits
the dataset based on input features to create decision rules that minimize the error in the
predicted continuous values. Regression trees are part of the decision tree family and follow the
same basic structure, but their target output differs.
• Structure: A regression tree has an internal structure consisting of nodes (decision points),
branches (splits based on features), and leaf nodes (final predictions). The goal is to partition
the data into subsets where the output variable (the one being predicted) is as consistent
as possible within each subset.
• Example: Predicting house prices based on features such as the number of rooms, square
footage, and location. The tree divides the data into subgroups (based on features like "square
footage > 1500") to make more accurate predictions.
• How It Differs from Classification Trees: While classification trees predict categorical
outcomes (e.g., class labels like "spam" or "not spam"), regression trees predict continuous
values (e.g., numeric values like price or temperature).
In regression trees, the splitting criterion at each node aims to minimize the mean squared
error (MSE), whereas in classification trees, the goal is to maximize information gain (using
metrics like Gini index or entropy).
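A small DecisionTreeRegressor sketch on synthetic house-price data; the features, values, and depth limit are illustrative assumptions.

```python
# Regression tree: each leaf predicts the mean price of the samples that reach it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(7)
sqft = rng.uniform(500, 3000, 300)
rooms = rng.integers(1, 6, 300)
price = 40_000 + 90 * sqft + 12_000 * rooms + rng.normal(0, 8_000, 300)

X = np.column_stack([sqft, rooms])
tree = DecisionTreeRegressor(max_depth=3, random_state=0)   # splits minimize squared error
tree.fit(X, price)
print(export_text(tree, feature_names=["sqft", "rooms"]))   # readable decision rules
```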
• Regression Trees:
o Non-Linear Model: Regression trees are non-parametric models. They do not
assume any specific relationship between the features and the target variable.
Instead, they partition the data into subsets using decision rules (splits) and predict the
mean target value within each subset.
o Tree Structure: The model is built as a binary tree, where each internal node
represents a decision rule (e.g., "Is X > 10?"), and each leaf node contains the
predicted target value.
o Flexibility: Regression trees can model complex, non-linear relationships. They
can capture interactions between features that linear regression may miss.
• Linear Regression:
o Linear Model: Linear regression assumes a linear relationship between the features
and the target variable. The model predicts the target by finding the best-fit line
that minimizes the sum of squared residuals (errors).
o Equation: The prediction is of the form:
y = β0 + β1x1 + β2x2 + ⋯ + βnxn
where y is the target variable, and β0,β1,…,βn are the model coefficients.
• Regression Trees:
o Easy to Interpret: Each decision path from the root to a leaf in a regression tree can
be traced back to a clear, logical decision rule. This makes regression trees relatively
easy to interpret, especially for non-technical stakeholders.
o Visual Representation: The tree structure offers a visual way to represent how
predictions are made based on feature splits.
o Limitation: However, very deep trees can become difficult to interpret, and large
trees may overfit the data.
• Linear Regression:
o Simple Interpretation: Linear regression coefficients are easy to interpret in terms of
the relationship between each feature and the target variable. The model’s simplicity
allows direct understanding of the impact of each predictor.
o Limitations: The simplicity also means that linear regression may not capture
complex relationships or interactions among features, which regression trees
handle naturally.
• Regression Trees:
o Non-Linear Relationships: Regression trees excel at capturing non-linear
relationships between features and the target variable. They can model piecewise
constant relationships, where the model may behave differently for different ranges of
feature values.
o Interactions: They automatically capture feature interactions, as the tree can split
on multiple features in different ways at each level.
• Linear Regression:
o Linear Assumption: Linear regression only works well when the relationship
between the predictors and the target variable is linear. If the true relationship is non-
linear, linear regression will likely underperform unless complex
transformations are applied to the features.
o Interactions: Interactions between features must be explicitly included as interaction
terms (e.g., x1×x2) in the model, which requires prior knowledge of the data.
• Regression Trees:
o Robust to Outliers: Regression trees are less sensitive to outliers compared to linear
regression because they split data based on feature values and are less influenced by
extreme values in the target variable.
o Handling Missing Data: Regression trees can handle missing values effectively by
using surrogate splits (splitting based on other features when one is missing).
• Linear Regression:
o Sensitive to Outliers: Linear regression is highly sensitive to outliers. Outliers can
disproportionately influence the slope of the regression line, leading to biased
predictions.
o Missing Data: Linear regression typically requires handling missing data before
modeling, as it cannot inherently deal with missing values.
• Regression Trees:
o Risk of Overfitting: If the tree grows too deep, it can fit the training data very
well but fail to generalize to new data (overfitting). Pruning is necessary to avoid
this.
o Model Complexity: Regression trees can become very complex with deep trees, but
they are able to handle large, complex datasets effectively.
• Linear Regression:
o Lower Risk of Overfitting: Linear regression has a lower risk of overfitting
compared to regression trees, especially when the number of predictors is not
very large. Regularization techniques (like Lasso or Ridge regression) can also help
prevent overfitting.
o Simplicity: Linear regression tends to be simpler and more stable, but it might fail if
the data has complex relationships.
• Regression Trees:
o Training Time: Training a regression tree is generally faster than fitting a linear
regression model when dealing with large datasets with complex feature
interactions.
o Computational Complexity: Tree-building involves recursive partitioning, and the
complexity increases with the depth of the tree.
• Linear Regression:
o Training Time: Linear regression is computationally efficient to train, especially
with fewer features. It has closed-form solutions (e.g., least squares), making it quick
to fit in simple cases.
o Scalability: Linear regression is less resource-intensive for large datasets compared
to decision trees, especially when the relationships are linear.
• Regression Trees: Ideal for capturing non-linear relationships, handling large datasets,
and performing well with complex feature interactions. However, they can become
complex and prone to overfitting without proper pruning.
• Linear Regression: Best suited for datasets where relationships between features and the
target are linear. It is simple, interpretable, and computationally efficient, but may
underperform in the presence of complex relationships or non-linearity.
Both techniques have their strengths and weaknesses, and the choice between them depends on the
nature of the data and the problem being solved.
2. Key Terminology in Regression Trees:
1. Nodes:
o Definition: Nodes are decision points where the data is split into two or more
branches based on the feature values. Each node tests a condition on a feature (e.g.,
"Square footage > 1500?").
o Purpose: Nodes serve to partition the data, and in regression trees, this is done to
reduce the variance of the predicted continuous values within each subset.
2. Leaf Nodes (Terminal Nodes):
o Definition: Leaf nodes are the end points of the tree, where a final decision
(prediction) is made. In regression trees, the predicted value at each leaf is typically
the mean (or average) of the target variable (e.g., house prices) for all data points
that reach that leaf.
o Example: If 100 house prices in a leaf node are between 300,000 and 350,000, the
leaf node will predict a value around the average price of those houses.
3. Splitting:
o Definition: Splitting is the process of dividing the data at each node based on the
values of a specific feature. The objective is to find the feature that best divides
the data into subsets with the smallest variance in the target variable.
o Criteria: In regression trees, mean squared error (MSE) or variance reduction is
used to determine the best feature to split the data at each node. The split
minimizes the prediction error within each subset.
4. Mean Squared Error (MSE):
o Definition: MSE is the most common metric for evaluating the quality of a regression
tree. It calculates the average squared difference between the predicted value and
the true value. The lower the MSE, the better the model is at predicting the target
variable.
5. Pruning:
o Definition: Pruning is the process of trimming back a tree that has grown too
complex, which can reduce overfitting. After building the tree, we might prune some
branches if they do not improve the model's generalizability to unseen data.
o Purpose: Pruning removes unnecessary branches that might fit noise in the data,
improving model performance and reducing variance.
o Methods: Techniques like cost-complexity pruning (also known as weakest link
pruning) can be used to prune trees by evaluating the tradeoff between the complexity
of the tree and its predictive power.
6. Overfitting:
o Definition: Overfitting occurs when a model is too complex and captures noise or
random fluctuations in the training data, leading to poor generalization to new,
unseen data.
o Signs: Overfitting is often indicated by a very low training error but a high test
error. In regression trees, this happens when the tree grows too deep and the leaf
nodes represent small subsets of the data that fit the training set very well but
don't generalize well.
7. Variance Reduction:
o Definition: Variance reduction is the process of choosing splits that minimize the
variability of the target variable within each node. A good split results in more
homogeneous subsets, meaning less variability in the predicted values.
o Goal: The goal is to create nodes where the variance in the target variable is
minimized, improving the accuracy of the predictions made by the regression tree.
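The cost-complexity (weakest-link) pruning described under Pruning above can be sketched with scikit-learn as follows, reusing the X and price arrays from the regression-tree sketch; the simple search loop is an illustrative assumption, not the only way to choose the penalty.

```python
# Cost-complexity pruning: grow a full tree, then pick the ccp_alpha that
# generalizes best on held-out data.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, -float("inf")
for alpha in path.ccp_alphas:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)        # R^2 on unseen data
    if score > best_score:
        best_alpha, best_score = alpha, score

print("Best ccp_alpha:", best_alpha, "held-out R^2:", best_score)
```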
2. Underfitting:
Underfitting happens when the model is too simple to capture the underlying patterns of the data,
leading to poor performance on both the training and test datasets. It occurs when the model
fails to learn the relationships in the data, resulting in low predictive accuracy.
• Key Indicators of Underfitting:
o Shallow Trees: A regression tree that is too shallow may miss important patterns
because it doesn't make enough splits to capture the complexities of the data.
o Low Training and Test Accuracy: The model performs poorly on both the training
and test sets because it cannot capture the underlying structure of the data.
o Bias: Underfitting often results in high bias, where the model’s predictions are
consistently off because it is too simplistic to capture the relationships.
• Prevention Techniques:
o Increasing Tree Complexity: Allowing the tree to grow deeper or using a smaller
minimum number of samples per leaf can help the model capture more patterns in
the data.
o Feature Engineering: Creating new features or transforming existing ones may
provide the model with more information and improve its ability to make accurate
predictions.
o Relaxing Regularization: Reducing regularization constraints (e.g., increasing the
maximum depth, reducing pruning) allows the model more flexibility to capture
complex relationships.
2. Seasonality
The seasonal component refers to regular, repeating patterns or fluctuations in a time series that
occur at fixed intervals, such as daily, monthly, or yearly. Seasonality is typically influenced by
natural cycles or societal customs, such as weather changes, holidays, or cultural events. For
example, retail sales tend to spike in December due to Christmas shopping and drop in January as
consumer spending slows down. Similarly, electricity usage often exhibits seasonal patterns, peaking
during summer months when air conditioners are used more frequently. Seasonal analysis is crucial
for understanding short-term patterns and preparing for predictable changes, such as increasing
inventory before peak sales periods.
• Example:
Retail sales data often show a seasonal pattern, with peaks during December due to Christmas
shopping and dips in January as consumer spending decreases. Similarly, electricity
consumption in summer tends to increase due to the extensive use of air conditioning.
• Visualization:
Consider monthly sales data for a clothing store. The data shows higher sales during summer
and winter holiday seasons, with consistent peaks at the same months each year. The
seasonality forms a wavelike pattern over time.
3. Cyclic Component
The cyclic component captures long-term, irregular fluctuations that are not tied to fixed
intervals. Unlike seasonality, cycles are influenced by macroeconomic or external factors and do
not follow a predictable frequency. For instance, the stock market experiences cycles of growth
(bull markets) and decline (bear markets) due to factors like economic policies, geopolitical events,
and investor sentiment. Similarly, the housing market exhibits cycles of booms and busts, driven by
demand-supply imbalances and financial regulations. Cyclic patterns are challenging to predict
but provide insights into the broader context influencing the time series. Understanding cycles
helps analysts anticipate potential downturns or upswings over extended periods.
• Example:
The stock market often follows cycles of bull (growth) and bear (decline) markets, influenced
by economic factors, policies, and global events. Another example is the housing market,
which experiences cycles of high demand followed by periods of slower activity or decline.
• Visualization:
GDP growth data over decades often show alternating periods of rapid growth and slowdown.
These cyclical variations can appear as broad peaks and troughs that span several years in a
time series plot.
4. Random (Irregular) Component
The random or irregular component encompasses unpredictable variations in the time series
that cannot be attributed to the trend, seasonality, or cycles. These variations are often caused by
sudden, unforeseen events such as natural disasters, political turmoil, or unexpected market
disruptions. For example, a sudden spike in stock prices following the announcement of a
groundbreaking technology by a company is a random fluctuation. Unlike other components, the
random component does not exhibit a pattern and is often treated as noise in time series models.
A well-fitted model should have residuals (errors) that closely resemble white noise, with no
discernible structure. Identifying and minimizing random noise is vital for improving the accuracy of
forecasts.
• Example:
A sudden spike in sales due to a one-time celebrity endorsement or a dip in stock prices caused
by unexpected political unrest are examples of random variations. Random components are
also observed in weather patterns, like unexpected heavy rainfall disrupting normal
conditions.
• Visualization:
In a time series graph of daily stock prices, random spikes or drops occur due to market
reactions to news. These irregular points appear scattered without any recognizable pattern.
Component: Random
Scenario: Unpredictable, short-term deviations due to noise or unforeseen events
Example: Sudden spike in sales due to viral marketing or unexpected demand
1. Line Plots
2. Seasonal Decomposition
• Purpose: To break down the time series into its components: trend, seasonality, and
residuals.
• Usage: This technique helps separate the underlying components of the data, allowing for
better interpretation and forecasting.
• Types: Additive Decomposition and Multiplicative Decomposition.
o Additive Decomposition assumes that seasonality is constant across the series.
o Multiplicative Decomposition is used when seasonality grows with the trend.
• Example: Decomposing a monthly sales series into trend (overall growth), seasonality
(monthly peaks), and residuals (random fluctuations).
• Advantages: Helps in understanding the behavior of each component separately.
• Disadvantages: May be complex and requires additional assumptions.
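A hedged decomposition sketch using statsmodels' seasonal_decompose on a synthetic monthly sales series; the data and the additive choice are illustrative assumptions.

```python
# Additive decomposition of a synthetic monthly series into trend,
# seasonal, and residual components.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2018-01-01", periods=60, freq="MS")
trend = np.linspace(100, 200, 60)
seasonal = 20 * np.sin(2 * np.pi * idx.month.to_numpy() / 12)
noise = np.random.default_rng(3).normal(0, 5, 60)
sales = pd.Series(trend + seasonal + noise, index=idx)

result = seasonal_decompose(sales, model="additive", period=12)
result.plot()          # panels: observed, trend, seasonal, residual
plt.show()
```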
3. Heatmaps
4. Scatter Plots
5. Lag Plots
6. ACF and PACF Plots
• Purpose: To quantify the correlation between the series and its lagged versions.
• Usage: ACF plots show the correlation between the series and its lags, while PACF focuses
on partial correlations.
• Example: A monthly sales series might have significant ACF at multiples of 12 (indicating
seasonality) and significant PACF for the first few lags (indicating direct influences).
• Advantages: Key tools for model selection and diagnostic checks.
• Disadvantages: Interpretation requires some statistical knowledge.
7. Box Plots
In time series analysis, understanding the relationship between observations at different points in time
is crucial. Two important tools for this are the Autocorrelation Function (ACF) and the Partial
Autocorrelation Function (PACF).
What is Autocorrelation?
Autocorrelation, also known as serial correlation, measures the relationship between a time
series and a lagged version of itself over successive time intervals. Simply put, it tells you how
similar the data points are to each other at different time lags.
Example: Imagine you are tracking the temperature of a city every day. If today’s temperature is
similar to yesterday’s, and yesterday’s temperature is similar to the day before, we say that the
temperature data is autocorrelated.
How it's calculated: ACF calculates the correlation coefficient between the original series and its
lagged versions (lagged by 1, 2, 3, etc. time periods).
Interpretation:
• Positive ACF: The current value tends to be similar to past values.
• Negative ACF: The current value tends to be opposite to past values.
• Zero ACF: No significant relationship between the current value and past values.
Visual Representation: ACF is typically plotted as a graph with lags on the x-axis and correlation
coefficients on the y-axis.
Partial Autocorrelation
Partial autocorrelation measures the correlation between observations at two time points, accounting
for the values of the observations at all shorter lags. This helps isolate the direct relationship between
observations at different lags, removing the influence of intermediary observations.
Interpretation:
• Significant PACF: Indicates a direct relationship between the current value and the specific
lag.
• Insignificant PACF: Suggests that the relationship between the current value and the lag is
mostly explained by the influence of intervening lags.
Visual Representation: PACF is also plotted as a graph with lags on the x-axis and correlation
coefficients on the y-axis.
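For illustration, ACF and PACF plots can be produced with statsmodels, here reusing the synthetic sales series from the decomposition sketch above:

```python
# ACF and PACF plots; a spike near lag 12 would hint at yearly seasonality.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(sales, lags=24, ax=axes[0])
plot_pacf(sales, lags=24, ax=axes[1])   # direct correlations, shorter lags controlled for
plt.tight_layout()
plt.show()
```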
Time series preprocessing refers to the steps taken to clean, transform, and prepare time series data
for analysis or forecasting. It involves techniques aimed at improving data quality, removing noise,
handling missing values, and making the data suitable for modeling. Preprocessing tasks may include
removing outliers, handling missing values through imputation, scaling or normalizing the data,
detrending, deseasonalizing, and applying transformations to stabilize variance. The goal is to ensure
that the time series data is in a suitable format for subsequent analysis or modeling.
• Handling Missing Values : Dealing with missing values in the time series data to ensure
continuity and reliability in analysis.
• Dealing with Outliers: Identifying and addressing observations that significantly deviate
from the rest of the data, which can distort analysis results.
• Stationarity and Transformation: Ensuring that the statistical properties of the time series,
such as mean and variance, remain constant over time. Techniques like differencing,
detrending, and deseasonalizing are used to achieve stationarity.
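As a small example of the stationarity step, the sketch below runs an Augmented Dickey-Fuller test and applies first-order differencing, reusing the synthetic sales series from above; the 0.05 cutoff is a conventional assumption.

```python
# Stationarity check (ADF test) followed by differencing if needed.
from statsmodels.tsa.stattools import adfuller

stat, p_value, *_ = adfuller(sales)
print("ADF p-value:", p_value)          # a large p-value suggests non-stationarity

if p_value > 0.05:
    sales_diff = sales.diff().dropna()  # first-order differencing removes the trend
    print("ADF p-value after differencing:", adfuller(sales_diff)[1])
```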
Time Series Analysis and Decomposition is a systematic approach to studying sequential data
collected over successive time intervals. It involves analyzing the data to understand its underlying
patterns, trends, and seasonal variations, as well as decomposing the time series into its fundamental
components. This decomposition typically includes identifying and isolating elements such as trend,
seasonality, and residual (error) components within the data.
2. Partial Autocorrelation Functions (PACF): PACF measures the correlation between a time
series and its lagged values, controlling for intermediate lags, aiding in identifying direct
relationships between variables.
3. Trend Analysis: The process of identifying and analyzing the long-term movement or
directionality of a time series. Trends can be linear, exponential, or nonlinear and are crucial
for understanding underlying patterns and making forecasts.
7. Seasonal and Trend decomposition using Loess: STL decomposes a time series into three
components: seasonal, trend, and residual. This decomposition enables modeling and
forecasting each component separately, simplifying the forecasting process.
8. Rolling Correlation: Rolling correlation calculates the correlation coefficient between two
time series over a rolling window of observations, capturing changes in the relationship
between variables over time.
10. Box-Jenkins Method: Box-Jenkins Method is a systematic approach for analyzing and
modeling time series data. It involves identifying the appropriate autoregressive integrated
moving average (ARIMA) model parameters, estimating the model, diagnosing its adequacy
through residual analysis, and selecting the best-fitting model.
11. Granger Causality Analysis: Granger causality analysis determines whether one time series
can predict future values of another time series. It helps infer causal relationships between
variables in time series data, providing insights into the direction of influence.
5. SARIMAX: Extension of SARIMA that incorporates exogenous variables for seasonal time
series forecasting.
7. Theta Method: A simple and intuitive forecasting technique based on extrapolation and trend
fitting.
10. Generalized Additive Models (GAM): A flexible modeling approach that combines additive
components, allowing for nonlinear relationships and interactions.
11. Random Forests: Random Forests is a machine learning ensemble method that constructs
multiple decision trees during training and outputs the average prediction of the individual
trees. It can handle complex relationships and interactions in the data, making it effective for
time series forecasting.
12. Gradient Boosting Machines (GBM): GBM is another ensemble learning technique that
builds multiple decision trees sequentially, where each tree corrects the errors of the previous
one. It excels in capturing nonlinear relationships and is robust against overfitting.
13. State Space Models: State space models represent a time series as a combination of
unobserved (hidden) states and observed measurements. These models capture both the
deterministic and stochastic components of the time series, making them suitable for
forecasting and anomaly detection.
14. Dynamic Linear Models (DLMs): DLMs are Bayesian state-space models that represent
time series data as a combination of latent state variables and observations. They are flexible
models capable of incorporating various trends, seasonality, and other dynamic patterns in the
data.
15. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)
Networks: RNNs and LSTMs are deep learning architectures designed to handle sequential
data. They can capture complex temporal dependencies in time series data, making them
powerful tools for forecasting tasks, especially when dealing with large-scale and high-
dimensional data.
16. Hidden Markov Model (HMM): A Hidden Markov Model (HMM) is a statistical model
used to describe sequences of observable events generated by underlying hidden states. In
time series, HMMs infer hidden states from observed data, capturing dependencies and
transitions between states. They are valuable for tasks like speech recognition, gesture
analysis, and anomaly detection, providing a framework to model complex sequential data
and extract meaningful patterns from it.
The AR part of ARIMA shows that the time series is regressed on its own past data. The MA
part of ARIMA indicates that the forecast error is a linear combination of past respective errors.
The I part of ARIMA shows that the data values have been replaced with d-th order differenced
values to obtain stationary data, which the ARIMA modeling approach requires.
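A minimal fitting sketch with statsmodels, again on the synthetic sales series; the (1, 1, 1) order is an illustrative assumption that would normally be chosen from ACF/PACF plots or information criteria.

```python
# Fit ARIMA(p=1, d=1, q=1) and forecast the next 6 months.
from statsmodels.tsa.arima.model import ARIMA

arima = ARIMA(sales, order=(1, 1, 1))   # AR(1), first differencing, MA(1)
fitted = arima.fit()
print(fitted.summary())
print(fitted.forecast(steps=6))
```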
Components of ARIMA
Strengths
1. Captures Temporal Patterns:
o Effective for identifying trends, seasonality, and cyclic behaviors in data.
2. Predictive Power:
o Provides robust models (e.g., ARIMA, LSTM) for short- to medium-term forecasting.
3. Data-Driven:
o Utilizes historical data without requiring external inputs or assumptions.
4. Multiple Methods:
o Offers diverse techniques:
▪ Traditional models (ARIMA, ETS) for stationary or seasonal data.
▪ Machine learning and deep learning models for complex relationships.
5. Supports Univariate and Multivariate Analysis:
o Univariate models focus on a single variable (e.g., ARIMA).
o Multivariate models like VAR include interdependent time series.
6. Flexibility:
o Can handle diverse data types (e.g., financial, sales, climate).
7. Real-World Applications:
o Widely used in stock price prediction, demand forecasting, and weather modeling.
8. Diagnostic Tools:
o ACF, PACF plots, and residual analysis help refine and validate models.
Limitations
1. Stationarity Requirements:
o Many models (e.g., ARIMA) require stationary data, which might need preprocessing
(e.g., differencing, transformations).
2. Short-Term Focus:
o Most methods perform well for short- to medium-term forecasting but degrade for
long-term predictions.
3. Data Dependency:
o Requires a substantial amount of high-quality historical data.
o Missing or noisy data can reduce accuracy.
4. Handling Complexity:
o Simple models may struggle with non-linear, high-dimensional, or complex seasonal
patterns.
o Advanced models like deep learning require significant computational resources.
5. Assumption Limitations:
o Many models assume linear relationships and constant variance, which may not hold
in real-world data.
6. Seasonality and External Factors:
o Traditional models struggle with irregular seasonality or external variables (e.g.,
weather, holidays).
o Requires specialized models (e.g., SARIMA, Prophet).
7. Overfitting Risk:
o Over-tuning parameters can lead to poor generalization.
8. Interpretability Issues:
o Machine learning models (e.g., neural networks) can be hard to interpret compared to
statistical methods.
9. Not Adaptable to Sudden Changes:
o Models fail to account for unexpected shifts (e.g., economic crises, pandemics)
without external inputs.
Classification Trees
Definition
A classification tree is a supervised machine learning algorithm used to categorize data into distinct
classes. It works by recursively splitting the dataset into smaller subsets based on the values of input
features. Each split is determined using a metric that maximizes the homogeneity of the resulting
groups. The process continues until a stopping condition is met (e.g., a maximum depth or a minimum
number of samples per node). The final output is a tree structure where the branches represent
decision rules, and the leaf nodes represent the predicted categories.
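A compact classification-tree sketch with scikit-learn; the iris dataset and depth limit are illustrative choices, not from the notes.

```python
# Classification tree: recursive splits chosen by the Gini criterion.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```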
3. Chi-Square
• Measures the independence of classes and features.
• High chi-square values suggest a significant relationship.
• Formula: χ² = Σ (Observed − Expected)² / Expected, summed over the feature-class cells.
Steps:
1. Start with the Entire Dataset:
o Begin at the root node, representing the full dataset.
o Determine which feature and threshold best split the data into more homogeneous
groups (i.e., groups that predominantly belong to one class).
2. Select a Splitting Criterion:
o Use a metric like Gini Index or Entropy to evaluate the quality of each possible split.
o Choose the feature and threshold that maximize the homogeneity of resulting subsets.
3. Create Branches:
o Split the dataset into subsets based on the chosen feature and threshold.
o Each subset forms a branch leading to a new decision node or leaf node.
4. Repeat Recursively:
o Apply the same splitting process to each subset.
o Continue until a stopping condition is met (e.g., maximum depth, minimum number
of samples per node, or perfect classification).
5. Assign Class Labels:
o When no further splitting is possible, assign the majority class of the subset to the leaf
node.
Limitations
1. Overfitting:
o Prone to creating overly complex models that do not generalize well.
2. Instability:
o Small changes in the dataset can result in a completely different tree structure.
3. Bias in Imbalanced Datasets:
o Trees may favor the majority class without proper balancing.
4. Lack of Smooth Predictions:
o Splits create step-like decision boundaries, which may not fit the problem's natural
curve.
5. Scalability:
o Computationally expensive for very large datasets due to recursive splitting.
Logistic Regression
Definition
Logistic Regression is a statistical and machine learning technique used for modeling the
probability of a binary outcome (yes/no, true/false, 0/1).
It predicts the likelihood of an event occurring by fitting data to a logistic function (sigmoid
curve), which maps inputs to probabilities between 0 and 1.
σ(z) = 1 / (1 + e^(−z))
Where:
• σ(z) is the sigmoid function output (a probability between 0 and 1).
• e is the base of the natural logarithm (approximately 2.718).
• z is the input to the sigmoid function, typically a linear combination of the input features in
logistic regression (i.e., z=β0+β1X1+β2X2+...+βnXn).
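For illustration, the sketch below defines the sigmoid explicitly and fits scikit-learn's LogisticRegression on a synthetic binary dataset; the data and settings are assumptions.

```python
# Logistic regression: a linear combination z is squashed to a probability.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map z = b0 + b1*x1 + ... + bn*xn to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

z = clf.intercept_ + X[:5] @ clf.coef_.ravel()   # linear combination for 5 samples
print(sigmoid(z))                                # matches clf.predict_proba(X[:5])[:, 1]
```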
1. Confusion Matrix
A confusion matrix is a table that summarizes the model's predictions against the true labels. It
consists of four components:
Predicted: Positive (1) Predicted: Negative (0)
Actual: Positive (1) True Positive (TP) False Negative (FN)
Actual: Negative (0) False Positive (FP) True Negative (TN)
Terms:
• True Positive (TP): Correctly predicted positive instances.
• True Negative (TN): Correctly predicted negative instances.
• False Positive (FP): Incorrectly predicted positive (type I error).
• False Negative (FN): Incorrectly predicted negative (type II error).
Purpose:
The confusion matrix provides a detailed breakdown of classification results, useful for calculating
other metrics
2. Accuracy
Accuracy is the proportion of correctly classified instances out of the total instances.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
Use Case:
• Suitable when the dataset is balanced (equal representation of classes).
• Not ideal for imbalanced datasets, as it might be misleading (e.g., predicting the majority class
always could yield high accuracy).
3. Precision
Precision (also called Positive Predictive Value) measures the proportion of correctly predicted
positive instances out of all predicted positive instances.
Formula: Precision = TP / (TP + FP)
Use Case:
• Important when false positives are costly (e.g., in spam detection, a false positive might
classify an important email as spam).
• High precision means fewer false positives.
4. Recall
Recall (also called Sensitivity or True Positive Rate) measures the proportion of actual positive
instances that were correctly predicted.
Formula: Recall = TP / (TP + FN)
Use Case:
• Important when false negatives are costly (e.g., in medical diagnosis, missing a disease case
can have severe consequences).
• High recall means fewer false negatives.
5. F1-Score
F1-Score is the harmonic mean of precision and recall. It balances the two metrics and is useful when
there’s an uneven class distribution.
Formula: F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Use Case:
• Used when you want to balance precision and recall.
• Particularly useful in imbalanced datasets where both metrics need equal importance.
Balanced Dataset:
• Accuracy is a reliable metric when the dataset has a similar number of positive and negative
instances.
Imbalanced Dataset:
• Focus on precision, recall, and F1-score:
o High Precision: Useful when false positives are costly.
o High Recall: Useful when false negatives are costly.
o F1-Score: Balances both precision and recall.
Threshold Tuning:
• Adjust the decision threshold (default P>0.5) to optimize metrics based on the application.
For instance:
o Lower the threshold for higher recall (reduce false negatives).
o Raise the threshold for higher precision (reduce false positives).
Example
Suppose the problem is to predict whether a patient has a disease (y=1) or not (y=0):
1. High Recall: Important to catch all diseased patients (minimize missed diagnoses).
2. High Precision: Important to avoid alarming healthy patients with a false diagnosis.
3. F1-Score: Useful if both errors (false positives and false negatives) are equally important.
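These metrics and the threshold adjustment can be computed with scikit-learn as sketched below, continuing from the logistic-regression example; the 0.3 threshold is an arbitrary illustration.

```python
# Confusion matrix and metrics at the default and a lowered threshold.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_prob = clf.predict_proba(X)[:, 1]

for threshold in (0.5, 0.3):                    # lower threshold -> higher recall
    y_pred = (y_prob >= threshold).astype(int)
    print(f"threshold={threshold}")
    print(" confusion matrix:\n", confusion_matrix(y, y_pred))
    print(" accuracy :", accuracy_score(y, y_pred))
    print(" precision:", precision_score(y, y_pred))
    print(" recall   :", recall_score(y, y_pred))
    print(" F1-score :", f1_score(y, y_pred))
```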
θ0+θ1x1+θ2x2+…+θnxn=0
A hyperplane acts as a separator. The points lying on two different sides of the hyperplane will make
up two different groups.
Basic idea of support vector machines is to find out the optimal hyperplane for linearly separable
patterns. A natural choice of separating hyperplane is optimal margin hyperplane (also known as
optimal separating hyperplane) which is farthest from the observations. The perpendicular distance
from each observation to a given separating hyperplane is computed. The smallest of all those
distances is a measure of how close the hyperplane is to the group of observations. This
minimum distance is known as the margin. The operation of the SVM algorithm is based on
finding the hyperplane that gives the largest minimum distance to the training examples, i.e. to
find the maximum margin. This is known as the maximal margin classifier.
Note that the maximal margin hyperplane depends directly only on these support vectors.
If any of the other points move, the maximal margin hyperplane does not change unless the
movement affects the boundary conditions or the support vectors. The support vectors are the
most difficult to classify and give the most information regarding classification. Since the support
vectors lie on or closest to the decision boundary, they are the most essential or critical data points
in the training set.
When Data is NOT Linearly Separable
SVM is quite intuitive when the data is linearly separable. However, when they are not, as shown in
the diagram below, SVM can be extended to perform well.
There are two main steps for nonlinear generalization of SVM. The first step involves
transformation of the original training (input) data into a higher dimensional data using a nonlinear
mapping. Once the data is transformed into the new higher dimension, the second step involves
finding a linear separating hyperplane in the new space. The maximal marginal hyperplane found in
the new space corresponds to a nonlinear separating hypersurface in the original space.
Kernel Functions
Handling nonlinear transformation of input data into higher dimension may not be easy. There may
be many options available to begin with and the procedures may be computationally heavy also. To
avoid some of those problems, the concept of Kernel functions is introduced.
It so happens that in solving the quadratic optimization problem of the linear SVM, the training data
points contribute through inner products of nonlinear transformations. The inner product of two
n-dimensional vectors a = (a1, …, an) and b = (b1, …, bn) is defined as
a · b = a1b1 + a2b2 + ⋯ + anbn
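As a hedged illustration of the linear vs. kernel distinction, the sketch below compares a linear SVM with an RBF-kernel SVM on data that is not linearly separable; the make_moons dataset and the C value are illustrative assumptions.

```python
# Linear vs. RBF-kernel SVM on non-linearly-separable data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    svm = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    print(kernel, "accuracy:", round(svm.score(X_test, y_test), 3),
          "| support vectors:", len(svm.support_vectors_))
```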