Data Analytics Unit IV
Object Segmentation
Object segmentation in data analytics involves dividing data into meaningful groups or
segments. Each segment shares common characteristics, which helps in analyzing patterns or
behaviors within a dataset. Object segmentation is widely used in various domains such as
marketing, finance, and healthcare to group data points that are similar, making it easier to
interpret and predict trends.
Diagram: A typical diagram illustrating the difference between regression and segmentation
shows a continuous line representing predictions in regression, while in segmentation the data
points are grouped into distinct clusters.
Example: In retail, regression might be used to predict sales based on historical sales data, while
segmentation can help group customers into categories (e.g., high-spenders, occasional buyers)
based on purchasing behavior.
Supervised Learning:
● In supervised learning, the model is trained on labeled data, where both the input and the
corresponding output are provided.
● The primary goal is to learn the mapping function that relates input to output.
Examples:
○ Predicting house prices based on features like area, number of rooms, and
location.
○ Email spam classification.
● Techniques:
○ Regression: Predict continuous values, e.g., predicting stock prices.
○ Classification: Predict discrete categories, e.g., classifying emails as spam or
non-spam.
○ Linear Regression: For continuous target variables.
○ Logistic Regression: For binary classification tasks.
● Key Features:
○ Requires labeled data for training.
○ The output can be continuous (regression) or categorical (classification).
● Applications:
○ Predictive Analytics: Forecasting sales, predicting customer churn.
○ Classification Problems: Identifying whether an email is spam or not.
Example: Predicting housing prices based on features like area, location, and the number of
bedrooms.
The model learns from this data to predict prices for new housing data.
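As a minimal illustration of this kind of supervised regression, the sketch below uses scikit-learn; the feature values and prices are invented purely for demonstration.

```python
# A minimal sketch of supervised regression (house price prediction) with scikit-learn.
# The feature values and prices below are made up purely for illustration.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Each row: [area (sq. ft), number of bedrooms]; target: price
X = [[1000, 2], [1500, 3], [1800, 3], [2400, 4], [3000, 4], [3500, 5]]
y = [150_000, 200_000, 230_000, 310_000, 380_000, 450_000]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)         # learn the mapping from features to price

print(model.predict([[2000, 3]]))   # predicted price for a new house
```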
Advantages:
Challenges:
Unsupervised Learning:
● In unsupervised learning, the model is trained on data without labeled outputs.
● It identifies patterns and relationships in the data.
● The goal is to identify underlying patterns, structures, or clusters within the data.
Examples:
○ Customer segmentation for marketing.
○ Identifying fraudulent transactions in financial data.
● Techniques:
○ Clustering: Grouping similar data points, e.g., K-means, hierarchical clustering.
○ Dimensionality Reduction: Reducing the number of features, e.g., PCA
(Principal Component Analysis).
● Key Features:
○ Does not rely on labeled outputs.
○ Focuses on exploring the dataset's hidden structures.
● Applications:
○ Customer Segmentation: Grouping customers based on purchasing behavior.
○ Anomaly Detection: Identifying fraudulent credit card transactions.
Example: Clustering shopping data to group customers based on their buying patterns.
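A small sketch of how such clustering might be done with K-means (scikit-learn) is shown below; the spending figures and the two-cluster choice are assumptions for illustration only.

```python
# A minimal sketch of unsupervised segmentation with K-means (scikit-learn).
# The customer spending figures are invented for illustration only.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [annual spend, number of purchases] for one customer
X = np.array([
    [200, 5], [250, 6], [300, 8],        # occasional buyers
    [2000, 40], [2200, 45], [2500, 50],  # high-spenders
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centre of each segment
```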
Advantages:
Challenges:
Example (supervised learning): A financial institution categorizes loan applicants as "low risk" or "high risk" based on
their credit history and income. The algorithm is trained on labeled data to classify new
applicants accordingly.
Unsupervised Learning: Here, the model is trained on unlabeled data, and it automatically
identifies patterns within the data. Clustering algorithms like K-means and DBSCAN are
commonly used for unsupervised segmentation.
Example: In marketing, clustering algorithms are applied to identify different customer groups
based on purchasing habits without predefined categories, enabling targeted marketing strategies.
Comparison Table:

| Aspect | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Training data | Labeled (input and output provided) | Unlabeled (input only) |
| Goal | Learn a mapping from input to output | Discover hidden patterns, structures, or clusters |
| Techniques | Regression, classification | Clustering, dimensionality reduction |
| Examples | House price prediction, spam classification | Customer segmentation, anomaly detection |
Tree Building
Tree-building algorithms are widely used in supervised learning for both regression and
classification tasks.
Decision Trees
1. Root Node: Represents the entire dataset and the initial decision to be made.
2. Internal Nodes: Represent decisions or tests on attributes. Each internal node has one
or more branches.
3. Branches: Represent the outcome of a decision or test, leading to another node.
4. Leaf Nodes / Terminal Nodes: Represent the final decision or prediction. No further
splits are made at these nodes.
Building a Decision Tree:
1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or
information gain, the attribute that best splits the data is chosen.
2. Splitting the Dataset: The dataset is divided into subsets according to the values of
the selected attribute.
3. Repeating the Process: The process is repeated recursively for each subset, creating
a new internal node or leaf node until a stopping criterion is met (e.g., all instances in
a node belong to the same class or a predefined depth is reached).
● Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is
split on an attribute.
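The short sketch below illustrates how entropy and information gain can be computed for a candidate split; the toy spam/ham labels are hypothetical.

```python
# A small sketch of how entropy and information gain can be computed for one split.
# The toy spam/ham labels are hypothetical.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["spam"] * 5 + ["ham"] * 5                           # entropy = 1.0
children = [["spam"] * 4 + ["ham"], ["ham"] * 4 + ["spam"]]   # a candidate split
print(information_gain(parent, children))                     # reduction in entropy
```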
Advantages of Decision Trees:
● Simplicity and Interpretability: Decision trees are easy to understand and interpret.
The visual representation closely mirrors human decision-making processes.
● Versatility: Can be used for both classification and regression tasks.
● No Need for Feature Scaling: Decision trees do not require normalization or scaling
of the data.
● Handles Non-linear Relationships: Capable of capturing non-linear relationships
between features and target variables.
Disadvantages of Decision Trees:
● Overfitting: Decision trees can easily overfit the training data, especially if they are
deep with many nodes.
● Instability: Small variations in the data can result in a completely different tree being
generated.
● Bias towards Features with More Levels: Features with more levels can dominate
the tree structure.
Key Concepts:
● Decision Trees:
○ A tree-like structure where each internal node represents a feature, each branch
represents a decision rule, and each leaf node represents an output.
○ Algorithms: CART (Classification and Regression Trees), ID3.
● Regression Trees:
○ Used for predicting continuous values.
○ Example: Predicting the sales of a product based on pricing and advertising.
● Classification Trees:
○ Used for predicting discrete categories.
○ Example: Determining whether a customer will buy a product based on age and
income.
Challenges:
● Overfitting: A tree model that is too complex performs well on training data but
poorly on unseen data.
○ Solution: Use techniques like pruning (removing unnecessary nodes), limiting
tree depth, or ensemble methods like Random Forest.
● Pruning:
○ Reduces tree size by removing sections of the tree that provide little predictive
power.
○ Pre-pruning: Stops tree growth early by limiting depth or the number of nodes.
○ Post-pruning: Simplifies a fully grown tree by removing redundant nodes.
● Random Forest: Builds multiple decision trees and aggregates their outputs for better
accuracy and robustness.
● Gradient Boosting Machines (GBM): Combines weak learners (small trees) iteratively
to improve overall performance.
Applications:
Decision Trees: Decision trees are widely used in segmentation, especially when the data has a
clear hierarchy. A decision tree divides the data based on the value of specific features
(variables), making a series of splits that result in a tree-like structure.
Regression Trees: Used when the outcome is continuous. For instance, predicting the price of a
house based on features like size and location.
Classification Trees: Used when the outcome is categorical. For example, classifying emails as
"spam" or "not spam."
Overfitting: A decision tree can become too complex by adding too many branches that fit the
training data very well but don’t generalize to new data. Overfitting makes the model less
effective in real-world scenarios.
Complexity: Large trees may become very complex and less interpretable. Simplifying the trees
with pruning can help make them easier to understand.
Pruning: To address overfitting, pruning techniques are applied. Pruning reduces the size of the
tree by removing sections that provide little predictive power, thereby lowering complexity and
making the model more generalizable. There are two main types of pruning:
● Pre-pruning (Early Stopping): Stops the tree from growing once it meets certain
criteria (e.g., maximum depth, minimum number of samples per leaf).
● Post-pruning: Removes branches from a fully grown tree that do not provide
significant power.
Example: In a medical diagnosis decision tree, some branches might only apply to specific cases
and are not representative of general patterns. Pruning these branches improves the model’s
accuracy on unseen data.
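As an illustration, the sketch below shows one way pre-pruning and post-pruning might be applied with scikit-learn's DecisionTreeClassifier; the breast-cancer dataset and the parameter values (max_depth, min_samples_leaf, ccp_alpha) are arbitrary choices for demonstration, not tuned settings.

```python
# A minimal sketch of pre-pruning and post-pruning with scikit-learn.
# Dataset and parameter values are illustrative, not tuned.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early via max_depth / min_samples_leaf
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the tree, then simplify it with cost-complexity pruning (ccp_alpha)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned accuracy :", pre_pruned.score(X_test, y_test))
print("post-pruned accuracy:", post_pruned.score(X_test, y_test))
```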
Applications of Decision Trees
Multiple Decision Trees
Multiple Decision Trees involve leveraging multiple decision tree algorithms to improve
prediction accuracy and reduce the risk of overfitting. These techniques are commonly used in
ensemble learning methods, where a group of trees collaborates to make better predictions than a
single tree.
Limitations of a single decision tree:
● Overfitting: Single trees may fit the training data too closely, resulting in poor
generalization to unseen data.
● Bias and Variance: A single tree may have high variance or high bias, depending on its
configuration.
● Stability: Small changes in the dataset can lead to significantly different trees.
1) Random Forest
● Concept:
○ Builds multiple decision trees on different subsets of the dataset (created through
bootstrapping) and features (randomly selected for each split).
○ The final prediction is obtained by:
■ Regression: Taking the average of predictions from all trees.
■ Classification: Using majority voting among the trees.
● Key Features:
○ Reduces overfitting by averaging multiple trees.
○ Handles missing data well.
○ Can measure feature importance.
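A minimal Random Forest sketch with scikit-learn is shown below; the iris dataset and the number of estimators are illustrative assumptions.

```python
# A minimal Random Forest sketch with scikit-learn; data and settings are illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)          # each tree is trained on a bootstrap sample

print(forest.score(X_test, y_test))   # majority vote across the trees
print(forest.feature_importances_)    # built-in feature importance
```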
2) Gradient Boosting
● Concept:
○ Builds trees sequentially, where each tree attempts to correct the errors of the
previous trees.
○ A loss function (e.g., Mean Squared Error) guides how trees are built.
○ Models like XGBoost, LightGBM, and CatBoost are popular implementations.
● Key Features:
○ High predictive accuracy.
○ Can handle both regression and classification tasks.
○ Requires careful tuning of hyperparameters (e.g., learning rate, number of trees).
● Example: Predicting product sales:
○ The first tree predicts 100, but the actual value is 120.
○ The second tree tries to predict the residual (20).
○ Final prediction is the sum of predictions from all trees.
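The toy sketch below mimics this residual-fitting idea with shallow regression trees; the synthetic sales data, learning rate, and tree depth are assumptions chosen only to show the mechanism, not a production gradient boosting implementation.

```python
# A tiny sketch of the boosting idea: each new tree fits the residuals of the
# current prediction. The data and settings are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(10).reshape(-1, 1)                                      # e.g. week number
y = 100 + 2 * X.ravel() + np.random.RandomState(0).normal(0, 3, 10)  # synthetic sales

prediction = np.zeros_like(y)
learning_rate = 0.5

for step in range(3):                               # three boosting rounds
    residuals = y - prediction                      # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # add a scaled correction
    print(f"round {step}: mean abs residual = {np.abs(y - prediction).mean():.2f}")
```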
3) Bagging (Bootstrap Aggregating)
● Concept:
○ Trains multiple decision trees on different random subsets of the training data
(using bootstrapping).
○ Combines predictions by averaging (for regression) or voting (for classification).
● Key Features:
○ Reduces variance and avoids overfitting.
○ Often used as a base for Random Forest.
● Example: Predicting stock prices:
○ Each tree is trained on a different subset of the dataset.
○ Predictions are averaged to produce the final output.
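A minimal bagging sketch with scikit-learn's BaggingRegressor is given below; the noisy synthetic data and the number of estimators are illustrative assumptions.

```python
# A minimal bagging sketch with scikit-learn; the data is synthetic.
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 200)   # noisy target

# Default base learner is a decision tree; each tree sees a bootstrap sample.
bagging = BaggingRegressor(n_estimators=50, random_state=0).fit(X, y)

print(bagging.predict([[5.0]]))   # average of the 50 trees' predictions
```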
4) Extra Trees (Extremely Randomized Trees)
● Concept:
○ A variant of Random Forest where splits are made randomly, rather than choosing
the best split.
○ Uses the entire dataset (no bootstrapping).
● Key Features:
○ Faster than Random Forest.
○ Adds additional randomness to reduce overfitting.
5) AdaBoost (Adaptive Boosting)
● Concept:
○ Focuses on improving weak learners (e.g., shallow decision trees).
○ Adjusts the weights of incorrectly predicted samples, so subsequent trees focus
more on them.
● Key Features:
○ Works well with imbalanced data.
○ Sensitive to outliers.
● Example: Classifying fraudulent transactions:
○ The first tree classifies 90% of the data correctly.
○ The second tree focuses on the 10% of misclassified cases, and so on.
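The sketch below shows how AdaBoost might be applied to an imbalanced classification problem with scikit-learn; the synthetic dataset stands in for transaction data and is not real fraud data.

```python
# A minimal AdaBoost sketch with scikit-learn; the imbalanced synthetic dataset is a
# stand-in for a fraud-detection problem, not real transaction data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each round re-weights the misclassified samples so the next weak learner
# (a shallow tree by default) focuses more on them.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(ada.score(X_test, y_test))
```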
Applications
Time Series Analysis
Time series analysis deals with data points collected or recorded at specific time intervals, in
sequential order. Such data can reveal trends, cycles, and seasonal patterns, and the goal is to
understand these patterns in order to forecast future values. Time series analysis is crucial in
fields like finance, retail, and meteorology, where predicting future values from historical
patterns is valuable.
Key Concepts:
● Trend: the long-term increase or decrease in the data.
● Seasonality: patterns that repeat at regular intervals (daily, monthly, yearly).
● Stationarity: statistical properties (mean, variance) that do not change over time.
Techniques:
● ARIMA models, STL decomposition, and machine learning methods (e.g., RNNs/LSTMs).
Applications:
● Weather prediction.
● Sales forecasting.
● Anomaly detection in IoT data.
The ARIMA Model
ARIMA is one of the most widely used statistical models for time series forecasting. It
combines three main components—Auto-Regression (AR), Integration (I), and Moving
Average (MA)—to predict future values based on past observations.
AR (Auto-Regressive): Uses past values to predict future values. For example, the sales on a
particular day could depend on the sales from previous days.
I (Integrated): Differencing the data to make it stationary. Stationarity means that the data’s
statistical properties (mean, variance) are consistent over time.
MA (Moving Average): Incorporates the dependency between an observation and residual
errors from previous observations.
Components of ARIMA
1. Auto-Regression (AR):
● Refers to a model that uses the relationship between a variable and its past values.
● Example: Predicting the current sales of a product based on sales in the previous
months.
● Represented as p: The number of lagged observations to include in the model.
2. Integration (I):
● Involves differencing the data to make it stationary (removing trends or
seasonality).
● Represented as d: The number of differencing operations required.
● Example: If sales consistently increase by 10 units every month, differencing will
subtract one month’s sales from the next to stabilize the trend.
3. Moving Average (MA):
● Models the dependency between an observation and the residual errors from
previous forecasts.
● Represented as q: The size of the moving average window.
Example: Suppose a retailer wants to forecast monthly sales for the next year. Using ARIMA,
the model would learn from monthly sales data over the past few years, capturing trends and
seasonality, to predict future sales values.
Parameters of ARIMA
Each component in ARIMA functions as a parameter with a standard notation. For ARIMA
models, a standard notation would be ARIMA with p, d, and q, where integer values substitute
for the parameters to indicate the type of ARIMA model used. The parameters can be defined as:
● p: the number of lag observations in the model, also known as the lag order.
● d: the number of times the raw observations are differenced; also known as the degree of
differencing.
● q: the size of the moving average window, also known as the order of the moving
average.
Given these parameters, a regression model is constructed with the specified number and type of
terms. A value of zero (0) for a parameter means that the corresponding component is not used,
so an ARIMA model can be configured to behave like an ARMA model, or even a simple AR, I,
or MA model.
Steps in Building an ARIMA Model
1. Check Stationarity:
○ Stationarity means that the statistical properties (mean, variance) of the time
series do not change over time.
○ Test whether the data is stationary (for example, with a statistical test such as the
Augmented Dickey-Fuller test).
○ If it is not stationary, apply differencing until the series becomes stationary.
2. Identify Parameters (p, d, q):
○ Use Autocorrelation Function (ACF) and Partial Autocorrelation Function
(PACF) plots to identify the values for p and q.
○ The differencing order d is determined by the number of times differencing was
applied.
3. Fit the Model:
○ Use the chosen p, d, and q values to fit the ARIMA model.
4. Validate the Model:
○ Check residual errors for randomness (using residual plots and statistical tests).
○ If residuals are not random, refine the model parameters.
5. Forecast:
○ Use the fitted ARIMA model to predict future values.
ARIMA Equation: The general ARIMA(p, d, q) model combines the AR, I, and MA components.
After differencing the series d times to obtain y′_t, the model can be written as:
y′_t = c + φ₁y′_{t−1} + … + φ_p y′_{t−p} + θ₁ε_{t−1} + … + θ_q ε_{t−q} + ε_t
where c is a constant, the φ terms are the auto-regressive coefficients, the θ terms are the
moving-average coefficients, and ε_t is the error term.
Applications of ARIMA
● Sales and demand forecasting in retail.
● Stock market and financial forecasting.
● Weather and temperature prediction.
Worked Example: Forecasting Daily Sales
Let’s assume you want to predict the daily sales of a product for the next week. You have daily
sales data for the past year, which shows both trend and seasonal patterns.
1. Step 1: Data Collection You collect daily sales data from your e-commerce platform for
the past year.
2. Step 2: Make the Data Stationary Before applying ARIMA, you check whether the
data is stationary. If not, you apply differencing to remove trends and seasonality.
3. Step 3: Choose ARIMA Model Parameters (p, d, q) You choose the order of the AR
(p), differencing (d), and MA (q) parts using statistical techniques like the ACF
(Auto-Correlation Function) and PACF (Partial Auto-Correlation Function).
4. Step 4: Train the Model You fit the ARIMA model using historical data.
5. Step 5: Make Predictions Once the model is trained, you can use it to make predictions
for the next 7 days.
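A minimal sketch of these steps using the statsmodels library is shown below; the synthetic daily sales series and the order (1, 1, 1) are placeholder assumptions, and in practice p, d, and q would be chosen from ACF/PACF plots.

```python
# A minimal ARIMA sketch with statsmodels; the sales series is synthetic and the
# order (p, d, q) = (1, 1, 1) is only an illustrative starting point.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

rng = np.random.RandomState(0)
dates = pd.date_range("2023-01-01", periods=365, freq="D")
sales = pd.Series(100 + np.arange(365) * 0.5 + rng.normal(0, 5, 365), index=dates)

# Step 2: check stationarity (Augmented Dickey-Fuller test); a large p-value
# suggests the series is non-stationary and needs differencing (d >= 1).
print("ADF p-value:", adfuller(sales)[1])

# Steps 3-4: fit ARIMA(p=1, d=1, q=1) on the historical data
model = ARIMA(sales, order=(1, 1, 1)).fit()

# Step 5: forecast the next 7 days
print(model.forecast(steps=7))
```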
Measures of Forecast Accuracy
Forecast accuracy is crucial in evaluating the performance of predictive models. It ensures that
forecasts generated by models align closely with actual values. The measures of forecast
accuracy help quantify the error between predicted and observed values, enabling analysts to
choose or improve forecasting methods.
Forecast error is the difference between the actual value and the forecast (error = actual −
forecast). It can be:
a) Positive Error: the forecast is lower than the actual value (underestimation).
b) Negative Error: the forecast is higher than the actual value (overestimation).
1) Mean Absolute Error (MAE): Measures the average of the absolute errors between
predicted and actual values.
● Definition: The average of the absolute differences between observed and predicted
values.
● Characteristics:
○ Simple to calculate and interpret.
○ Treats all errors equally, irrespective of their magnitude.
2) Mean Squared Error (MSE): Measures the average of the squared errors between predicted
and actual values, emphasizing larger errors.
● Definition: The average of the squared differences between observed and predicted
values.
● Characteristics:
○ Penalizes larger errors more heavily.
○ Sensitive to outliers.
3) Root Mean Squared Error (RMSE): The square root of MSE, giving an indication of the
model’s prediction error in the original units.
● Definition: The square root of the MSE, providing error in the same units as the data.
● Characteristics:
○ Combines the advantages of MSE but is interpretable in the original scale of the
data.
Use Case: In weather forecasting, RMSE is commonly used to measure the accuracy of
temperature predictions.
4) Mean Absolute Percentage Error (MAPE):
● Definition: Measures the average percentage error between observed and predicted
values.
● Characteristics:
○ Expresses errors as percentages, making it scale-independent.
○ May give misleading results if actual values are close to zero.
5) Symmetric Mean Absolute Percentage Error (sMAPE):
● Characteristics:
○ Addresses the issue of zero or near-zero actual values.
○ Useful for more balanced percentage error calculations.
6) Mean Forecast Error (MFE):
● Definition: Measures the average error between observed and predicted values (signed).
● Characteristics:
○ Indicates bias in forecasts (negative MFE shows overestimation, positive MFE
shows underestimation).
7) Tracking Signal
● Definition: Monitors the consistency of forecast errors over time.
● Use:
○ Helps detect bias or systematic error in forecasts.
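The short NumPy sketch below computes the measures described above on a small set of made-up actual and forecast values.

```python
# A small sketch computing the forecast accuracy measures above with NumPy.
# The actual/forecast values are made-up numbers for illustration.
import numpy as np

actual   = np.array([120, 130, 150, 160, 155])
forecast = np.array([110, 135, 140, 165, 150])

error = actual - forecast                      # positive = underestimate, negative = overestimate

mae  = np.mean(np.abs(error))                  # Mean Absolute Error
mse  = np.mean(error ** 2)                     # Mean Squared Error
rmse = np.sqrt(mse)                            # Root Mean Squared Error
mape = np.mean(np.abs(error / actual)) * 100   # Mean Absolute Percentage Error
mfe  = np.mean(error)                          # Mean Forecast Error (bias)
tracking_signal = np.sum(error) / mae          # cumulative error divided by MAE

print(mae, mse, rmse, mape, mfe, tracking_signal)
```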
Practical Examples
Other Applications:
● Weather forecasting.
● Stock market predictions.
● Predictive maintenance in manufacturing.
STL Decomposition (Seasonal-Trend decomposition using Loess)
STL is a robust method used to decompose a time series into three components:
1. Seasonal Component: Represents the periodic patterns (e.g., weekly, monthly, or yearly
cycles).
2. Trend Component: Represents the long-term movement in the data (e.g., increasing
sales over years).
3. Residual (Remainder) Component: Represents the irregular or random variations in the
data.
Example: A retail company can use STL to decompose monthly sales data. This allows them to
separate seasonal effects (e.g., holiday sales boosts) from the underlying trend in sales growth.
How STL Works
1. Input Data:
○ The time series data is provided as input.
○ Example: Monthly sales data over the past 3 years.
2. Seasonal Extraction:
○ The seasonal component is extracted using smoothing techniques.
○ This component captures repeating patterns (e.g., higher sales in December).
3. Trend Extraction:
○ The trend component is obtained by removing the seasonal component and
applying smoothing to capture the long-term movement.
4. Residual Calculation:
○ After removing the seasonal and trend components, the remainder (residuals) is
calculated, representing noise or unexplained variation.
Mathematical Representation
For an additive decomposition, the observed series is expressed as the sum of its components:
Y_t = T_t + S_t + R_t
where Y_t is the observed value, T_t the trend component, S_t the seasonal component, and R_t
the residual at time t.
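A minimal sketch of STL decomposition with statsmodels is shown below; the synthetic monthly sales series and period=12 are illustrative assumptions.

```python
# A minimal STL sketch with statsmodels; the monthly sales series is synthetic.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.RandomState(0)
months = pd.date_range("2021-01-01", periods=36, freq="MS")       # 3 years, monthly
sales = pd.Series(
    200 + np.arange(36) * 2                                       # upward trend
    + 20 * np.sin(2 * np.pi * np.arange(36) / 12)                 # yearly seasonality
    + rng.normal(0, 5, 36),                                       # noise
    index=months,
)

result = STL(sales, period=12).fit()
print(result.trend.head())      # trend component T_t
print(result.seasonal.head())   # seasonal component S_t
print(result.resid.head())      # residual component R_t
```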
Serialization of Time Series Data
Serialization refers to saving time-ordered data in a format that can be easily transmitted or
stored for later analysis. Common serialization formats include JSON and CSV. Serialization
ensures data integrity and allows time series data to be analyzed across different systems and
applications.
Example: In financial applications, serialization is used to store historical stock prices in JSON
format for real-time analytics and forecasting.
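A small pandas sketch of this idea is shown below; the file names and price values are placeholders.

```python
# A minimal sketch of serializing a time series to JSON and CSV with pandas.
# File names and price values are placeholders.
import pandas as pd

prices = pd.Series(
    [101.2, 102.5, 100.8],
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
    name="close",
)

prices.to_json("prices.json", date_format="iso")   # JSON for APIs / web applications
prices.to_csv("prices.csv")                        # CSV for spreadsheets / other tools

restored = pd.read_csv("prices.csv", index_col=0, parse_dates=True)   # load it back
print(restored)
```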
Data Extraction: Select key features or variables from the dataset. In time series analysis, this
might involve identifying important dates, events, or anomalies.
Data Analysis: Apply models like ARIMA or machine learning algorithms (e.g., RNNs or
LSTMs) to analyze time series data. These models can learn patterns over time, allowing for
more accurate predictions.
Example: A company might extract sales data during promotional events, analyze trends during
these periods, and predict future sales for upcoming promotions using ARIMA.
Extract Features from a Generated Model (e.g., Height, Average Energy) and Analyze
Them for Prediction
Feature extraction is a crucial step in machine learning and data analysis, where meaningful
information is derived from raw data or model outputs to improve prediction accuracy. In the
context of analyzing a generated model, features like Height, Average Energy, and other
derived metrics can provide insightful information for predictive analytics.
Understanding Features
Features such as Height, Average Energy, and other derived metrics serve as inputs for
predictive models, helping to uncover patterns and trends in the underlying data.
1) Height
● Definition: The peak value or maximum value in the dataset or model output.
● Purpose: Height often signifies the intensity or magnitude of a phenomenon, such as the
highest sales in a month, peak temperature in a year, or the maximum value in a
waveform.
● Significance:
○ Highlights extreme conditions or events.
○ Useful in trend detection and anomaly identification.
● Examples:
○ Stock Market: Height can represent the highest stock price in a time frame.
○ Energy Usage: The highest energy consumption during a day.
○ Waveform Analysis: The peak amplitude in signal processing.
○ Weather Analysis: The highest temperature recorded in a season.
2) Average Energy
● Definition: The mean of the energy values across the data, representing the overall
intensity or activity over time.
● Purpose: Helps in understanding the typical level of activity or variation in the data.
● Significance:
○ Provides a general measure of the dataset's activity over time.
○ Useful for understanding trends and deviations.
● Calculation: For a discrete series x₁, x₂, …, x_N, the average energy can be computed as
E_avg = (1/N) Σ xᵢ² (the mean of the squared values); for non-signal data it is often taken
simply as the mean of the recorded energy values.
● Examples:
○ Audio Signal Processing: Average energy can reflect the loudness or intensity of
a sound.
○ IoT Sensors: Average power consumption of a device over a period.
○ Time Series Data: Average sales per week for retail forecasting.
○ Signal Processing: Average amplitude of a waveform.
○ IoT Applications: Average sensor readings over a day.
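The NumPy sketch below shows how Height and Average Energy might be extracted from a short series of readings; the values are invented, and both the mean-of-squares and simple-mean variants of "average energy" are shown, since either convention may be used.

```python
# A small sketch extracting Height (peak value) and Average Energy from a series.
# The hourly readings are invented for illustration.
import numpy as np

readings = np.array([2.1, 2.4, 3.0, 5.6, 7.2, 6.8, 4.1, 3.2])   # e.g. hourly energy use (kWh)

height = readings.max()                      # peak value in the window
average_energy = np.mean(readings ** 2)      # mean of squared values (signal-energy convention)
mean_level = readings.mean()                 # simple average, another common variant

print(height, average_energy, mean_level)
```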
Other Features
Other commonly extracted features include the mean, variance, duration of peaks, and
frequency-domain characteristics (e.g., obtained with the Fourier Transform).
Analyzing Extracted Features
After extracting features, the next step is to analyze them for their predictive capabilities.
a) Statistical Analysis
● Correlation: Measure how strongly a feature is associated with the target variable.
● Example: Correlate energy peaks (Height) with outdoor temperature.
b) Feature Selection
● Use algorithms like Recursive Feature Elimination (RFE) to select the most relevant
features.
● Example: Choose Height and Average Energy as key predictors for future energy
consumption.
c) Model Training
● Train a predictive model (e.g., linear regression or Random Forest) on the selected
features.
e) Correlation Analysis
● Check the relationship between extracted features and the target variable.
● Example:
○ Height of sales peaks correlating with promotional campaigns.
○ Average energy consumption correlating with seasonal changes.
f) Feature Engineering
g) Model Building
h) Visualization
Scenario:
A smart grid collects data on daily energy usage. The goal is to predict future consumption
patterns.
Feature Extraction:
● Extract the peak daily usage (Height) and the average daily usage (Average Energy)
from the consumption data.
Visualization:
● Plot energy usage trends to highlight peaks (Height) and averages over time.
Predictive Model:
● Train a regression model (e.g., linear regression) on the extracted features to forecast
future consumption.
a) Fourier Transform
● Use decision trees, Random Forest, or LASSO regression to rank feature importance.
Applications of Feature Extraction
a) Healthcare
b) Finance
c) Energy / Smart Grids
● Extract features like peak energy (Height) and average daily usage (Average Energy) to
predict future consumption and identify energy-saving opportunities.
e) Medical Diagnosis
f) Financial Forecasting
g) Predictive Maintenance
Dataset
Assume one year of daily household power consumption readings (one value per day).
Step-by-Step Analysis
1. Feature Extraction:
○ Height: Identify the day with the highest power consumption (e.g., during
summer months with heavy air conditioning usage).
○ Average Energy: Calculate the daily average power consumption over the year.
2. Visualization:
○ Plot a time series of daily consumption, marking the highest points (Height).
○ Create a bar chart of average monthly energy usage.
3. Predictive Analysis:
○ Use features to predict periods of high energy usage.
○ Example Model: Linear regression to predict future daily consumption.
4. Insights:
○ Height might indicate days of peak activity (e.g., holidays or extreme weather).
○ Average Energy provides a baseline for typical consumption, helping identify
anomalies.
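As a rough end-to-end sketch of this scenario, the snippet below generates synthetic daily consumption, extracts monthly Height and Average Energy features, and fits a simple linear regression; all data and model choices are illustrative assumptions.

```python
# A sketch of the smart-grid example: extract monthly features and fit a simple
# regression. The consumption data is synthetic and the model is illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
days = pd.date_range("2023-01-01", periods=365, freq="D")
consumption = pd.Series(
    30 + 10 * np.sin(2 * np.pi * days.dayofyear / 365) + rng.normal(0, 2, 365),
    index=days,
)

# Height and Average Energy per calendar month
monthly = consumption.groupby(consumption.index.month).agg(["max", "mean"])
monthly.columns = ["height", "average_energy"]

X = np.arange(len(monthly)).reshape(-1, 1)                  # month index as predictor
model = LinearRegression().fit(X, monthly["average_energy"])
print(model.predict([[len(monthly)]]))                      # forecast the next month's average
```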