ADS-ch3 2024-25
Data Science
An Elective Course offered by Dept. of Computer Engineering
(Semester-VIII, 2024-25)
By Machhindranath Patil 2
Overview of Model Building
• Model building in data science is a systematic process that involves creating mathematical
representations or algorithms to make predictions or decisions based on data.
• This process is iterative and requires careful consideration at each step to ensure the
model is accurate, reliable, and effective.
1. Defining the Problem: It's crucial to precisely outline the problem to be solved
using data. This step involves understanding the context, the desired outcomes,
and the requirements needed to achieve those outcomes.
• Ex: Suppose you work for an e-commerce company, and the problem is to predict
whether a customer will churn (stop purchasing) in the next month. The goal is to
identify ‘at-risk’ customers so that targeted marketing campaigns can be deployed to
retain them.
2. Data Collection: Collect relevant data that can help solve the problem from various
sources, including databases, APIs, files, or sensors.
• EX: For the churn prediction problem, you might collect data from—
• Customer transaction history (e.g., purchase frequency, average order
value).
• Website interaction data (e.g., time spent on the site, pages visited).
• Customer demographics (e.g., age, location).
• Customer support interactions (e.g., number of complaints resolved).
3. Data Cleaning: This is a data preprocessing step that involves cleaning the data to
handle missing values, outliers, and inconsistencies. Techniques like
normalization, standardization, and feature engineering may be applied to prepare
the data for modeling.
• EX:
• Handle missing values: If some customers have missing age data, you might
impute the missing values with the median age.
• Remove outliers: If a customer has an unusually high number of purchases
(e.g., 1,000 purchases in a month), you might investigate and remove this
outlier.
• Standardize data: Convert all monetary values to the same currency or
normalize numerical features to a common scale.
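The cleaning steps above can be sketched with pandas on a toy customer table (the column names and values here are illustrative, not from any real dataset):

```python
import pandas as pd

# Toy customer data with a missing age and an extreme purchase count
df = pd.DataFrame({
    "age": [25, 32, None, 41, 28],
    "monthly_purchases": [3, 5, 4, 1000, 2],   # 1000 is a suspicious outlier
    "total_spend_usd": [120.0, 300.0, 210.0, 90.0, 60.0],
})

# 1. Impute missing ages with the median age
df["age"] = df["age"].fillna(df["age"].median())

# 2. Cap extreme purchase counts at the 95th percentile
cap = df["monthly_purchases"].quantile(0.95)
df["monthly_purchases"] = df["monthly_purchases"].clip(upper=cap)

# 3. Min-max normalize spending to a common 0-1 scale
s = df["total_spend_usd"]
df["spend_scaled"] = (s - s.min()) / (s.max() - s.min())

print(df)
```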
4. Exploratory Data Analysis: Explore data to understand its characteristics,
relationships, and patterns through visualization, descriptive statistics, and
variable correlations. It helps in uncovering the data's underlying structure and
identifying features for modeling.
• EX:
• Visualize the distribution of customer churn (e.g., using a bar chart to
show the percentage of churned vs. retained customers).
• Analyze correlations: Check if features like "purchase frequency" and
"time spent on the site" are correlated with churn.
• Identify trends: For instance, you might discover that customers who
haven’t made a purchase in the last 30 days are more likely to churn.
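A minimal EDA sketch in pandas, checking whether recent inactivity relates to churn on toy data (the field names `days_since_last_purchase` and `churned` are assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "days_since_last_purchase": [3, 45, 10, 60, 5, 38, 90, 7],
    "churned":                  [0,  1,  0,  1, 0,  1,  1, 0],
})

# Flag customers inactive for more than 30 days
df["inactive_30d"] = df["days_since_last_purchase"] > 30

# Compare churn rates between the two groups
churn_rate = df.groupby("inactive_30d")["churned"].mean()
print(churn_rate)
```

In this toy table the inactive group churns at a much higher rate, which is exactly the kind of trend EDA is meant to surface.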
5. Feature Engineering: Selecting relevant features or creating new ones can significantly
impact model performance. Commonly used techniques include correlation analysis
for feature selection and dimensionality-reduction methods such as principal
component analysis (PCA). This step essentially transforms raw data into meaningful
inputs for the model.
• EX:
• Create new features: Combine "total purchases" and "total spending" to create a
new feature like "average spending per purchase."
• Feature selection: Use correlation analysis to identify the most relevant features.
For instance, if "time spent on the site" has a high correlation with churn, it should
be included in the model.
• Dimensionality reduction: Use PCA to reduce the number of features if the dataset
has too many variables.
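The feature-creation and PCA steps can be sketched as follows on synthetic data (the feature names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 200

total_purchases = rng.integers(1, 50, size=n).astype(float)
total_spending = total_purchases * rng.uniform(10, 20, size=n)

# New feature: average spending per purchase
avg_spend_per_purchase = total_spending / total_purchases

# Stack the (correlated) features and standardize before PCA
X = np.column_stack([total_purchases, total_spending, avg_spend_per_purchase])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Reduce three features to two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)
```

Because "total purchases" and "total spending" are strongly correlated, two components retain most of the variance.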
6. Model Selection: Selection of a suitable machine learning or statistical model is based
on the nature of the problem and the data's features, while taking into account factors
like model complexity, interpretability, and scalability.
• EX:
• For the churn prediction problem, you might consider:
• Logistic Regression: If interpretability is important.
• Random Forest: If the dataset has complex relationships and you want
high accuracy.
• Gradient Boosting: If you need state-of-the-art performance.
• The choice depends on factors like model complexity, interpretability, and
scalability.
7. Model Training: The training process involves adjusting the model parameters to
minimize the difference between the predicted and actual values.
• EX:
• Split the data into training and validation sets (e.g., 80% training, 20%
validation).
• Train a Random Forest model using the training data.
• Use techniques like cross-validation to ensure the model generalizes well
to unseen data.
• Optimize hyper-parameters (e.g., number of trees in the forest) using grid
search or random search.
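A minimal training sketch with scikit-learn, using synthetic data as a stand-in for the churn table; the 80/20 split, Random Forest, and cross-validated grid search over the number of trees follow the steps above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Synthetic stand-in for the churn data: 2 features, binary label
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a learnable rule

# 80/20 train/validation split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune the number of trees with a 5-fold cross-validated grid search
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50]},
    cv=5,
)
grid.fit(X_tr, y_tr)

val_acc = grid.best_estimator_.score(X_val, y_val)
print(grid.best_params_, round(val_acc, 3))
```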
8. Model Evaluation: Evaluate model performance using metrics like accuracy, precision,
recall, and ROC-AUC (Area Under the Receiver Operating Characteristic curve) to
assess generalization to unseen data for real-world reliability and effectiveness.
• EX: For the churn prediction problem, you might evaluate the model using:
• Accuracy: Percentage of correctly predicted churn and non-churn cases.
• Precision: Proportion of predicted churn cases that are actual churn cases.
• Recall: Proportion of actual churn cases correctly identified by the model.
• ROC-AUC: Measures the model's ability to distinguish between churn and non-
churn cases.
• If the model achieves a high ROC-AUC score (e.g., 0.85), it indicates good
performance.
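Accuracy, precision, and recall can be computed directly from the confusion counts; a small sketch on toy labels and predictions:

```python
# Toy churn labels (1 = churned) and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)      # predicted churners who really churned
recall = tp / (tp + fn)         # actual churners the model caught

print(accuracy, precision, recall)
```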
9. Model Deployment: Once a model with a desirable performance is developed, it
can be deployed into production environments where it can make predictions
on new data.
• EX:
• Integrate the churn prediction model into the company’s customer
relationship management (CRM) system.
• Use the model to score customers daily, and flag those at high risk of
churn.
• Automate email campaigns targeting at-risk customers with personalized
offers to reduce churn.
Cross Validation
• Cross validation is a statistical technique used in machine learning to assess how
well a predictive model will perform on an independent dataset.
• The primary objective is to obtain an unbiased evaluation of a model's performance.
• This can be done by partitioning the dataset into subsets. Train the model on some
of these subsets and then evaluate its performance on the remaining data.
• The most common form of cross validation is k-fold cross validation, where the
dataset is divided into k equally sized folds.
• Another widely used cross validation technique is Leave-1-Out Cross Validation, in
which the dataset is partitioned into n subsets, where n equals the number of
instances or data points in the dataset.
k-fold Cross Validation
• k-fold cross-validation is a widely used technique in machine learning to evaluate the
performance of a model. It helps in assessing how well a model generalizes to an
independent dataset, reducing the risk of overfitting or underfitting.
• Steps in k-Fold cross validation are as follows,
1. Divide the Dataset into k Folds: The dataset is split into k equally sized subsets or folds.
For example, if k=5, the dataset is divided into 5 folds, each containing 20% of the data.
2. Train and Evaluate k Times: In each of the k iterations, one fold is held out as the test
set while the model is trained on the remaining k−1 folds; the held-out fold is then
used to evaluate the model.
3. Calculate the Average Performance: After all k iterations, the performance
metrics (e.g., accuracy, precision, recall, etc.) from each fold are averaged to
provide a robust estimate of the model's generalization performance.
• Example: Suppose a dataset has 100 samples and we choose k=5 for k-fold cross-
validation.
1. Split the Dataset: The dataset is divided into k=5 folds, each containing 20
samples. Fold 1: Samples 1–20, Fold 2: Samples 21–40, Fold 3: Samples 41–60,
Fold 4: Samples 61–80, and Fold 5: Samples 81–100.
2. Iterations:
• Iteration 1: Training Set: Folds 2, 3, 4, 5 (Samples 21–100) and Test Set: Fold 1
(Samples 1–20). Train the model on Folds 2–5 and evaluate on Fold 1.
• Iteration 2: Training Set: Folds 1, 3, 4, 5 (Samples 1–20 and 41–100) and Test Set: Fold 2 (Samples 21–40).
Train the model on Folds 1, 3–5 and evaluate on Fold 2.
• Iteration 3: Training Set: Folds 1, 2, 4, 5 (Samples 1–40 and 61–100) and Test Set: Fold 3 (Samples 41–60).
Train the model on Folds 1, 2, 4, 5 and evaluate on Fold 3.
• Iteration 4: Training Set: Folds 1, 2, 3, 5 (Samples 1–60 and 81–100) and Test Set: Fold 4 (Samples 61–80).
Train the model on Folds 1–3, 5 and evaluate on Fold 4.
• Iteration 5: Training Set: Folds 1, 2, 3, 4 (Samples 1–80) and Test Set: Fold 5 (Samples 81–100). Train the
model on Folds 1–4 and evaluate on Fold 5.
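The 100-sample, 5-fold example above can be reproduced with scikit-learn's `KFold`; with `shuffle=False` the folds are the same contiguous blocks of 20 samples:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(100, 1)   # stand-in for 100 samples

kf = KFold(n_splits=5, shuffle=False)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Each iteration trains on 80 samples and tests on the remaining 20
    print(f"Iteration {i}: test samples {test_idx[0] + 1}-{test_idx[-1] + 1}, "
          f"train size {len(train_idx)}")
```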
k-fold Cross Validation : Advantages
1. Reduces Overfitting: By training and testing the model on different subsets of the
data, k-fold cross-validation ensures that the model's performance is not overly
dependent on a single train-test split.
2. Reduces Bias in Performance Estimation: With a single train-test split, the
performance estimate depends heavily on how the data is split; if the test set
is not representative of the overall dataset, the estimate may be biased.
By averaging the performance across k different test sets, k-fold
cross-validation reduces the bias that can arise from a single, potentially
unrepresentative split.
3. Utilizes Data Efficiently: Every data point is used for both training and testing,
making the technique especially useful for small datasets.
k-fold Cross Validation : Limitations
1. Not Suitable for Time-Series or Sequential Data: k-fold cross-validation assumes
that the data points are independent and identically distributed. However, in
time-series or sequential data, the order of data points matters.
2. High Variance with Small Datasets: When the dataset is small, the performance
estimates from k-fold cross-validation can have high variance, especially for
larger values of k.
Leave-1-Out Cross Validation
• Leave-1-Out Cross Validation is a special case of k-fold cross-validation in which
the dataset is partitioned into n subsets, where n equals the number of instances
or data points in the dataset.
• In other words, each data point is treated as a separate fold.
• The process involves training the model on all data points except one.
• The model is then tested on the excluded data point.
• This process is repeated for each data point in the dataset.
• This type of validation provides a nearly unbiased evaluation. However, it can be
computationally expensive, especially for large datasets, as the model must be
trained and evaluated as many times as there are data points.
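A minimal sketch with scikit-learn's `LeaveOneOut` on a tiny dataset, confirming that the model would be trained once per data point:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(10, 1)     # tiny dataset of n = 10 points

loo = LeaveOneOut()
n_iterations = 0
for train_idx, test_idx in loo.split(X):
    assert len(test_idx) == 1              # exactly one held-out point
    assert len(train_idx) == len(X) - 1    # trained on the remaining n-1
    n_iterations += 1

print(n_iterations)   # the model would be trained n = 10 times
```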
Leave-1-Out Cross Validation : Advantages
1. Unbiased Performance Estimate: Since each data point is used exactly once as
the test set, Leave-1-Out provides an almost unbiased estimate of the model's
performance.
2. Maximizes Data Utilization: In each iteration, the model is trained on n-1 data
points, which is the maximum possible training set size for a dataset of size n.
3. Ideal for Small Datasets: Leave-1-Out is especially useful when the dataset is
small, as it provides a reliable performance estimate without sacrificing training
data.
Leave-1-Out Cross Validation : Limitations
1. High Computational Cost: Leave-1-Out requires training the model n times,
which can be computationally expensive, especially for large datasets or
complex models.
2. High Variance in Performance Estimates: Since each test set consists of only one
data point, the performance estimate can have high variance, particularly for
noisy datasets.
3. Not Suitable for Large Datasets: For large datasets, Leave-1-Out becomes
impractical due to the computational cost of training the model n times.
Bootstrapping
• Bootstrapping is a resampling technique in which multiple samples are drawn, with
replacement, from the original dataset to estimate the distribution of a statistic or to
improve the robustness of a model.
• This process involves the following steps:
• Sample with Replacement: Randomly select data points from the original dataset,
allowing the same data point to be selected more than once.
• Create Bootstrap Samples: Generate multiple bootstrap samples by repeating the sampling
process. Each bootstrap sample has the same size as the original dataset but may contain
duplicates and miss some original data points.
• Estimate Statistics or Model Performance: Calculate the statistic of interest (e.g., mean,
median, standard deviation) or train and evaluate a model on each bootstrap sample.
This provides a distribution of the statistic or model performance.
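The three steps can be sketched with NumPy, here bootstrapping the mean of a synthetic sample (the data and the number of resamples are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=50, scale=10, size=200)   # original sample

n_boot = 1000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Draw a bootstrap sample: same size as the data, with replacement
    sample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = sample.mean()

# The spread of the bootstrap means estimates the standard error of the mean
se_estimate = boot_means.std()
print(round(data.mean(), 2), round(se_estimate, 2))
```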
Data Visualization
Bar Graph
• A bar chart is a graphical representation of categorical data using rectangular bars. The length of
each bar is proportional to the value it represents. It's useful for comparing different categories
or groups.
Category        Expense
Rent            ₹ 20,000.00
Utilities       ₹ 4,500.00
School Fees     ₹ 9,000.00
Groceries       ₹ — (shown as 31% of the total in the pie chart)

[Figure: bar chart of the monthly expenses by category (Rent, Groceries, Utilities, School Fees; y-axis 0–20000)]
[Figure: pie chart of the same expense data]
Roll No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Marks 47 59 77 64 66 60 45 65 71 33 42 49 54 43 22 60 65
Roll No. 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
Marks 45 56 57 43 33 23 58 73 65 69 58 71 58 49 21 50 55
Roll No. 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Marks 65 66 32 45 65 34 28 65 74 64 34 45 56 54 53 22
[Figure: pie chart of result classification for the 50 students — Fail: 10, Pass: 10, Second Class: 12, First Class: 13, Distinction: 5]
Stem and Leaf Plot
• It is a data visualization tool that can be used to quickly and easily understand the distribution of a set of data.
• Steps to follow:
• To create a stem and leaf plot, the data is first ordered from least to greatest.
• Then, each data point is split into two parts: the stem and the leaf. The stem is the first digit or digits of
the data point, and the leaf is the last digit.
• Example 1: Examination Result: Represent the following result data with Stem and Leaf plot.
Roll No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Marks 47 59 77 64 66 60 45 65 71 33 42 49 54 43 22 60 65
Roll No. 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
Marks 45 56 57 43 33 23 58 73 65 69 58 71 58 49 21 50 55
Roll No. 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Marks 65 66 32 45 65 34 28 65 74 64 34 45 56 54 53 22
Sorted Marks: 21 22 22 23 28 32 33 33 34 34 42 43 43 45 45 45 45 47 49 49 50 53 54 54 55 56 56 57 58 58 58 59 60 60 64 64 65 65 65 65 65 65 66 66 69 71 71 73 74 77
Stem Leaves
2 1,2,2,3,8
3 2,3,3,4,4
4 2,3,3,5,5,5,5,7,9,9
5 0,3,4,4,5,6,6,7,8,8,8,9
6 0,0,4,4,5,5,5,5,5,5,6,6,9
7 1,1,3,4,7
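The plot above can be reproduced in a few lines of Python (the marks list is the data from the table):

```python
from collections import defaultdict

marks = [47, 59, 77, 64, 66, 60, 45, 65, 71, 33, 42, 49, 54, 43, 22, 60, 65,
         45, 56, 57, 43, 33, 23, 58, 73, 65, 69, 58, 71, 58, 49, 21, 50, 55,
         65, 66, 32, 45, 65, 34, 28, 65, 74, 64, 34, 45, 56, 54, 53, 22]

# Order the data, then split each value into stem (tens) and leaf (units)
plot = defaultdict(list)
for m in sorted(marks):
    plot[m // 10].append(m % 10)

for stem in sorted(plot):
    print(stem, "|", ",".join(str(leaf) for leaf in plot[stem]))
```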
Dot Plot
• It is a simple and effective way to display the distribution and frequency of a dataset.
• Each dot corresponds to a single data point, and when there are multiple data points with the
same value, they are stacked on top of each other.
[Figure: dot plot of the marks — x-axis: Marks (20–80), y-axis: number of students]
Scatter Plot
[Figure: scatter plot of Marks (20–70) versus Roll No. (1–50)]
Line Graph

Date of Sep. 2023:   1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Temperature (°C):    25   27   26.5 27   27.5 26   26.5 27   28   27.5 25   25.5 25   26   26.5

[Figure: line graph of the daily temperature, 1–15 Sep. 2023 (y-axis: 24–28 °C)]
Frequency Distribution
• Various graphs are used to visualize the frequency distribution of a dataset, such as histograms,
bar charts, frequency polygons, and ogive curves.
• Histograms:
• They are created by dividing the data into bins, or intervals, and then plotting the number of
values in each bin on a bar graph.
• The x-axis of a histogram represents the bins, and the y-axis represents the frequency.
[Figure: histogram of the marks — x-axis: Value (bins of width 10, from 20 to 80), y-axis: Frequency]
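The histogram counts for the marks data from the earlier example can be computed with `numpy.histogram`, using bins of width 10 from 20 to 80:

```python
import numpy as np

marks = [47, 59, 77, 64, 66, 60, 45, 65, 71, 33, 42, 49, 54, 43, 22, 60, 65,
         45, 56, 57, 43, 33, 23, 58, 73, 65, 69, 58, 71, 58, 49, 21, 50, 55,
         65, 66, 32, 45, 65, 34, 28, 65, 74, 64, 34, 45, 56, 54, 53, 22]

# Bin the marks into intervals of width 10, from 20 to 80
counts, edges = np.histogram(marks, bins=np.arange(20, 90, 10))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi}: {'#' * c} ({c})")
```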
• Frequency Polygons:
• They are drawn by joining the midpoints of the tops of the histogram bars with line
segments, making it easy to see the overall shape of the distribution.

[Figure: frequency polygon over the same bins (20–80)]
• Ogive Curves:
• They show the percentage of values in the dataset that are less than or equal to each value on
the x-axis.

[Figure: ogive (cumulative frequency curve) of the marks — x-axis: Marks (20–80), y-axis: Cumulative Freq.]