ADS-ch3 2024-25
Data Science
An Elective Course offered by Dept. of Computer Engineering
(Semester-VIII, 2024-25)
By Machhindranath Patil 2
Overview of Model Building
• Model building in data science is a systematic process that involves creating mathematical
representations or algorithms to make predictions or decisions based on data.
• This process is iterative and requires careful consideration at each step to ensure the
model is accurate, reliable, and effective.
1. Defining the Problem: It's crucial to precisely outline the problem to be solved
using data. This step involves understanding the context, the desired outcomes,
and the requirements needed to achieve those outcomes.
• Ex: Suppose you work for an e-commerce company, and the problem is to predict
whether a customer will churn (stop purchasing) in the next month. The goal is to
identify ‘at-risk’ customers so that targeted marketing campaigns can be deployed to
retain them.
2. Data Collection: Collect relevant data that can help solve the problem from various
sources, including databases, APIs, files, or sensors.
• EX: For the churn prediction problem, you might collect data from—
• Customer transaction history (e.g., purchase frequency, average order
value).
• Website interaction data (e.g., time spent on the site, pages visited).
• Customer demographics (e.g., age, location).
• Customer support interactions (e.g., number of complaints resolved).
3. Data Cleaning: This is a data preprocessing step that involves cleaning the data to
handle missing values, outliers, and inconsistencies. Techniques like
normalization, standardization, and feature engineering may be applied to prepare
the data for modeling.
• EX:
• Handle missing values: If some customers have missing age data, you might
impute the missing values with the median age.
• Remove outliers: If a customer has an unusually high number of purchases
(e.g., 1,000 purchases in a month), you might investigate and remove this
outlier.
• Standardize data: Convert all monetary values to the same currency or
normalize numerical features to a common scale.
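The cleaning steps above can be sketched with pandas on a toy customer table (the column names and values here are illustrative, not from any real dataset):

```python
import pandas as pd

# Toy customer data with a missing age and an extreme purchase count
df = pd.DataFrame({
    "age": [25, 32, None, 41, 28],
    "monthly_purchases": [3, 5, 4, 1000, 2],   # 1000 is a suspicious outlier
    "total_spend_usd": [120.0, 300.0, 210.0, 90.0, 60.0],
})

# 1. Impute missing ages with the median age
df["age"] = df["age"].fillna(df["age"].median())

# 2. Cap extreme purchase counts at the 95th percentile
cap = df["monthly_purchases"].quantile(0.95)
df["monthly_purchases"] = df["monthly_purchases"].clip(upper=cap)

# 3. Min-max normalize spending to a common 0-1 scale
s = df["total_spend_usd"]
df["spend_scaled"] = (s - s.min()) / (s.max() - s.min())

print(df)
```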
4. Exploratory Data Analysis: Explore data to understand its characteristics,
relationships, and patterns through visualization, descriptive statistics, and
variable correlations. It helps in uncovering the data's underlying structure and
identifying features for modeling.
• EX:
• Visualize the distribution of customer churn (e.g., using a bar chart to
show the percentage of churned vs. retained customers).
• Analyze correlations: Check if features like "purchase frequency" and
"time spent on the site" are correlated with churn.
• Identify trends: For instance, you might discover that customers who
haven’t made a purchase in the last 30 days are more likely to churn.
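A minimal EDA sketch in pandas, checking whether recent inactivity relates to churn on toy data (the field names `days_since_last_purchase` and `churned` are assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "days_since_last_purchase": [3, 45, 10, 60, 5, 38, 90, 7],
    "churned":                  [0,  1,  0,  1, 0,  1,  1, 0],
})

# Flag customers inactive for more than 30 days
df["inactive_30d"] = df["days_since_last_purchase"] > 30

# Compare churn rates between the two groups
churn_rate = df.groupby("inactive_30d")["churned"].mean()
print(churn_rate)
```

In this toy table the inactive group churns at a much higher rate, which is exactly the kind of trend EDA is meant to surface.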
5. Feature Engineering: Selecting relevant features or creating new ones can significantly
impact model performance. Commonly used techniques include correlation analysis
for feature selection and dimensionality-reduction methods such as principal
component analysis (PCA). This step essentially transforms raw data into meaningful
inputs for the model.
• EX:
• Create new features: Combine "total purchases" and "total spending" to create a
new feature like "average spending per purchase."
• Feature selection: Use correlation analysis to identify the most relevant features.
For instance, if "time spent on the site" has a high correlation with churn, it should
be included in the model.
• Dimensionality reduction: Use PCA to reduce the number of features if the dataset
has too many variables.
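The feature-creation and PCA steps can be sketched as follows on synthetic data (the feature names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 200

total_purchases = rng.integers(1, 50, size=n).astype(float)
total_spending = total_purchases * rng.uniform(10, 20, size=n)

# New feature: average spending per purchase
avg_spend_per_purchase = total_spending / total_purchases

# Stack the (correlated) features and standardize before PCA
X = np.column_stack([total_purchases, total_spending, avg_spend_per_purchase])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Reduce three features to two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)
```

Because "total purchases" and "total spending" are strongly correlated, two components retain most of the variance.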
6. Model Selection: Selection of a suitable machine learning or statistical model is based
on the nature of the problem and the data's features, while taking into account factors
like model complexity, interpretability, and scalability.
• EX:
• For the churn prediction problem, you might consider:
• Logistic Regression: If interpretability is important.
• Random Forest: If the dataset has complex relationships and you want
high accuracy.
• Gradient Boosting: If you need state-of-the-art performance.
• The choice depends on factors like model complexity, interpretability, and
scalability.
7. Model Training: The training process involves adjusting the model parameters to
minimize the difference between the predicted and actual values.
• EX:
• Split the data into training and validation sets (e.g., 80% training, 20%
validation).
• Train a Random Forest model using the training data.
• Use techniques like cross-validation to ensure the model generalizes well
to unseen data.
• Optimize hyper-parameters (e.g., number of trees in the forest) using grid
search or random search.
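A minimal training sketch with scikit-learn, using synthetic data as a stand-in for the churn table; the 80/20 split, Random Forest, and cross-validated grid search over the number of trees follow the steps above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Synthetic stand-in for the churn data: 2 features, binary label
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a learnable rule

# 80/20 train/validation split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune the number of trees with a 5-fold cross-validated grid search
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50]},
    cv=5,
)
grid.fit(X_tr, y_tr)

val_acc = grid.best_estimator_.score(X_val, y_val)
print(grid.best_params_, round(val_acc, 3))
```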
8. Model Evaluation: Evaluate model performance using metrics like accuracy, precision,
recall, and ROC-AUC (Area Under the Receiver Operating Characteristic curve) to
assess generalization to unseen data for real-world reliability and effectiveness.
• EX: For the churn prediction problem, you might evaluate the model using:
• Accuracy: Percentage of correctly predicted churn and non-churn cases.
• Precision: Proportion of predicted churn cases that are actual churn cases.
• Recall: Proportion of actual churn cases correctly identified by the model.
• ROC-AUC: Measures the model's ability to distinguish between churn and non-
churn cases.
• If the model achieves a high ROC-AUC score (e.g., 0.85), it indicates good
performance.
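Accuracy, precision, and recall can be computed directly from the confusion counts; a small sketch on toy labels and predictions:

```python
# Toy churn labels (1 = churned) and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)      # predicted churners who really churned
recall = tp / (tp + fn)         # actual churners the model caught

print(accuracy, precision, recall)
```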
9. Model Deployment: Once a model with a desirable performance is developed, it
can be deployed into production environments where it can make predictions
on new data.
• EX:
• Integrate the churn prediction model into the company’s customer
relationship management (CRM) system.
• Use the model to score customers daily, and flag those at high risk of
churn.
• Automate email campaigns targeting at-risk customers with personalized
offers to reduce churn.
Cross Validation
• Cross validation is a statistical technique used in machine learning to assess how
well a predictive model will perform on an independent dataset.
• The primary objective is to obtain an unbiased evaluation of a model's performance.
• This can be done by partitioning the dataset into subsets. Train the model on some
of these subsets and then evaluate its performance on the remaining data.
• The most common form of cross validation is k-fold cross validation, where the
dataset is divided into k equally sized folds.
• Another widely used cross validation technique is Leave-1-Out Cross Validation, in
which the dataset is partitioned into n subsets, where n equals the number of
instances or data points in the dataset.
k-fold Cross Validation
• k-fold cross-validation is a widely used technique in machine learning to evaluate the
performance of a model. It helps in assessing how well a model generalizes to an
independent dataset, reducing the risk of overfitting or underfitting.
• Steps in k-Fold cross validation are as follows,
1. Divide the Dataset into k Folds: The dataset is split into k equally sized subsets or folds.
For example, if k=5, the dataset is divided into 5 folds, each containing 20% of the data.
2. Train and Evaluate k Times: In each of the k iterations, one fold is held out as the test
set while the model is trained on the remaining k−1 folds; the held-out fold is then
used to evaluate the model.
3. Calculate the Average Performance: After all k iterations, the performance
metrics (e.g., accuracy, precision, recall, etc.) from each fold are averaged to
provide a robust estimate of the model's generalization performance.
• Example: Suppose a dataset has 100 samples and we choose k=5 for k-fold cross-
validation.
1. Split the Dataset: The dataset is divided into k=5 folds, each containing 20
samples. Fold 1: Samples 1–20, Fold 2: Samples 21–40, Fold 3: Samples 41–60,
Fold 4: Samples 61–80, and Fold 5: Samples 81–100.
2. Iterations:
• Iteration 1: Training Set: Folds 2, 3, 4, 5 (Samples 21–100) and Test Set: Fold 1
(Samples 1–20). Train the model on Folds 2–5 and evaluate on Fold 1.
• Iteration 2: Training Set: Folds 1, 3, 4, 5 (Samples 1–20 and 41–100) and Test Set: Fold 2 (Samples 21–40).
Train the model on Folds 1, 3–5 and evaluate on Fold 2.
• Iteration 3: Training Set: Folds 1, 2, 4, 5 (Samples 1–40 and 61–100) and Test Set: Fold 3 (Samples 41–60).
Train the model on Folds 1, 2, 4, 5 and evaluate on Fold 3.
• Iteration 4: Training Set: Folds 1, 2, 3, 5 (Samples 1–60 and 81–100) and Test Set: Fold 4 (Samples 61–80).
Train the model on Folds 1–3, 5 and evaluate on Fold 4.
• Iteration 5: Training Set: Folds 1, 2, 3, 4 (Samples 1–80) and Test Set: Fold 5 (Samples 81–100). Train the
model on Folds 1–4 and evaluate on Fold 5.
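The 100-sample, 5-fold example above can be reproduced with scikit-learn's `KFold`; with `shuffle=False` the folds are the same contiguous blocks of 20 samples:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(100, 1)   # stand-in for 100 samples

kf = KFold(n_splits=5, shuffle=False)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Each iteration trains on 80 samples and tests on the remaining 20
    print(f"Iteration {i}: test samples {test_idx[0] + 1}-{test_idx[-1] + 1}, "
          f"train size {len(train_idx)}")
```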
k-fold Cross Validation : Advantages
1. Reduces Overfitting: By training and testing the model on different subsets of the
data, k-fold cross-validation ensures that the model's performance is not overly
dependent on a single train-test split.
2. Reduces Bias in Performance Estimation: With a single train-test split, the
performance estimate depends heavily on how the data is split; if the test set
is not representative of the overall dataset, the estimate may be biased.
By averaging the performance across k different test sets, k-fold
cross-validation reduces the bias that can arise from a single, potentially
unrepresentative split.
3. Utilizes Data Efficiently: Every data point is used for both training and testing,
making the technique especially useful for small datasets.
k-fold Cross Validation : Limitations
1. Not Suitable for Time-Series or Sequential Data: k-fold cross-validation assumes
that the data points are independent and identically distributed. However, in
time-series or sequential data, the order of data points matters.
2. High Variance with Small Datasets: When the dataset is small, the performance
estimates from k-fold cross-validation can have high variance, especially for
larger values of k.
Leave-1-Out Cross Validation
• Leave-1-Out Cross Validation is a special case of k-fold cross-validation in which
the dataset is partitioned into n subsets, where n equals the number of instances
or data points in the dataset.
• In other words, each data point is treated as a separate fold.
• The process involves training the model on all data points except one.
• The model is then tested on the excluded data point.
• This process is repeated for each data point in the dataset.
• This type of validation provides a nearly unbiased evaluation. However, it can be
computationally expensive, especially for large datasets, as the model must be
trained and evaluated as many times as there are data points.
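A minimal sketch with scikit-learn's `LeaveOneOut` on a tiny dataset, confirming that the model would be trained once per data point:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(10, 1)     # tiny dataset of n = 10 points

loo = LeaveOneOut()
n_iterations = 0
for train_idx, test_idx in loo.split(X):
    assert len(test_idx) == 1              # exactly one held-out point
    assert len(train_idx) == len(X) - 1    # trained on the remaining n-1
    n_iterations += 1

print(n_iterations)   # the model would be trained n = 10 times
```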
Leave-1-Out Cross Validation : Advantages
1. Unbiased Performance Estimate: Since each data point is used exactly once as
the test set, Leave-1-Out provides an almost unbiased estimate of the model's
performance.
2. Maximizes Data Utilization: In each iteration, the model is trained on n-1 data
points, which is the maximum possible training set size for a dataset of size n.
3. Ideal for Small Datasets: Leave-1-Out is especially useful when the dataset is
small, as it provides a reliable performance estimate without sacrificing training
data.
Leave-1-Out Cross Validation : Limitations
1. High Computational Cost: Leave-1-Out requires training the model n times,
which can be computationally expensive, especially for large datasets or
complex models.
2. High Variance in Performance Estimates: Since each test set consists of only one
data point, the performance estimate can have high variance, particularly for
noisy datasets.
3. Not Suitable for Large Datasets: For large datasets, Leave-1-Out becomes
impractical due to the computational cost of training the model n times.
Bootstrapping
• Bootstrapping is a resampling technique in which multiple samples are drawn, with
replacement, from the original dataset to estimate the distribution of a statistic or to
improve the robustness of a model.
• This process involves the following steps:
• Sample with Replacement: Randomly select data points from the original dataset,
allowing the same data point to be selected more than once.
• Create Bootstrap Samples: Generate multiple bootstrap samples by repeating the sampling
process. Each bootstrap sample has the same size as the original dataset but may contain
duplicates and miss some original data points.
• Estimate Statistics or Model Performance: Calculate the statistic of interest (e.g., mean,
median, standard deviation) or train and evaluate a model on each bootstrap sample.
This provides a distribution of the statistic or model performance.
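The three steps can be sketched with NumPy, here bootstrapping the mean of a synthetic sample (the data and the number of resamples are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=50, scale=10, size=200)   # original sample

n_boot = 1000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Draw a bootstrap sample: same size as the data, with replacement
    sample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = sample.mean()

# The spread of the bootstrap means estimates the standard error of the mean
se_estimate = boot_means.std()
print(round(data.mean(), 2), round(se_estimate, 2))
```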
Data Visualization
Bar Graph
• A bar chart is a graphical representation of categorical data using rectangular bars. The length of
each bar is proportional to the value it represents. It's useful for comparing different categories
or groups.
Category        Expense
Rent            ₹ 20,000.00
Utilities       ₹ 4,500.00
School Fees     ₹ 9,000.00
Groceries       ₹ — (shown as 31% of the total in the pie chart)

[Figure: bar chart of the monthly expenses by category (Rent, Groceries, Utilities, School Fees; y-axis 0–20000)]
[Figure: pie chart of the same expense data]
Roll No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Marks 47 59 77 64 66 60 45 65 71 33 42 49 54 43 22 60 65
Roll No. 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
Marks 45 56 57 43 33 23 58 73 65 69 58 71 58 49 21 50 55
Roll No. 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Marks 65 66 32 45 65 34 28 65 74 64 34 45 56 54 53 22
[Figure: pie chart of result classification for the 50 students — Fail: 10, Pass: 10, Second Class: 12, First Class: 13, Distinction: 5]
Stem and Leaf Plot
• It is a data visualization tool that can be used to quickly and easily understand the distribution of a set of data.
• Steps to follow:
• To create a stem and leaf plot, the data is first ordered from least to greatest.
• Then, each data point is split into two parts: the stem and the leaf. The stem is the first digit or digits of
the data point, and the leaf is the last digit.
• Example 1: Examination Result: Represent the following result data with Stem and Leaf plot.
Roll No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Marks 47 59 77 64 66 60 45 65 71 33 42 49 54 43 22 60 65
Roll No. 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
Marks 45 56 57 43 33 23 58 73 65 69 58 71 58 49 21 50 55
Roll No. 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Marks 65 66 32 45 65 34 28 65 74 64 34 45 56 54 53 22
Sorted Marks: 21 22 22 23 28 32 33 33 34 34 42 43 43 45 45 45 45 47 49 49 50 53 54 54 55 56 56 57 58 58 58 59 60 60 64 64 65 65 65 65 65 65 66 66 69 71 71 73 74 77
Stem Leaves
2 1,2,2,3,8
3 2,3,3,4,4
4 2,3,3,5,5,5,5,7,9,9
5 0,3,4,4,5,6,6,7,8,8,8,9
6 0,0,4,4,5,5,5,5,5,5,6,6,9
7 1,1,3,4,7
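The plot above can be reproduced in a few lines of Python (the marks list is the data from the table):

```python
from collections import defaultdict

marks = [47, 59, 77, 64, 66, 60, 45, 65, 71, 33, 42, 49, 54, 43, 22, 60, 65,
         45, 56, 57, 43, 33, 23, 58, 73, 65, 69, 58, 71, 58, 49, 21, 50, 55,
         65, 66, 32, 45, 65, 34, 28, 65, 74, 64, 34, 45, 56, 54, 53, 22]

# Order the data, then split each value into stem (tens) and leaf (units)
plot = defaultdict(list)
for m in sorted(marks):
    plot[m // 10].append(m % 10)

for stem in sorted(plot):
    print(stem, "|", ",".join(str(leaf) for leaf in plot[stem]))
```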
Dot Plot
• It is a simple and effective way to display the distribution and frequency of a dataset.
• Each dot corresponds to a single data point, and when there are multiple data points with the
same value, they are stacked on top of each other.
[Figure: dot plot of the marks — x-axis: Marks (20–80), y-axis: number of students]
Scatter Plot
[Figure: scatter plot of Marks (20–70) versus Roll No. (1–50)]
Line Graph

Date of Sep. 2023:   1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Temperature (°C):    25   27   26.5 27   27.5 26   26.5 27   28   27.5 25   25.5 25   26   26.5

[Figure: line graph of the daily temperature, 1–15 Sep. 2023 (y-axis: 24–28 °C)]
Frequency Distribution
• Various graphs are used to visualize the frequency distribution of a dataset, such as histograms,
bar charts, frequency polygons, and ogive curves.
• Histograms:
• They are created by dividing the data into bins, or intervals, and then plotting the number of
values in each bin on a bar graph.
• The x-axis of a histogram represents the bins, and the y-axis represents the frequency.
[Figure: histogram of the marks — x-axis: Value (bins of width 10, from 20 to 80), y-axis: Frequency]
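The histogram counts for the marks data from the earlier example can be computed with `numpy.histogram`, using bins of width 10 from 20 to 80:

```python
import numpy as np

marks = [47, 59, 77, 64, 66, 60, 45, 65, 71, 33, 42, 49, 54, 43, 22, 60, 65,
         45, 56, 57, 43, 33, 23, 58, 73, 65, 69, 58, 71, 58, 49, 21, 50, 55,
         65, 66, 32, 45, 65, 34, 28, 65, 74, 64, 34, 45, 56, 54, 53, 22]

# Bin the marks into intervals of width 10, from 20 to 80
counts, edges = np.histogram(marks, bins=np.arange(20, 90, 10))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi}: {'#' * c} ({c})")
```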
• Frequency Polygons:
• They are drawn by joining the midpoints of the tops of the histogram bars with line
segments, making it easy to see the overall shape of the distribution.

[Figure: frequency polygon over the same bins (20–80)]
• Ogive Curves:
• They show the percentage of values in the dataset that are less than or equal to each value on
the x-axis.

[Figure: ogive (cumulative frequency curve) of the marks — x-axis: Marks (20–80), y-axis: Cumulative Freq.]