
NAME: ADITYA SHARMA

REGISTRATION ID: 221070001


BATCH: A

EXPERIMENT NO: 1

AIM: To analyze sales data for trends, patterns, and insights to enhance decision-making and improve business strategies.

Theory:
Data preprocessing is essential before building a machine learning model. It
involves checking for missing values, outliers, duplicates, and garbage values.
The process includes exploratory data analysis, normalization, and encoding
categorical data. Effective preprocessing ensures a clean dataset, improving the
model's performance and accuracy.

What are the steps in data preprocessing?


Data preprocessing is a critical step in preparing your dataset for analysis or
modeling. Here’s a summary of the key steps involved:

1. Sanity Check:
   ○ Verify the dataset for any inconsistencies, such as incorrect data types or unexpected values.
2. Handling Missing Values:
   ○ Removal: Delete rows or columns with excessive missing values.
   ○ Imputation: Fill in missing values using methods like:
      ■ Mean, median, or mode for numerical data.
      ■ K-Nearest Neighbors (KNN) for more complex imputation.
      ■ Mode for categorical data.
3. Outlier Treatment:
   ○ Identify outliers using statistical methods (e.g., box plots).
   ○ Cap outliers at the upper and lower whiskers of the data distribution to minimize their impact.
4. Duplicate Removal:
   ○ Check for and remove duplicate entries in the dataset to ensure data integrity.
5. Garbage Value Treatment:
   ○ Replace erroneous or nonsensical values with appropriate statistics (mean, median, or mode).
6. Encoding Categorical Data:
   ○ Convert categorical variables into numerical format using:
      ■ One-Hot Encoding: Create binary columns for each category.
      ■ Label Encoding: Assign numerical values to ordinal categories.
7. Normalization/Standardization (if necessary):
   ○ Scale numerical features to a standard range (e.g., 0-1) or standardize to a mean of 0 and a standard deviation of 1.
8. Exploratory Data Analysis (EDA):
   ○ Visualize and analyze the data to understand distributions, relationships, and patterns.

These steps help ensure that the dataset is clean, consistent, and ready for modeling. A brief scaling sketch for step 7 (normalization/standardization) follows below.
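To make step 7 concrete, here is a minimal sketch of min-max scaling and standardization with scikit-learn. The small DataFrame and the column names (sales, quantity) are illustrative assumptions, not values from the experiment's dataset.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative data; column names are assumed
df = pd.DataFrame({"sales": [120.0, 340.0, 95.0, 410.0],
                   "quantity": [3.0, 8.0, 2.0, 10.0]})

# Min-max scaling: rescale each feature to the 0-1 range
df[["sales_minmax", "quantity_minmax"]] = MinMaxScaler().fit_transform(df[["sales", "quantity"]])

# Standardization: rescale each feature to mean 0 and standard deviation 1
df[["sales_std", "quantity_std"]] = StandardScaler().fit_transform(df[["sales", "quantity"]])

print(df)

Min-max scaling suits features that must stay within a bounded range, while standardization is a common default when features are roughly bell-shaped.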

How do you handle missing values in a dataset?



Handling missing values is a crucial step in data preprocessing. Here are some
common strategies:
1. Remove Rows: If the dataset is large and the number of missing values is small, you can remove the rows with missing values.
2. Imputation:
   ○ Mean/Median/Mode Imputation: Replace missing values with the mean (for numerical data), median (for skewed data), or mode (for categorical data).
   ○ K-Nearest Neighbors (KNN) Imputation: Use a KNN algorithm to estimate missing values based on similar instances.
   ○ Predictive Modeling: Use regression or classification models to predict and fill in missing values.
3. Fill with Constants: Depending on the context, you might fill missing values with a constant (e.g., 0, "unknown").
4. Interpolation: For time-series data, you can interpolate missing values based on surrounding data points.
5. Use Algorithms that Support Missing Values: Some machine learning algorithms can handle missing data inherently.
6. Flagging: Add a new binary feature to indicate whether the data was missing, which can help the model learn patterns related to missingness (several of these options are illustrated in the sketch after this list).
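The following is a minimal, hedged sketch of mean/median/mode imputation, interpolation, and flagging with pandas. The column names revenue and region are assumptions used only for illustration.

import numpy as np
import pandas as pd

# Illustrative data with missing entries; column names are assumed
df = pd.DataFrame({
    "revenue": [200.0, np.nan, 150.0, 400.0, np.nan],
    "region": ["North", "South", None, "North", "East"],
})

# Flagging: record which rows were missing before any values are filled in
df["revenue_was_missing"] = df["revenue"].isna().astype(int)

# Numerical imputation: median is robust to skew (mean also works for symmetric data)
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# Categorical imputation: fill with the mode (most frequent category)
df["region"] = df["region"].fillna(df["region"].mode()[0])

# Interpolation (time-series alternative, shown only as a comment):
# df["revenue"] = df["revenue"].interpolate()

print(df)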

What is the importance of exploratory data analysis?

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process,
serving several important purposes:

1. Understanding Data Structure: EDA helps you gain insights into the dataset's characteristics, including the types of variables, their ranges, and how they interact with each other.
2. Detecting Anomalies and Outliers: By visualizing data distributions and relationships, EDA allows you to identify anomalies, outliers, and possible errors that may affect analysis and modeling.
3. Assessing Assumptions: EDA helps validate the assumptions underlying various statistical models, such as normality, homoscedasticity, and independence, ensuring the appropriateness of chosen analysis methods.
4. Identifying Patterns and Trends: Through visualizations (like histograms, scatter plots, and box plots), EDA can reveal patterns, trends, correlations, or relationships that might inform model selection or feature engineering.
5. Informed Feature Selection: Insights gained during EDA help in determining which features are most relevant to the target variable, guiding feature engineering and reduction decisions.
6. Hypothesis Generation: EDA can generate hypotheses for further analysis by highlighting interesting trends and relationships in the data.
7. Guiding Data Preprocessing: Insights from EDA inform necessary preprocessing steps, such as handling missing values or scaling features, making the data more suitable for modeling.
8. Improving Communication: Using visualizations, EDA can effectively communicate findings and insights to stakeholders, facilitating better decision-making.
Data preprocessing is essential before the data is fed into a machine learning model. It involves checking for missing values, outliers, and duplicates, and understanding the relationships within the data.

The first step in data pre-processing is importing necessary libraries and reading
the dataset into a DataFrame. This ensures that the data is ready for further
analysis.
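A minimal sketch of this first step is shown below; the file name sales_data.csv is an assumption standing in for the actual sales dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the sales dataset into a DataFrame (file name is assumed)
df = pd.read_csv("sales_data.csv")

# Quick first look at the data
print(df.shape)
print(df.head())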
Exploratory Data Analysis (EDA) aids in understanding data patterns and
relationships between variables. EDA is a crucial step before any machine
learning model is built.
Sanity checks are crucial for identifying missing values, outliers, and duplicates
within the dataset. These checks help maintain data integrity before model
training.
Data preprocessing is crucial to ensure clean datasets, particularly by identifying
and addressing missing values before analysis. This step is essential to avoid
potential biases in modeling results.
Identifying missing values involves checking each column's data type and
counting instances where data is absent. This helps in understanding the extent
of missing data across different features.
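Continuing with the DataFrame loaded above, a short sketch of this check in pandas:

# Data type of each column
print(df.dtypes)

# Number of missing values per column
print(df.isnull().sum())

# Percentage of missing values, which helps decide between imputation and deletion
print((df.isnull().mean() * 100).round(2))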
Deciding how to handle missing values can include options like deletion of rows
or columns based on the percentage of missing data. This decision is vital to
maintain the integrity of the analysis.
After handling missing values, assessing duplicates and garbage values further
ensures data quality. Checking for unique values is necessary to confirm the
absence of duplicates in the dataset.
Identifying and handling garbage values in a dataset is crucial for accurate data
analysis. Unique value counts help in detecting such anomalies effectively and
ensuring data quality.
Data cleaning techniques, such as imputing or replacing garbage values with
median or mode, are vital for preparing datasets for accurate analysis and
modeling.
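A minimal sketch of these checks with pandas is given below; the column name region and the garbage token "???" are purely illustrative assumptions.

# Duplicate removal
print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

# Garbage value detection: inspect the unique values of each text column
for col in df.select_dtypes(include="object").columns:
    print(col, df[col].unique())

# Example cleanup: replace an assumed garbage token with the column's mode
df["region"] = df["region"].replace("???", df["region"].mode()[0])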
Descriptive statistics provide insight into data distribution, including measures like
mean, standard deviation, and percentiles, essential for understanding dataset
characteristics.
Exploratory data analysis (EDA) utilizes visualizations, such as histograms, to
understand data distributions and identify potential outliers or patterns within the
data.
Data visualization techniques like histograms and box plots are essential in
exploratory data analysis for understanding data distribution and identifying
outliers. These methods provide insights into the characteristics of the dataset,
guiding further analysis.
Histograms help visualize the frequency distribution of numerical data, allowing
for quick assessments of how data points are spread across different ranges.
This aids in identifying patterns or anomalies.
Box plots provide a clear representation of the data's central tendency and
variability, highlighting any outliers. They are invaluable for comparing
distributions across different groups.
Scatter plots are utilized to explore relationships between the dependent variable
and independent variables. They help in determining whether a linear or
non-linear relationship exists for modeling purposes.
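As a hedged illustration of these three plot types with matplotlib and seaborn, assuming numerical columns named sales and advertising_spend (both assumptions, not confirmed column names):

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram: frequency distribution of a numerical column
df["sales"].hist(bins=30)
plt.title("Distribution of sales")
plt.show()

# Box plot: central tendency, spread, and outliers
sns.boxplot(x=df["sales"])
plt.title("Box plot of sales")
plt.show()

# Scatter plot: relationship between an independent variable and the target
plt.scatter(df["advertising_spend"], df["sales"])
plt.xlabel("advertising_spend")
plt.ylabel("sales")
plt.show()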
A correlation matrix, visualized as a heat map, is used to analyze relationships between numerical variables. Understanding these correlations is crucial for predictive modeling and exploratory data analysis.

Correlation coefficients range from -1 to +1; coefficients whose absolute value exceeds roughly 0.5 indicate a strong relationship between two variables. This helps in identifying significant patterns in the dataset.

The heat map visualization enhances understanding of correlation by representing data relationships with colors; high correlation is indicated by specific color gradients.
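A minimal sketch of the correlation matrix and heat map, again assuming the DataFrame df from the earlier steps:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numerical columns only
corr = df.select_dtypes(include="number").corr()

# Heat map: colour gradients make strong positive/negative correlations easy to spot
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of numerical features")
plt.show()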

Missing value treatment is essential for preparing data before modeling. Various
methods, including median and mean imputation, help ensure that the dataset is
complete.

Handling missing values in datasets is crucial for accurate data analysis. Different strategies, such as using the mean, median, or mode, are required based on the data type.
Continuous numerical data requires different treatment than discrete data when it
comes to filling missing values. The choice between median and mean depends
on the data distribution.

Target variables should not undergo missing value treatment, to avoid introducing artificial data that could skew results. Special care is needed when handling these variables.

Imputing categorical data involves filling missing values with the mode,
particularly when multiple modes are present, requiring careful selection. This
ensures the representation remains accurate.

Missing values in numerical columns can also be handled with the KNN imputer method, which fills gaps by averaging similar data points. This approach is effective for numerical columns and preserves data integrity for analysis.

The KNN imputer operates by calculating the average of each record's nearest neighbours to fill its missing values, which improves the accuracy of the dataset. This method is widely used in data preprocessing.
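A hedged sketch of KNN imputation with scikit-learn's KNNImputer; the small DataFrame, its column names, and n_neighbors=2 are illustrative choices, not settings from the experiment.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative numerical data with gaps; column names are assumed
df_num = pd.DataFrame({
    "sales":    [200.0, np.nan, 150.0, 400.0, 320.0],
    "quantity": [5.0, 8.0, np.nan, 12.0, 9.0],
})

# Each missing value is replaced by the average of its nearest neighbours,
# where distance is computed on the other observed numerical columns
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df_num), columns=df_num.columns)
print(df_imputed)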

After treating missing values, the next step is outlier treatment, since outliers can adversely affect model performance. Understanding and managing outliers is crucial for accurate predictions.

The outlier treatment involves capping extreme values using whiskers, ensuring
that they do not skew the model results. The process is vital for maintaining data
quality during analysis.

Outlier treatment is essential for continuous numerical data to ensure accurate analysis. The decision to apply this treatment varies based on the nature of the data involved.

Different types of data require distinct approaches for outlier treatment. Continuous numerical data, such as GDP, is suitable for such analysis, while discrete data is not.
Implementing outlier treatment requires selecting specific columns for analysis.
The process involves defining a whisker function to effectively identify and handle
outliers in the dataset.

The method for calculating outliers involves determining the interquartile range
(IQR). This consists of obtaining the 25th and 75th percentiles to find lower and
upper bounds.
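A minimal sketch of such a whisker function and IQR-based capping; the column list is an assumption used only for illustration.

import numpy as np

def whiskers(series):
    """Return the lower and upper whisker values using the 1.5 * IQR rule."""
    q1, q3 = np.percentile(series.dropna(), [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap outliers in selected continuous numerical columns (column names assumed)
for col in ["sales", "quantity"]:
    lower, upper = whiskers(df[col])
    df[col] = df[col].clip(lower=lower, upper=upper)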

Outlier treatment is essential for cleaning data, ensuring that statistical analyses
yield accurate results. Proper identification and capping of outliers prevent
skewed interpretations in datasets.

The process of calculating lower and upper whiskers is crucial in determining outliers in a dataset. This is done through specific functions that analyze the data's distribution.

Effective outlier treatment significantly improves visual representations of the data, such as box plots. Removing outliers allows for a clearer understanding of the data's overall trends.

Data cleaning also involves handling duplicates and garbage values to maintain
data integrity. Techniques like dropping duplicates and replacing invalid entries
ensure high-quality datasets.

Data preprocessing is crucial for preparing datasets for machine learning. This
includes encoding categorical variables, handling missing values, and treating
outliers effectively.

Creating dummy variables is essential when dealing with categorical features in a dataset. It converts these features into a numerical format suitable for modeling.
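A hedged sketch of both encodings with pandas; the columns region (nominal) and segment (ordinal) are assumptions chosen only to illustrate the two techniques.

import pandas as pd

# Illustrative categorical data; column names and categories are assumed
df_cat = pd.DataFrame({
    "region":  ["North", "South", "East", "North"],
    "segment": ["Low", "High", "Medium", "Low"],
})

# One-hot encoding: one binary (dummy) column per category; drop_first avoids redundancy
df_encoded = pd.get_dummies(df_cat, columns=["region"], drop_first=True)

# Label encoding for an ordinal category: map each level to an integer that preserves order
order = {"Low": 0, "Medium": 1, "High": 2}
df_encoded["segment"] = df_encoded["segment"].map(order)

print(df_encoded)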

Visualization tools like histograms and scatter plots are used to understand data
relationships better. This exploratory data analysis is vital for informed
decision-making.
Link:
https://www.kaggle.com/code/markmedhat/sales-analysis-and-visualization

CONCLUSION:
In conclusion, effective data preprocessing is essential for building robust machine learning models. It involves a series of systematic steps to clean and prepare the data, which ultimately leads to improved model performance and more reliable insights.
