Unit-4 Part 1 Preparing Model

The document outlines the preparation and exploration of data for machine learning, focusing on different data types, data quality issues, and remediation techniques. It emphasizes the importance of data preprocessing, including handling missing values and outliers, to enhance model accuracy and efficiency. Key steps in data preprocessing are also detailed, such as dataset acquisition, data cleaning, and feature scaling.

Uploaded by

harshlpatel.4274
Unit-2 Preparing Model
PROF. ATMIYA PATEL
Understanding about Data
Different Types of Data
Exploring Structure of Data
Two basic data types:
1. Numerical
2. Categorical

Standard datasets come with a data dictionary, e.g. those in the UCI repository (University of California, Irvine).
Link: https://archive.ics.uci.edu/ml/index.php
Exploring Numerical Data
Steps:
1. Understanding central tendency (Ex. Mean, Median)
2. Understanding data spread
a. Dispersion of data
b. Position of different data values

3. Plotting and exploring numerical data


◦ Two plots for numerical data:
a. Box plot
b. Histogram
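The steps above can be sketched in a few lines of Python. This is only an illustration using the standard library: the `marks` sample is hypothetical, the function names `summarize` and `histogram_counts` are made up for this sketch, and the quartiles use the "inclusive" method, which is one of several conventions.

```python
import statistics
from collections import Counter

def summarize(values):
    """Central tendency and spread of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),        # central tendency
        "median": statistics.median(values),    # central tendency
        "variance": statistics.pvariance(values),  # dispersion
        "std_dev": statistics.pstdev(values),      # dispersion
        "min": min(values),                     # position of values
        "q1": q1,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,                         # box length in a box plot
    }

def histogram_counts(values, bin_width):
    """Count values per fixed-width bin; returns {bin_start: count}."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical sample: marks of ten students
marks = [35, 42, 47, 51, 55, 58, 60, 63, 70, 92]
summary = summarize(marks)
histogram = histogram_counts(marks, bin_width=20)

# A rough text histogram, one row per bin
for start, count in histogram.items():
    print(f"{start:>3} to {start + 19:<3} {'#' * count}")
```

The five values min, q1, median, q3 and max are what a box plot draws; the bin counts are what a histogram draws.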
Exploring Categorical Data
Exploring relationship between variables
Two-way cross tabulations
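A two-way cross tabulation counts how often each pair of values of two categorical attributes occurs together. A minimal sketch, assuming a hypothetical list of car records with `origin` and `cylinders` fields (the data and the helper name `cross_tab` are not from the slides):

```python
from collections import Counter

def cross_tab(rows, row_key, col_key):
    """Build a two-way table: table[row_value][col_value] -> count."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_labels = sorted({r[row_key] for r in rows})
    col_labels = sorted({r[col_key] for r in rows})
    # Counter returns 0 for pairs that never occur together
    return {rl: {cl: counts[(rl, cl)] for cl in col_labels} for rl in row_labels}

# Hypothetical records
cars = [
    {"origin": "USA", "cylinders": 8},
    {"origin": "USA", "cylinders": 8},
    {"origin": "USA", "cylinders": 6},
    {"origin": "Japan", "cylinders": 4},
    {"origin": "Japan", "cylinders": 4},
    {"origin": "Europe", "cylinders": 4},
    {"origin": "Europe", "cylinders": 6},
]

table = cross_tab(cars, "origin", "cylinders")
for origin, row in table.items():
    print(origin, row)
```

Each row of the table shows how one attribute is distributed within a single value of the other, which is what makes a relationship between the two visible.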
Data Quality and Remediation
Data quality: a major factor in deciding the success of machine learning. However, it is not realistic to expect that the data will be flawless.
Two types of problems:
1. Certain data elements without a value (missing values).
2. Data elements whose values differ markedly from the other elements (outliers).

Factors behind data quality issues
Incorrect sample set selection: the data may not reflect normal or regular conditions.
◦ Ex. Using festival sale data to predict future sales.

Error in data collection: results in outliers and missing values.
◦ Ex. When a person or a group of people collects the data manually, values may be recorded wrongly (outliers).
◦ Ex. If data is not recorded at all (missing values).
Data remediation
Data quality issues must be remediated if the learning activity is to achieve the right amount of efficiency.
Remediation actions:
An incorrect sample set can be remedied by a proper sampling technique. However, human error cannot be eliminated completely.
Outliers and missing values can be handled by following the steps below.
Handling outliers
Outliers are data elements with an abnormally high or low value, which may impact prediction accuracy.
Approaches to handle outliers:
1.) Remove outliers: if only a few records are outliers, remove them.
2.) Imputation: replace the outlier value with the mean, median or mode.
3.) Capping: for values that lie outside the 1.5 × IQR limits, cap them by replacing observations below the lower limit with the value of the 5th percentile and observations above the upper limit with the value of the 95th percentile.
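The approaches above can be sketched as follows. This is an illustration under stated assumptions: the sample `marks` is hypothetical, the helper names are made up, the limits are taken as Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, and capping follows the common convention of using the 5th percentile below the lower limit and the 95th percentile above the upper limit.

```python
import statistics

def iqr_limits(values, k=1.5):
    """Lower and upper limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Approach 1: drop values outside the limits."""
    lower, upper = iqr_limits(values, k)
    return [v for v in values if lower <= v <= upper]

def impute_outliers(values, k=1.5):
    """Approach 2: replace outliers with the median of all values."""
    lower, upper = iqr_limits(values, k)
    median = statistics.median(values)
    return [v if lower <= v <= upper else median for v in values]

def cap_outliers(values, k=1.5):
    """Approach 3: cap low outliers at P5 and high outliers at P95."""
    lower, upper = iqr_limits(values, k)
    percentiles = statistics.quantiles(values, n=100, method="inclusive")
    p5, p95 = percentiles[4], percentiles[94]
    return [p5 if v < lower else p95 if v > upper else v for v in values]

# Hypothetical sample: 92 lies above the upper limit
marks = [35, 42, 47, 51, 55, 58, 60, 63, 70, 92]
lower, upper = iqr_limits(marks)
removed = remove_outliers(marks)
imputed = impute_outliers(marks)
capped = cap_outliers(marks)
print(lower, upper)
print(removed, imputed, capped, sep="\n")
```

Removal shrinks the dataset, while imputation and capping keep every record but change the outlying value; which one fits depends on how many outliers there are and whether they are errors or genuine extreme values.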
Handling Missing Values
1.) Eliminate records having missing values of the data elements
If the number of such records is within a tolerable limit, this is an effective approach.
2.) Imputing missing values
Assign a value to the data elements with missing values. The mean, median or mode is most frequently used as the assigned value.
3.) Estimate missing values
If there are data points similar to the ones with missing attribute values, then the attribute
values from those similar data points can be planted in place of the missing value.
Ex. Weight of a student having age 12 years and height 5 ft. is missing. Then the weight of any
other student having age close to 12 years and height close to 5 ft. can be assigned.
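A minimal sketch of the three approaches, using the student example above. The records, field names and helper functions are hypothetical, and the "similar data point" is picked by a plain squared distance over age and height without any scaling, which is a simplification.

```python
import statistics

# Hypothetical records; None marks a missing value
students = [
    {"age": 12, "height_ft": 5.0, "weight_kg": None},
    {"age": 12, "height_ft": 4.9, "weight_kg": 40.0},
    {"age": 15, "height_ft": 5.6, "weight_kg": 55.0},
    {"age": 11, "height_ft": 4.5, "weight_kg": 34.0},
]

def drop_missing(rows, field):
    """Approach 1: eliminate records where the field is missing."""
    return [r for r in rows if r[field] is not None]

def impute_mean(rows, field):
    """Approach 2: fill missing values with the mean of the known ones."""
    mean = statistics.mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: mean} if r[field] is None else r for r in rows]

def estimate_from_similar(rows, field, similarity_fields):
    """Approach 3: copy the value from the most similar complete record."""
    known = [r for r in rows if r[field] is not None]

    def distance(a, b):
        return sum((a[f] - b[f]) ** 2 for f in similarity_fields)

    result = []
    for r in rows:
        if r[field] is None:
            nearest = min(known, key=lambda k: distance(r, k))
            r = {**r, field: nearest[field]}
        result.append(r)
    return result

dropped = drop_missing(students, "weight_kg")
mean_filled = impute_mean(students, "weight_kg")
estimated = estimate_from_similar(students, "weight_kg", ["age", "height_ft"])
print(mean_filled[0]["weight_kg"], estimated[0]["weight_kg"])
```

Here the mean fills in the average of all known weights, while the estimate copies the weight of the 12-year-old student who is 4.9 ft tall, the closest match on age and height.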
Data Preprocessing
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
It is the first and a crucial step in creating a machine learning model.
◦ In a machine learning project, we do not always come across clean and formatted data. Before doing any operation with data, it must be cleaned and put into a formatted form. The data preprocessing task is used for this.
Why do we need Data Preprocessing?
Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be used directly by machine learning models.
Data preprocessing is the set of tasks required to clean the data and make it suitable for a machine learning model. It also increases the accuracy and efficiency of the model.
Data Preprocessing steps:
1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding Missing Data
5. Encoding Categorical Data
6. Splitting dataset into training and test set
7. Feature scaling
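Steps 5 to 7 can be sketched in plain Python. The slides do not name specific libraries, so this sketch uses only the standard library; the sample values and the helper names `one_hot`, `train_test_split` and `min_max_scale` are hypothetical, and min-max scaling is just one possible form of feature scaling.

```python
import random

def one_hot(values):
    """Step 5: encode a categorical column as 0/1 indicator columns."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

def train_test_split(rows, test_ratio=0.2, seed=0):
    """Step 6: shuffle a copy of the rows and split off a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def min_max_scale(train_values, values):
    """Step 7: scale values to [0, 1] using the training set's range."""
    lo, hi = min(train_values), max(train_values)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]

# Hypothetical columns
countries = ["India", "France", "India"]
encoded, categories = one_hot(countries)

rows = [10, 20, 30, 40, 50]
train, test = train_test_split(rows, test_ratio=0.2)

scaled = min_max_scale([10, 20, 30], [10, 20, 30])
print(categories, encoded)
print(train, test)
print(scaled)
```

The scaling range is computed from the training values only and then applied to any other values, so that information from the test set does not leak into the preprocessing.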
Thank you…
