
Class4 DataPreprocessing DiscriptiveAnalytics 19aug2021

1) The document discusses data preprocessing techniques which are important for cleaning and preparing raw data for analysis. 2) It describes common data preprocessing steps like data cleaning, integration, transformation and reduction which are used to handle issues like missing values, noise, inconsistencies and reduce data size. 3) Descriptive analytics techniques are also covered, including measuring the central tendency of data using the mean, median and mode to understand characteristics of numeric attributes.

Uploaded by

siddharth0208yadav


19-08-2021

Data Preprocessing

Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge

[Figure: data science pipeline with the stages Data Collection, Data Modeling (Machine Learning) and Inference; Data Preprocessing comprises Database, Data Cleaning and Cleansing, and Feature Representation]

Need for Data Preprocessing

• Real-world data tend to be incomplete, noisy and
inconsistent due to their huge size and their likely origin
from multiple heterogeneous sources
• Preprocessing is important to clean the data
• Low-quality data will lead to low-quality analysis results
• If the users believe the data is of low quality (dirty), they
are unlikely to trust the results of any data analytics that
has been applied to it
• Low-quality data can confuse analytic procedures that use
machine learning techniques, resulting in unreliable
output
• Data could be
– incomplete,
– noisy and
– inconsistent
• These are common properties of large real-world databases

Data Preprocessing Techniques


• Data cleaning:
– Applied to
• identify the missing values,
• fill in missing values,
• remove noise and
• correct inconsistency in the data
• Data integration:
– It merges data from multiple sources into a coherent
data source
• Data transformation:
– Transforming the entries of data to a common format
– Techniques like normalization and standardization are
applied to transform the data to another form to
improve the accuracy and efficiency of machine learning
(ML) algorithms involving distance measures
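
The two transformations named above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function names are hypothetical, min-max scaling to [0, 1] and z-score scaling are assumed to be the intended meanings of "normalization" and "standardization", and the sample standard deviation (N − 1 denominator) is used to agree with the variance formula later in the deck. The data are the "Years of experience" and "Salary" values used in the later slides.

```python
from statistics import mean, stdev


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_standardize(values):
    """Shift to zero mean and scale by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # N - 1 denominator (assumption, see lead-in)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

years_scaled = min_max_normalize(years)
salary_scaled = min_max_normalize(salary)
years_z = z_score_standardize(years)
```

After scaling, both attributes lie in the same range, so a distance measure is no longer dominated by the attribute with the larger raw values (here, salary).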

Data Preprocessing Techniques

• Data reduction:
– Applied to obtain a reduced representation that is much
smaller in volume, yet produces almost the same analytical
results
– It can reduce the data size by
• Aggregation
• Eliminating irrelevant and redundant features (attributes)
through correlation analysis
• Reducing dimension
• These techniques are not mutually exclusive; they
may work together
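
As a rough sketch of how correlation analysis can flag a redundant attribute, the code below computes the Pearson correlation coefficient between two numeric attributes. The function name and the 0.9 threshold are hypothetical choices for illustration; the data are the "Years of experience" and "Salary" values used in the later slides.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

r = pearson_correlation(years, salary)

# Hypothetical rule of thumb: treat attributes with |r| above 0.9 as
# candidates for redundancy, to be reviewed before dropping one of them.
REDUNDANCY_THRESHOLD = 0.9
is_redundant_candidate = abs(r) > REDUNDANCY_THRESHOLD
```

A coefficient close to +1 or −1 means one attribute is nearly a linear function of the other, so keeping both adds little information.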


Descriptive Data Summarization

(Descriptive Analytics)
• It serves as a foundation for data preprocessing
• It helps us to study the general characteristics of data
and identify the presence of noise or outliers
• Data characteristics:
– Central tendency of data
• Centre of the data
• Measuring mean, median and mode
– Dispersion of data
• The degree to which numerical data tend to spread
• Measuring range, quartiles, interquartile range (IQR), the
five-number summary and standard deviation

Descriptive Analytics:
Measuring Central Tendency
• Mean:
– Let x1, x2, …, xN be a set of N values in an attribute.
Mean of this set of values is given by
$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$

Number of records (tuples), N = 10

Years of experience    Salary (in Rs 1000)
3                      30
8                      57
9                      64
13                     72
3                      36
6                      43
11                     59
21                     90
1                      20
16                     83

Sum: 91                Sum: 554
Mean (Sum/10): 9.1     Mean (Sum/10): 55.4
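
The mean calculation can be checked with a few lines of code. This is a minimal sketch; the function name is an illustrative choice, and the data are the ten records from the slide.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

mean_years = arithmetic_mean(years)    # 91 / 10
mean_salary = arithmetic_mean(salary)  # 554 / 10
```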

Descriptive Analytics:
Measuring Central Tendency
• Median:
– Let x1, x2, …, xN be a set of N values in an attribute.
The median is the "middle" number (value), when those
numbers are listed in order from smallest to greatest.
– Median is the value separating the higher half from the
lower half of a data sample
– For a given data of N values in sorted order
• If N is odd, then median is the middle value of the
ordered list
• If N is even, then median is the average of middle two
values

Illustration: Median of attribute “Years of experience”

Number of records (tuples), N = 10
Years of experience: 3, 8, 9, 13, 3, 6, 11, 21, 1, 16
Sort the values in “Years of experience”: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21

Median: (8 + 9) / 2 = 8.5
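
The odd/even rule for the median can be sketched as below. The function name is an illustrative choice; the data are the ten "Years of experience" values from the earlier table.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]

sorted_years = sorted(years)   # [1, 3, 3, 6, 8, 9, 11, 13, 16, 21]
median_years = median(years)   # N = 10 is even: (8 + 9) / 2 = 8.5
```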


Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data

Illustration: Mode of attribute “Years of experience”
Assume that values are discrete numerical

Number of records (tuples), N = 10

Years of experience    Salary (in Rs 1000)
3                      30
8                      57
9                      64
13                     72
3                      36
6                      43
11                     59
21                     90
1                      20
16                     83

Mode: 3
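
For discrete values, the mode can be found by counting occurrences. The sketch below returns every value that reaches the highest count, since an attribute may have more than one mode; the function name is an illustrative choice.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]

mode_years = modes(years)  # 3 occurs twice, every other value once
```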


Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data
• The mode of a continuous variable is the value at which
the probability density function, f(x), is at a maximum.
• It is a value that is most likely to lie within the same
interval as the outcome

Number of samples, N = 61

Date       Temperature
Sept 1     25.47
Sept 2     26.19
Sept 3     25.17
Sept 4     24.30
Sept 5     24.07
Sept 6     21.21
Sept 7     23.49
Sept 8     21.79
Sept 9     25.09
Sept 10    25.39
---        ---
Oct 29     23.06
Oct 30     23.72
Oct 31     23.02

Mean: 22.85
Mode: (22.32 – 23.62]
Median: 22.89
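
One common way to report a mode interval for a continuous attribute is to bin the values and pick the most populated bin. The sketch below assumes equal-width bins; the function name and the choice of 4 bins are hypothetical, and the slide does not say how its interval was obtained. Only the 13 temperatures visible on the slide are used, so the result is not expected to match the interval computed from all 61 samples.

```python
def histogram_mode(values, num_bins):
    """Return (left_edge, right_edge, count) of the most populated equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:  # all values identical
        return lo, hi, len(values)
    counts = [0] * num_bins
    for v in values:
        # The maximum value falls exactly on the last edge; keep it in the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    best = max(range(num_bins), key=counts.__getitem__)
    return lo + best * width, lo + (best + 1) * width, counts[best]


# Only the values shown on the slide (Sept 1-10 and Oct 29-31).
temperatures = [25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49,
                21.79, 25.09, 25.39, 23.06, 23.72, 23.02]

left, right, count = histogram_mode(temperatures, num_bins=4)
```

The reported interval depends on the number of bins, so it is worth trying a few bin counts before quoting a mode for continuous data.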

Descriptive Analytics:
Measuring Central Tendency

[Figure: three distribution shapes: Symmetric Data, Positively Skewed Data, Negatively Skewed Data]


Descriptive Analytics:
Measuring Dispersion of Data
• The degree to which numerical data tend to spread
• It is also called variance (in symmetrically
distributed data)
• Common measures of data dispersion:
– Range
– The five-number summary (based on quartiles)
– The interquartile range (IQR)
– Standard deviation
• Range: The range of a finite set of values is the
difference between the maximum and minimum
values

Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values in an attribute
• The kth percentile of a set of data in numerical order is
the value of xn having the property that k percent of data
entries lie at or below xn
– Example: 50th percentile
• The value (number) below which 50% of the data entries
(values) lie
– Those 50% of entries have values equal to or less than
the 50th percentile
– Example: 25th percentile
• The value (number) below which 25% of the data entries
(values) lie
– Those 25% of entries have values equal to or less than
the 25th percentile
• Middle element between minimum and 50th percentile
– Example: 75th percentile
• The value (number) below which 75% of the data entries
(values) lie
– Those 75% of entries have values equal to or less than
the 75th percentile
• Middle element between maximum and 50th percentile

Illustration: 50th, 25th and 75th percentiles of attribute “Years of experience”

Number of records (tuples), N = 10
Years of experience: 3, 8, 9, 13, 3, 6, 11, 21, 1, 16
Sort the values in “Years of experience”: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21

50th Percentile: (8 + 9) / 2 = 8.5
25th Percentile: 3 (middle of 1, 3, 3, 6, 8)
75th Percentile: 13 (middle of 9, 11, 13, 16, 21)
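
The "middle element between the minimum (or maximum) and the 50th percentile" rule can be sketched as a median-of-halves computation. This is one of several quartile conventions, and other tools may interpolate differently and return slightly different values. The function names are illustrative, and excluding the middle value from both halves when N is odd is an assumption, since the slides only show an even N.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd N the middle value is left out of both halves (assumption).
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return _median(lower), _median(ordered), _median(upper)


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]

q1, q2, q3 = quartiles(years)
# sorted: 1, 3, 3, 6, 8 | 9, 11, 13, 16, 21
# q1 = 3 (25th percentile), q2 = 8.5 (50th), q3 = 13 (75th)
```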

Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values in an attribute
• The kth percentile of a set of data in numerical order is the
value of xn having the property that k percent of data
entries lie at or below xn
• Median is the 50th percentile (the second quartile (Q2))
• The first quartile (Q1): It is the 25th percentile
• The third quartile (Q3): It is the 75th percentile
– The quartiles including median give some indication of
centre, spread and shape of distribution
• The distance between the Q1 and Q3 is a simple
measure of spread
• Interquartile range (IQR): Distance between the first
quartile (Q1) and third quartile (Q3)
IQR = Q3 – Q1


Descriptive Analytics:
Measuring Dispersion of Data
• The five-number summary of distribution:
– It consists of minimum value, Q1, median, Q3 and maximum
value
• Box plots are the popular way of visualising distribution

[Figure: box plot labelled with the largest observation (max, top whisker), Q3, Q2/Median, Q1, the IQR (box from Q1 to Q3) and the smallest observation (min, bottom whisker)]

• The whiskers terminate at
– Smallest (minimum) or largest (maximum) observations or
– the most extreme observations occurring within 1.5 x IQR of
respective quartiles (Q1 and Q3)
• If the distribution is normal, the limits Q1 – (1.5 x IQR) and
Q3 + (1.5 x IQR) lie about 2.7σ from the mean
– It is close to 3σ from mean which is a standard in normal distribution

[Figure: box plot with outliers marked beyond the whiskers, above Q3 + (1.5 x IQR) and below Q1 – (1.5 x IQR)]

• Lower bound: Q1 – (1.5 x IQR) Upper bound: Q3 + (1.5 x IQR)
• Outliers: Any datapoint less than the lower bound or
larger than the upper bound
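
The five-number summary and the 1.5 x IQR outlier bounds can be sketched together as below. The quartiles use the median-of-halves rule, which is one convention among several. All function names are illustrative, and the extra value 45 is a hypothetical record added only to show an outlier being flagged. The last lines check where the upper bound falls for a standard normal distribution.

```python
from statistics import NormalDist


def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], _median(lower), _median(ordered), _median(upper), ordered[-1]


def iqr_bounds(q1, q3):
    """Lower bound Q1 - 1.5 x IQR and upper bound Q3 + 1.5 x IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Values below the lower bound or above the upper bound."""
    _, q1, _, q3, _ = five_number_summary(values)
    lower, upper = iqr_bounds(q1, q3)
    return [v for v in values if v < lower or v > upper]


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]

summary = five_number_summary(years)         # (1, 3, 8.5, 13, 21)
bounds = iqr_bounds(summary[1], summary[3])  # IQR = 10, so (-12.0, 28.0)
outliers = find_outliers(years)              # nothing lies outside the bounds

# Hypothetical extra record with 45 years of experience.
outliers_with_extra = find_outliers(years + [45])

# Standard normal: Q3 = z, IQR = 2z, so Q3 + 1.5 x IQR = 4z, about 2.7 sigma.
z75 = NormalDist().inv_cdf(0.75)
upper_bound_in_sigma = z75 + 1.5 * (2 * z75)
```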

Descriptive Analytics:
Measuring Dispersion of Data
• Variance (σ²):
– Let x1, x2, …, xN be a set of N values in an attribute.
Variance (σ²) of this set of values is given by
$\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2$, where μ = mean
• Standard deviation (σ):
– The square root of variance: $\sigma = \sqrt{\text{Variance}}$
• Standard deviation measures the spread about the
mean
– It is used when the mean is chosen as the measure of
centre, especially in symmetric distribution
• The quartiles Q1 and Q3 measure the spread about
median
– Q1 and Q3 are used when the median is chosen as the
measure of centre, especially in skewed distribution
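
The variance and standard deviation can be computed directly from the definition with the N − 1 denominator. This is a minimal sketch with illustrative function names, applied to the ten "Years of experience" values from the earlier slides.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(values))


years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]

variance_years = sample_variance(years)  # 358.9 / 9, about 39.88
std_years = sample_std(years)            # about 6.31
```

Python's `statistics.variance` and `statistics.stdev` use the same N − 1 convention and can serve as a cross-check.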