DSV-S8 Data Cleaning
Session - 08
AIM OF THE SESSION
To equip participants with the knowledge and skills necessary to effectively preprocess and clean datasets in preparation for analysis.
INSTRUCTIONAL OBJECTIVES
LEARNING OUTCOMES
DIRTY DATA COMES FROM
• Incomplete data comes from
• "n/a" values recorded at collection time
• differing assumptions between the time the data was collected and the time it is analyzed
• human/hardware/software problems
• Noisy data comes from the process of data
• collection
• entry
• transmission
• Inconsistent data comes from
• Different data sources
• Functional dependency violation
IMPORTANCE OF DATA PREPROCESSING
MULTI-DIMENSIONAL MEASURE OF DATA QUALITY
MAJOR TASKS IN DATA PREPROCESSING
A. Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
B. Data integration
• Integration of multiple databases, data cubes, or files
C. Data transformation
• Normalization and aggregation
D. Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
E. Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
FORMS OF DATA PREPROCESSING
DATA CLEANING
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
DATA CLEANING TASKS
1. DATA ACQUISITION
METADATA
• Field types:
• binary, nominal (categorical), ordinal, numeric, …
• For nominal fields: tables translating codes to full
descriptions
• Field role:
• input : inputs for modelling
• target : output
• id/auxiliary : keep, but not use for modelling
• ignore : don’t use for modelling
• weight : instance weight
• …
• Field descriptions
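As a minimal sketch (with hypothetical column names), the field roles above can be recorded explicitly, so preprocessing code knows which columns to model, which to keep as identifiers, and which to ignore:

```python
import pandas as pd

# Hypothetical dataset with one column per field role.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],       # id/auxiliary: keep, don't model
    "gender": ["M", "F", "F"],            # nominal input
    "grade": ["A", "B+", "A-"],           # ordinal input
    "income": [52000.0, None, 61000.0],   # numeric input
    "churned": [0, 1, 0],                 # target
})

# Field roles, declared explicitly rather than inferred.
roles = {
    "customer_id": "id",
    "gender": "input",
    "grade": "input",
    "income": "input",
    "churned": "target",
}

# Select only the columns meant for modelling.
input_cols = [c for c, r in roles.items() if r == "input"]
print(df[input_cols].dtypes)
```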
REFORMATTING
2. FILL IN MISSING VALUES
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
• Missing data may be due to
• Equipment malfunction
• Inconsistent with other recorded data and thus deleted
• Data not entered due to misunderstanding
• Certain data may not be considered important at the time of entry
• History or changes of the data were not recorded
• Missing data may need to be inferred.
HANDLING MISSING DATA
• Ignore the tuple: usually done when the class label is missing (assuming the task is
classification); not effective when the percentage of missing values per attribute
varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Imputation: Use the attribute mean to fill in the missing value, or use the
attribute mean for all samples belonging to the same class to fill in the missing
value: smarter
• Use the most probable value to fill in the missing value: inference-based such as
Bayesian formula or decision tree
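A minimal sketch of the fill-in strategies above, using pandas on hypothetical data (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 40.0, None, 60.0],
})

# Global constant: fill with a sentinel value; simple but can bias models.
filled_const = df["income"].fillna(-1)

# Attribute mean over all samples.
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean per class: smarter, uses the class label.
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(filled_class_mean)
```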
3. UNIFIED DATE FORMAT
UNIFIED DATE FORMAT OPTIONS
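As an illustration (a sketch with hypothetical input strings; pandas' to_datetime is one common tool for this), each source format is parsed explicitly and re-rendered in a single ISO format:

```python
import pandas as pd

# The same date written three different ways.
raw = pd.Series(["2023-01-15", "15/01/2023", "Jan 15, 2023"])

# Parse each representation explicitly, then render one unified format.
parsed = pd.concat([
    pd.to_datetime(raw[[0]], format="%Y-%m-%d"),
    pd.to_datetime(raw[[1]], format="%d/%m/%Y"),
    pd.to_datetime(raw[[2]], format="%b %d, %Y"),
])
unified = parsed.dt.strftime("%Y-%m-%d")
print(unified.tolist())  # ['2023-01-15', '2023-01-15', '2023-01-15']
```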
4. CONVERSION NOMINAL TO NUMERIC
• Some tools can deal with nominal values internally
• Other methods (neural nets, regression, nearest neighbor)
require only numeric inputs
• To use nominal fields in such methods, we need to convert them
to numeric values
• Q: Why not ignore nominal fields altogether?
• A: They may contain valuable information
• Different strategies for binary, ordered, multi-valued nominal
fields
CONVERSION BINARY TO NUMERIC
• Binary fields
• E.g. Gender = M, F
• Convert to Field_0_1 with 0, 1 values
• e.g. Gender = M → Gender_0_1 = 0
• Gender = F → Gender_0_1 = 1
CONVERSION ORDERED TO NUMERIC
• Ordered attributes (e.g. Grade) can be converted to numbers
preserving natural order, e.g.
• A → 4.0
• A- → 3.7
• B+ → 3.3
• B → 3.0
• Q: Why is it important to preserve natural order?
• A: To allow meaningful comparisons, e.g. Grade > 3.5
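A minimal sketch of these strategies with pandas, on hypothetical fields; one-hot encoding (get_dummies) is a common choice for multi-valued nominals, since arbitrary integer codes would invent a false order:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],            # binary nominal
    "grade": ["A", "B+", "A-", "B"],           # ordered nominal
    "city": ["Delhi", "Pune", "Delhi", "Goa"]  # multi-valued nominal
})

# Binary: map the two values to 0/1.
df["gender_0_1"] = df["gender"].map({"M": 0, "F": 1})

# Ordered: map to numbers that preserve the natural order,
# so comparisons such as grade > 3.5 stay meaningful.
grade_points = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}
df["grade_num"] = df["grade"].map(grade_points)

# Multi-valued: one 0/1 indicator column per value.
df = pd.get_dummies(df, columns=["city"])
print(df)
```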
5. IDENTIFY OUTLIERS AND SMOOTH OUT NOISY DATA
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitation
• Inconsistency in naming convention
• Other data problems that require data cleaning
• Duplicate records
• Incomplete data
• Inconsistent data
HANDLING NOISY DATA
• Binning method:
• first sort data and partition into (equi-depth) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human
• Regression
• smooth by fitting the data into regression functions
SIMPLE DISCRETIZATION METHODS: BINNING
• Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B-A)/N.
• The most straightforward
• But outliers may dominate presentation
• Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately the same
number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
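A sketch contrasting the two partitionings with pandas' cut (equal-width) and qcut (equal-depth), reusing the price data from the next slide:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

# Equal-width: N intervals of width W = (B - A) / N.
equal_width = pd.cut(prices, bins=N)

# Equal-depth: N intervals with roughly the same number of samples each.
equal_depth = pd.qcut(prices, q=N)

print(equal_width.value_counts().sort_index())  # counts vary per bin
print(equal_depth.value_counts().sort_index())  # 4 samples per bin
```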
BINNING METHODS FOR DATA SMOOTHING
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
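A minimal sketch reproducing this smoothing in plain Python (rounding bin means to integers, as the slide does):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of min/max.
by_bounds = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```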
DATA SMOOTHING – REGRESSION
• Linear regression involves
finding the “best” line to fit
two attributes (or variables), so
that one attribute can be used
to predict the other.
• Multiple linear regression is an
extension of linear regression,
where more than two attributes
are involved and the data are
fit to a multidimensional
surface.
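A minimal sketch of regression-based smoothing on hypothetical data: fit a least-squares line with numpy's polyfit, then replace the noisy values with the fitted ones:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])  # roughly y = 2x, with noise

# Least-squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# The fitted values are the smoothed version of y.
y_smooth = slope * x + intercept
print(np.round(y_smooth, 2))
```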
DATA SMOOTHING – OUTLIER ANALYSIS
• Data cleaning routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data. Data
cleaning is usually performed as an iterative two-step process consisting
of discrepancy detection and data transformation.
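As a sketch of the discrepancy-detection step, one common rule (among several) flags values lying more than 1.5 × IQR beyond the quartiles as suspicious, leaving the decision of what to do with them to the transformation step:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 340])  # 340 is suspect
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Flag anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] for human review.
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)  # flags 340
```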
TERMINAL QUESTIONS
3. Is it possible to detect missing values from a data set? If yes, then how?
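A sketch of one possible answer using pandas: missing entries are represented as NaN/None, which isna() detects cell by cell:

```python
import pandas as pd

df = pd.DataFrame({"income": [50.0, None, 40.0], "age": [25, 30, None]})
print(df.isna())        # boolean mask of missing cells
print(df.isna().sum())  # count of missing values per column
```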
Reference Books:
1. Jiawei Han, Micheline Kamber & Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, Elsevier, 3rd Edition.
Sites and Web links:
2. https://www.knowledgehut.com/blog/data-science/data-cleaning#what-is-data-cleaning-in-data-science
THANK YOU
Team – DAV