DSV-S8 Data Cleaning
Session - 08
AIM OF THE SESSION
To equip participants with the knowledge and skills necessary to effectively preprocess and clean datasets in preparation for analysis.
INSTRUCTIONAL OBJECTIVES
LEARNING OUTCOMES
DIRTY DATA COMES FROM
• Incomplete data comes from
• "n/a" values recorded at collection time
• differing assumptions between the time the data was collected and the time it is analyzed
• human/hardware/software problems
• Noisy data comes from the process of data
• collection
• entry
• transmission
• Inconsistent data comes from
• Different data sources
• Functional dependency violation
IMPORTANCE OF DATA PREPROCESSING
MULTI-DIMENSIONAL MEASURE OF DATA QUALITY
MAJOR TASKS IN DATA PREPROCESSING
A. Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
B. Data integration
• Integration of multiple databases, data cubes, or files
C. Data transformation
• Normalization and aggregation
D. Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
E. Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
FORMS OF DATA PREPROCESSING
DATA CLEANING
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
DATA CLEANING TASKS
1. DATA ACQUISITION
METADATA
• Field types:
• binary, nominal (categorical), ordinal, numeric, …
• For nominal fields: tables translating codes to full
descriptions
• Field role:
• input : inputs for modelling
• target : output
• id/auxiliary : keep, but not use for modelling
• ignore : don’t use for modelling
• weight : instance weight
• …
• Field descriptions
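As a minimal sketch (with hypothetical column names), the field roles above can be recorded explicitly, so preprocessing code knows which columns to model, which to keep as identifiers, and which to ignore:

```python
import pandas as pd

# Hypothetical dataset with one column per field role.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],       # id/auxiliary: keep, don't model
    "gender": ["M", "F", "F"],            # nominal input
    "grade": ["A", "B+", "A-"],           # ordinal input
    "income": [52000.0, None, 61000.0],   # numeric input
    "churned": [0, 1, 0],                 # target
})

# Field roles, declared explicitly rather than inferred.
roles = {
    "customer_id": "id",
    "gender": "input",
    "grade": "input",
    "income": "input",
    "churned": "target",
}

# Select only the columns meant for modelling.
input_cols = [c for c, r in roles.items() if r == "input"]
print(df[input_cols].dtypes)
```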
REFORMATTING
2. FILL IN MISSING VALUES
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
• Missing data may be due to
• Equipment malfunction
• Inconsistent with other recorded data and thus deleted
• Data not entered due to misunderstanding
• Certain data may not be considered important at the time of entry
• History or changes of the data were not recorded
• Missing data may need to be inferred.
HANDLING MISSING DATA
• Ignore the tuple: usually done when the class label is missing (assuming the task is
classification); not effective when the percentage of missing values per attribute
varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Imputation: Use the attribute mean to fill in the missing value, or use the
attribute mean for all samples belonging to the same class to fill in the missing
value: smarter
• Use the most probable value to fill in the missing value: inference-based such as
Bayesian formula or decision tree
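A minimal sketch of the fill-in strategies above, using pandas on hypothetical data (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 40.0, None, 60.0],
})

# Global constant: fill with a sentinel value; simple but can bias models.
filled_const = df["income"].fillna(-1)

# Attribute mean over all samples.
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean per class: smarter, uses the class label.
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(filled_class_mean)
```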
3. UNIFIED DATE FORMAT
UNIFIED DATE FORMAT OPTIONS
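As an illustration (a sketch with hypothetical input strings; pandas' to_datetime is one common tool for this), each source format is parsed explicitly and re-rendered in a single ISO format:

```python
import pandas as pd

# The same date written three different ways.
raw = pd.Series(["2023-01-15", "15/01/2023", "Jan 15, 2023"])

# Parse each representation explicitly, then render one unified format.
parsed = pd.concat([
    pd.to_datetime(raw[[0]], format="%Y-%m-%d"),
    pd.to_datetime(raw[[1]], format="%d/%m/%Y"),
    pd.to_datetime(raw[[2]], format="%b %d, %Y"),
])
unified = parsed.dt.strftime("%Y-%m-%d")
print(unified.tolist())  # ['2023-01-15', '2023-01-15', '2023-01-15']
```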
4. CONVERSION NOMINAL TO NUMERIC
• Some tools can deal with nominal values internally
• Other methods (neural nets, regression, nearest neighbor)
require only numeric inputs
• To use nominal fields in such methods, we need to convert them
to numeric values
• Q: Why not ignore nominal fields altogether?
• A: They may contain valuable information
• Different strategies for binary, ordered, multi-valued nominal
fields
CONVERSION BINARY TO NUMERIC
• Binary fields
• E.g. Gender = M, F
• Convert to Field_0_1 with 0, 1 values
• e.g. Gender = M → Gender_0_1 = 0
• Gender = F → Gender_0_1 = 1
CONVERSION ORDERED TO NUMERIC
• Ordered attributes (e.g. Grade) can be converted to numbers
preserving natural order, e.g.
• A → 4.0
• A- → 3.7
• B+ → 3.3
• B → 3.0
• Q: Why is it important to preserve natural order?
• A: To allow meaningful comparisons, e.g. Grade > 3.5
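A minimal sketch of these strategies with pandas, on hypothetical fields; one-hot encoding (get_dummies) is a common choice for multi-valued nominals, since arbitrary integer codes would invent a false order:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],            # binary nominal
    "grade": ["A", "B+", "A-", "B"],           # ordered nominal
    "city": ["Delhi", "Pune", "Delhi", "Goa"]  # multi-valued nominal
})

# Binary: map the two values to 0/1.
df["gender_0_1"] = df["gender"].map({"M": 0, "F": 1})

# Ordered: map to numbers that preserve the natural order,
# so comparisons such as grade > 3.5 stay meaningful.
grade_points = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}
df["grade_num"] = df["grade"].map(grade_points)

# Multi-valued: one 0/1 indicator column per value.
df = pd.get_dummies(df, columns=["city"])
print(df)
```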
5. IDENTIFY OUTLIERS AND SMOOTH OUT NOISY DATA
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitation
• Inconsistency in naming convention
• Other data problems that require data cleaning
• Duplicate records
• Incomplete data
• Inconsistent data
HANDLING NOISY DATA
• Binning method:
• first sort data and partition into (equi-depth) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human
• Regression
• smooth by fitting the data into regression functions
SIMPLE DISCRETIZATION METHODS: BINNING
• Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B-A)/N.
• The most straightforward
• But outliers may dominate presentation
• Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately the same
number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
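A sketch contrasting the two partitionings with pandas' cut (equal-width) and qcut (equal-depth), reusing the price data from the next slide:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

# Equal-width: N intervals of width W = (B - A) / N.
equal_width = pd.cut(prices, bins=N)

# Equal-depth: N intervals with roughly the same number of samples each.
equal_depth = pd.qcut(prices, q=N)

print(equal_width.value_counts().sort_index())  # counts vary per bin
print(equal_depth.value_counts().sort_index())  # 4 samples per bin
```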
BINNING METHODS FOR DATA SMOOTHING
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
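A minimal sketch reproducing this smoothing in plain Python (rounding bin means to integers, as the slide does):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of min/max.
by_bounds = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```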
DATA SMOOTHING – REGRESSION
• Linear regression involves
finding the “best” line to fit
two attributes (or variables), so
that one attribute can be used
to predict the other.
• Multiple linear regression is an
extension of linear regression,
where more than two attributes
are involved and the data are
fit to a multidimensional
surface.
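A minimal sketch of regression-based smoothing on hypothetical data: fit a least-squares line with numpy's polyfit, then replace the noisy values with the fitted ones:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])  # roughly y = 2x, with noise

# Least-squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# The fitted values are the smoothed version of y.
y_smooth = slope * x + intercept
print(np.round(y_smooth, 2))
```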
DATA SMOOTHING – OUTLIER ANALYSIS
• Data cleaning routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data. Data
cleaning is usually performed as an iterative two-step process consisting
of discrepancy detection and data transformation.
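As a sketch of the discrepancy-detection step, one common rule (among several) flags values lying more than 1.5 × IQR beyond the quartiles as suspicious, leaving the decision of what to do with them to the transformation step:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 340])  # 340 is suspect
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Flag anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] for human review.
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)  # flags 340
```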
TERMINAL QUESTIONS
3. Is it possible to detect missing values from a data set? If yes, then how?
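A sketch of one possible answer using pandas: missing entries are represented as NaN/None, which isna() detects cell by cell:

```python
import pandas as pd

df = pd.DataFrame({"income": [50.0, None, 40.0], "age": [25, 30, None]})
print(df.isna())        # boolean mask of missing cells
print(df.isna().sum())  # count of missing values per column
```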
Reference Books:
1. Jiawei Han, Micheline Kamber & Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, Elsevier, 3rd Edition.
Sites and Web links:
2. https://www.knowledgehut.com/blog/data-science/data-cleaning#what-is-data-cleaning-in-data-science
THANK YOU
Team – DAV