DSV-S8 Data Cleaning

The session aims to equip participants with skills for effective data preprocessing and cleaning, focusing on identifying data quality issues and applying cleaning techniques. It highlights the importance of data quality in analysis and outlines major tasks in data preprocessing, including data cleaning, integration, transformation, reduction, and discretization. Key data cleaning tasks include handling missing values, correcting inconsistencies, and identifying outliers to ensure accurate and reliable data for analysis.

Department of AI&DS

COURSE NAME: DATA SCIENCE AND VISUALIZATION


COURSE CODE: 22AD3206A
Topic: Data Cleaning

Session - 08

AIM OF THE SESSION

To equip participants with the knowledge and skills necessary to effectively preprocess and clean datasets in preparation for analysis.

INSTRUCTIONAL OBJECTIVES

1. Identify Data Quality Issues
2. Understand Data Cleaning Techniques
3. Apply Data Cleaning Procedures

LEARNING OUTCOMES

1. Understanding of Data Quality Issues
2. Proficiency in Data Cleaning Techniques
3. Application of Data Cleaning Tools and Methods
4. Ability to Evaluate Data Quality
DATA PREPROCESSING

• Data in the real world is dirty:
  • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; e.g., occupation = “ ”
  • Noisy: containing errors or outliers; e.g., Salary = “-10”
  • Inconsistent: containing discrepancies in codes or names; e.g., Age = “42” but Birthday = “03/07/2005”

DIRTY DATA COMES FROM
• Incomplete data comes from
  • “n/a” data values when collected
  • different considerations between the time when the data was collected and when it is analyzed
  • human/hardware/software problems
• Noisy data comes from the process of data
  • collection
  • entry
  • transmission
• Inconsistent data comes from
  • different data sources
  • functional dependency violations

IMPORTANCE OF DATA PREPROCESSING

• No quality data, no quality mining results!
  • Quality decisions must be based on quality data
    • e.g., duplicate or missing data may cause incorrect or even misleading statistics
  • A data warehouse needs consistent integration of quality data
• “Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.” —Bill Inmon

MULTI-DIMENSIONAL MEASURE OF DATA QUALITY

A well-accepted multidimensional view:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility

Broad categories:
• Intrinsic, Contextual, Representational, and Accessibility

MAJOR TASKS IN DATA PREPROCESSING
A. Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
B. Data integration
• Integration of multiple databases, data cubes, or files
C. Data transformation
• Normalization and aggregation
D. Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
E. Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
FORMS OF DATA PREPROCESSING

[Figure: forms of data preprocessing]
DATA CLEANING

• Importance
  • “Data cleaning is one of the three biggest problems in data warehousing” —Ralph Kimball
  • “Data cleaning is the number one problem in data warehousing” —DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration

DATA CLEANING TASKS

1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data

1. DATA ACQUISITION

• Data can be in a DBMS
  • ODBC, JDBC protocols
• Data in a flat file
  • Fixed-column format
  • Delimited format: tab, comma “,”, other
  • E.g., C4.5 and Weka “arff” use comma-delimited data
  • Attention: convert field delimiters inside strings
  • Verify the number of fields before and after (as the sketch below does)
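A minimal sketch of that verification step, assuming a comma-delimited file named sample.csv (a hypothetical name): Python's csv module respects delimiters inside quoted strings while we count fields per row.

```python
import csv

# Read a comma-delimited file and verify the field count on every row.
# "sample.csv" is a hypothetical file name used for illustration.
with open("sample.csv", newline="") as f:
    reader = csv.reader(f)  # csv handles commas inside quoted strings
    header = next(reader)
    expected = len(header)
    for line_no, row in enumerate(reader, start=2):
        if len(row) != expected:
            print(f"line {line_no}: expected {expected} fields, got {len(row)}")
```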

METADATA
• Field types:
  • binary, nominal (categorical), ordinal, numeric, …
• For nominal fields: tables translating codes to full
descriptions
• Field role:
• input : inputs for modelling
• target : output
• id/auxiliary : keep, but not use for modelling
• ignore : don’t use for modelling
• weight : instance weight
• …
• Field descriptions
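One lightweight way to record the field types and roles above is a plain dictionary; a sketch with hypothetical field names:

```python
# A sketch of per-field metadata; field names here are hypothetical.
metadata = {
    "customer_id": {"type": "nominal", "role": "id"},      # keep, but don't model
    "gender":      {"type": "binary",  "role": "input"},
    "grade":       {"type": "ordinal", "role": "input"},
    "income":      {"type": "numeric", "role": "input"},
    "churned":     {"type": "binary",  "role": "target"},  # output to predict
}

# Select only the fields used as modelling inputs
inputs = [f for f, m in metadata.items() if m["role"] == "input"]
```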
REFORMATTING

• Convert data to a standard format (e.g., arff or csv)
• Missing values
• Unified date format
• Binning of numeric data
• Fix errors and outliers
• Convert nominal fields whose values have order to numeric

2. FILL IN MISSING VALUES
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
• Missing data may be due to
• Equipment malfunction
• Inconsistent with other recorded data and thus deleted
• Data not entered due to misunderstanding
• Certain data may not be considered important at the time of entry
• History or changes of the data were not registered
• Missing data may need to be inferred.

HANDLING MISSING DATA

• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Imputation: use the attribute mean to fill in the missing value, or use the attribute mean for all samples belonging to the same class: smarter
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
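A minimal sketch of the simpler strategies above using pandas (column names and values are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "income": [50_000, np.nan, 62_000, np.nan, 58_000],
    "occupation": ["clerk", None, "engineer", "clerk", None],
})

# Ignore the tuple: drop rows with any missing value
dropped = df.dropna()

# Global constant: fill nominal gaps with "unknown"
df["occupation"] = df["occupation"].fillna("unknown")

# Imputation: fill numeric gaps with the attribute mean
df["income"] = df["income"].fillna(df["income"].mean())
```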

3. UNIFIED DATE FORMAT

• We want to transform all dates to the same format internally
• Some systems accept dates in many formats
  • e.g., “Sep 24, 2003”, 9/24/03, 24.09.03, etc.
  • dates are transformed internally to a standard value
• Frequently, just the year (YYYY) is sufficient
• For more details, we may need the month, the day, the hour, etc.
• Representing a date as YYYYMM or YYYYMMDD can be OK, but has problems

UNIFIED DATE FORMAT OPTIONS

• To preserve intervals, we can use:
  • Unix system date: number of seconds since 1970
  • Number of days since Jan 1, 1960 (SAS)
• Problem:
  • values are non-obvious
  • don’t help intuition and knowledge discovery
  • harder to verify, easier to make an error
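A small parsing sketch, assuming the mixed input formats shown earlier: each date is normalized to a YYYYMMDD string and, to preserve intervals, a Unix timestamp.

```python
from datetime import datetime

# Hypothetical input dates in the mixed formats shown above
raw = ["Sep 24, 2003", "9/24/03", "24.09.03"]

# Candidate formats to try, in order
FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y"]

def parse_date(s):
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {s!r}")

for s in raw:
    d = parse_date(s)
    # Standard internal forms: YYYYMMDD string, or seconds since 1970
    # (timestamp() interprets a naive datetime in local time)
    print(d.strftime("%Y%m%d"), int(d.timestamp()))
```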

4. CONVERSION NOMINAL TO NUMERIC

• Some tools can deal with nominal values internally
• Other methods (neural nets, regression, nearest neighbor) require only numeric inputs
• To use nominal fields in such methods, we need to convert them to numeric values
• Q: Why not ignore nominal fields altogether?
• A: They may contain valuable information
• Different strategies apply for binary, ordered, and multi-valued nominal fields

CONVERSION BINARY TO NUMERIC

• Binary fields
  • E.g., Gender = M, F
• Convert to a Field_0_1 with 0, 1 values
  • e.g., Gender = M → Gender_0_1 = 0
  • Gender = F → Gender_0_1 = 1
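In pandas this mapping is a one-liner; a sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "F", "M"]})  # hypothetical data
df["Gender_0_1"] = df["Gender"].map({"M": 0, "F": 1})
```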

CONVERSION ORDERED TO NUMERIC

• Ordered attributes (e.g., Grade) can be converted to numbers preserving the natural order, e.g.
  • A → 4.0
  • A- → 3.7
  • B+ → 3.3
  • B → 3.0
• Q: Why is it important to preserve natural order?
• A: To allow meaningful comparisons, e.g., Grade > 3.5
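The same idea as a sketch (data hypothetical); once mapped, comparisons such as Grade > 3.5 become meaningful:

```python
import pandas as pd

grade_points = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}  # order-preserving map
df = pd.DataFrame({"Grade": ["A", "B+", "B", "A-"]})       # hypothetical data
df["Grade_num"] = df["Grade"].map(grade_points)

honors = df[df["Grade_num"] > 3.5]  # meaningful comparison on the numeric scale
```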

5. IDENTIFY OUTLIERS AND SMOOTH OUT NOISY DATA

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to:
  • Faulty data collection instruments
  • Data entry problems
  • Data transmission problems
  • Technology limitations
  • Inconsistency in naming conventions
• Other data problems that require data cleaning:
  • Duplicate records
  • Incomplete data
  • Inconsistent data

HANDLING NOISY DATA

• Binning method:
  • first sort the data and partition it into (equi-depth) bins
  • then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering:
  • detect and remove outliers
• Combined computer and human inspection:
  • detect suspicious values and check them by a human
• Regression:
  • smooth by fitting the data to regression functions

SIMPLE DISCRETIZATION METHODS: BINNING
• Equal-width (distance) partitioning:
  • Divides the range into N intervals of equal size: a uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  • The most straightforward approach
  • But outliers may dominate the presentation
  • Skewed data is not handled well
• Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples
  • Good data scaling
  • Managing categorical attributes can be tricky
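Both partitioning schemes have direct pandas counterparts; a sketch using the price data from the worked example below, where pd.cut produces equal-width bins and pd.qcut equal-depth bins:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

equal_width = pd.cut(prices, bins=3)   # 3 intervals of width (34 - 4) / 3 = 10
equal_depth = pd.qcut(prices, q=3)     # 3 intervals with ~4 samples each
```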

BINNING METHODS FOR DATA SMOOTHING
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
  • Bin 1: 4, 8, 9, 15
  • Bin 2: 21, 21, 24, 25
  • Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
  • Bin 1: 9, 9, 9, 9
  • Bin 2: 23, 23, 23, 23
  • Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
  • Bin 1: 4, 4, 4, 15
  • Bin 2: 21, 21, 25, 25
  • Bin 3: 26, 26, 26, 34
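A sketch reproducing the smoothing-by-bin-means step (the slide rounds the means 22.75 and 29.25 to 23 and 29):

```python
# Equi-depth bins from the example above
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Smooth by bin means: replace each value with its bin's (rounded) mean
smoothed = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(smoothed)  # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
```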

DATA SMOOTHING – REGRESSION
• Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
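A minimal sketch of regression-based smoothing with NumPy, on hypothetical noisy measurements: fit a least-squares line, then replace each observed y with the line's prediction.

```python
import numpy as np

# Hypothetical noisy measurements
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.3])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
y_smoothed = slope * x + intercept          # smoothed values on the line
```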

DATA SMOOTHING – OUTLIER ANALYSIS

• Outliers may be detected by clustering, for example, where similar values are organized into groups, or “clusters.”
• Intuitively, values that fall outside of the set of clusters may be considered outliers.
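A sketch of this idea with scikit-learn's KMeans on hypothetical 1-D values: after clustering, points that land in very small clusters fall outside the main groups and are outlier candidates.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical values; 99.0 sits far from the two natural groups
values = np.array([4, 5, 6, 5, 21, 22, 23, 22, 99.0]).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
labels, counts = np.unique(km.labels_, return_counts=True)

# Points landing in singleton clusters are outlier candidates
small = labels[counts <= 1]
outliers = values.ravel()[np.isin(km.labels_, small)]
print(outliers)  # expect [99.]
```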
27
6. CORRECT INCONSISTENT DATA

Inconsistent data can arise from errors in data entry, different conventions used by different data sources, or changes in data formats over time. Data inconsistency can lead to inaccuracies in analysis and modelling. Techniques such as data cleaning, standardization, and validation can help identify and correct inconsistencies in the dataset. This may involve tasks like correcting spelling errors, reconciling conflicting information, or converting units of measurement to a consistent scale.
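A small standardization sketch (all names and values hypothetical): normalize inconsistent category labels and convert mixed units to one scale.

```python
import pandas as pd

# Hypothetical records using inconsistent conventions
df = pd.DataFrame({
    "country": ["USA", "U.S.A.", "United States", "usa"],
    "height":  [72.0, 180.0, 175.0, 70.0],
    "unit":    ["in", "cm", "cm", "in"],
})

# Standardize category labels to one convention
country_map = {"usa": "US", "u.s.a.": "US", "united states": "US"}
df["country"] = df["country"].str.lower().map(country_map)

# Convert all heights to centimetres
df.loc[df["unit"] == "in", "height"] *= 2.54
df["unit"] = "cm"
```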
SELF-ASSESSMENT QUESTIONS

1. Point out the correct statement.
   A) Data has only qualitative value
   B) Data has only quantitative value
   C) Data has both qualitative and quantitative value
   D) None of the mentioned

2. Which of the following is true about outliers?
   A) Data points that deviate a lot from normal observations
   B) Can reduce the accuracy of the model
   C) Both A & B
   D) None of the mentioned
SELF-ASSESSMENT QUESTIONS

3. What are some examples of data quality problems?
   A) Noise and outliers
   B) Duplicate data
   C) Missing values
   D) All of the mentioned

4. Which of the following is an example of raw data?
   A) Original swath files generated from a sonar system
   B) Initial time-series file of temperature values
   C) A real-time GPS-encoded navigation file
   D) All of the mentioned
SUMMARY

• Data quality is defined in terms of accuracy, completeness, consistency, timeliness, believability, and interpretability. These qualities are assessed based on the intended use of the data.

• Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Data cleaning is usually performed as an iterative two-step process consisting of discrepancy detection and data transformation.

TERMINAL QUESTIONS

1. Why is Data Cleaning So Important?

2. Describe the data cleaning process in detail.

3. Is it possible to detect missing values from a data set? If yes, then how?

4. What is binning? How does it help in data visualization and analysis?


REFERENCES

Reference Books:
1. Jiawei Han, Micheline Kamber & Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, Elsevier, 3rd Edition.

Sites and Web Links:
2. https://www.knowledgehut.com/blog/data-science/data-cleaning#what-is-data-cleaning-in-data-science
THANK YOU

Team – DAV
