
Chapter 2

Data Preprocessing

Eng. Ali sheak Ahmed


[email protected]
090-7731966

Data Mining: Concepts and Techniques


Outline

■ Why preprocess the data?


■ Descriptive data summarization
■ Data cleaning
■ Data integration and transformation
■ Data reduction



Why Data Preprocessing?
■ Data in the real world is dirty
■ incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
■ e.g., occupation=“ ”
■ noisy: containing errors or outliers
■ e.g., Salary=“-10”
■ inconsistent: containing discrepancies in codes
or names
■ e.g., Age=“42” Birthday=“03/07/1997”
■ e.g., Was rating “1,2,3”, now rating “A, B, C”
■ e.g., discrepancy between duplicate records
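The three symptoms above can be checked mechanically. A minimal sketch, assuming records are plain dicts (the field names and the 2010 reference year are illustrative, not from the slides):

```python
def audit(record):
    """Return a list of data-quality flags for one record."""
    flags = []
    # incomplete: an attribute present but empty, e.g. occupation=""
    if record.get("occupation", "").strip() == "":
        flags.append("incomplete: empty occupation")
    # noisy: an impossible value, e.g. Salary="-10"
    if record.get("salary", 0) < 0:
        flags.append("noisy: negative salary")
    # inconsistent: stated age disagrees with birth year
    # (2010 is an assumed reference year for the check)
    if "age" in record and "birth_year" in record:
        if abs((2010 - record["birth_year"]) - record["age"]) > 1:
            flags.append("inconsistent: age vs. birthday")
    return flags
```

Running the audit over every record before mining gives a quick picture of how dirty a data set is.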
Why Is Data Dirty?
■ Incomplete data may come from
■ “Not applicable” data value when collected
■ Different considerations between the time when the data was
collected and when it is analyzed.
■ Human/hardware/software problems
■ Noisy data (incorrect values) may come from
■ Faulty data collection instruments
■ Human or computer error at data entry
■ Errors in data transmission
■ Inconsistent data may come from
■ Different data sources
■ Functional dependency violation (e.g., modify some linked data)
■ Duplicate records also need data cleaning



Why Is Data Preprocessing Important?

■ No quality data, no quality mining results!


■ Quality decisions must be based on quality data
■ e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
■ Data warehouse needs consistent integration of quality
data
■ Data extraction, cleaning, and transformation comprise
the majority of the work of building a data warehouse



Multi-Dimensional Measure of Data Quality

■ A well-accepted multidimensional view:


■ Accuracy
■ Completeness
■ Consistency
■ Timeliness
■ Believability
■ Value added
■ Interpretability
■ Accessibility
■ Broad categories:
■ Intrinsic, contextual, representational, and accessibility



Major Tasks in Data Preprocessing

■ Data cleaning
■ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies

■ Data integration
■ Integration of multiple databases, data cubes, or files

■ Data reduction
■ Obtains a reduced representation that is smaller in volume but
produces the same or similar analytical results



Forms of Data Preprocessing

(Figure: overview of the forms of data preprocessing.)


Data Cleaning

■ Importance
■ “Data cleaning is one of the three biggest problems
in data warehousing”—Ralph Kimball
■ “Data cleaning is the number one problem in data
warehousing”—DCI survey
■ Data cleaning tasks
■ Fill in missing values
■ Identify outliers and smooth out noisy data
■ Correct inconsistent data
■ Resolve redundancy caused by data integration
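The last task often starts with dropping exact-duplicate records. A small sketch (the record layout is illustrative):

```python
def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen, unique = set(), []
    for rec in records:
        # a sorted tuple of items is a hashable fingerprint of the record
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Near-duplicates (same entity, slightly different values) need the fuzzier entity-identification techniques covered under data integration.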



How to Handle Missing Data?
■ Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when the
percentage of missing values per attribute varies considerably
■ Fill in the missing value manually: tedious + infeasible?
■ Fill it in automatically with
■ a global constant : e.g., “unknown”, a new class?!
■ the attribute mean
■ the attribute mean for all samples belonging to the same class:
smarter
■ the most probable value: inference-based such as Bayesian
formula or decision tree
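The three automatic fill-in strategies above can be sketched in a few lines of pure Python, where a column is a list with None marking missing values:

```python
from statistics import mean

def fill_constant(col, const="unknown"):
    """Global constant: every missing value becomes the same token."""
    return [const if v is None else v for v in col]

def fill_mean(col):
    """Attribute mean over the observed (non-missing) values."""
    m = mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]

def fill_class_mean(col, labels):
    """Smarter: the mean of the samples in the same class."""
    class_means = {
        c: mean(v for v, l in zip(col, labels) if l == c and v is not None)
        for c in set(labels)
    }
    return [class_means[l] if v is None else v
            for v, l in zip(col, labels)]
```

Inference-based filling (Bayesian or decision-tree prediction of the most probable value) follows the same interface but requires a trained model.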
Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments
■ data entry problems
■ data transmission problems
■ technology limitation
■ inconsistency in naming convention
■ Other data problems that require data cleaning
■ duplicate records
■ incomplete data
■ inconsistent data
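A standard way to smooth such noise is binning: sort the values, partition them into equal-depth bins, and replace each value by its bin mean. A sketch (the bin size of 3 is an assumption):

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size=3):
    """Equal-depth binning, smoothing by bin means."""
    ordered = sorted(values)
    out = []
    for i in range(0, len(ordered), bin_size):
        bin_ = ordered[i:i + bin_size]
        # every value in the bin is replaced by the bin's mean
        out.extend([mean(bin_)] * len(bin_))
    return out
```

For example, the sorted prices 4, 8, 15, 21, 21, 24, 25, 28, 34 smooth to 9, 9, 9, 22, 22, 22, 29, 29, 29. Smoothing by bin boundaries or medians follows the same pattern.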



Data Integration
■ Data integration:
■ Combines data from multiple sources into a coherent
store
■ Schema integration: e.g., A.cust-id ≡ B.cust-#
■ Integrate metadata from different sources
■ Entity identification problem:
■ Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
■ Detecting and resolving data value conflicts
■ For the same real world entity, attribute values from
different sources are different
■ Possible reasons: different representations, different
scales, e.g., metric vs. British units
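Resolving a scale conflict amounts to converting both sources to a common unit before comparing. A sketch for weights (the source layout and tolerance are illustrative; the pound-to-kilogram factor is exact):

```python
LB_PER_KG = 0.45359237  # 1 lb in kg, by definition

def to_kg(value, unit):
    """Normalize a (value, unit) measurement to kilograms."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_PER_KG
    raise ValueError(f"unknown unit: {unit}")

def merge_weight(source_a, source_b, tolerance=0.5):
    """Compare the same entity's weight from two sources in one unit."""
    a = to_kg(*source_a)
    b = to_kg(*source_b)
    if abs(a - b) <= tolerance:
        return (a + b) / 2  # values agree: keep the average
    raise ValueError("conflicting values need manual resolution")
```

The same pattern (normalize, compare within a tolerance, escalate real conflicts) applies to dates, currencies, and coded values.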



Handling Redundancy in Data Integration

■ Redundant data often occur when integrating multiple
databases
■ Object identification: The same attribute or object
may have different names in different databases
■ Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
■ Redundant attributes may be detected by correlation
analysis
■ Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
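The correlation analysis above is typically the Pearson coefficient: a value near +1 or -1 suggests one attribute is derivable from the other (e.g., annual revenue from monthly revenue). A stdlib-only sketch:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For categorical attributes, a chi-square test plays the analogous role.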



Data Reduction Strategies

■ Why data reduction?


■ A database/data warehouse may store terabytes of data
■ Complex data analysis/mining may take a very long time to run
on the complete data set
■ Data reduction
■ Obtain a reduced representation of the data set that is much
smaller in volume yet produces the same (or almost the
same) analytical results
■ Data reduction strategies
■ Data cube aggregation:
■ Dimensionality reduction — e.g., remove unimportant attributes
■ Data Compression
■ Numerosity reduction — e.g., fit data into models
■ Discretization and concept hierarchy generation
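Two of the strategies above, numerosity reduction by sampling and discretization by equal-width binning, can be sketched as follows (the sample size, bin count, and seed are illustrative):

```python
import random

def sample_without_replacement(data, n, seed=0):
    """Numerosity reduction: keep a simple random sample of n records."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(data, n)

def equal_width_bins(values, k):
    """Discretization: map each value to a bin index 0..k-1.

    Assumes the values are not all identical (nonzero range).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]
```

Dimensionality reduction and data cube aggregation follow the same goal (a much smaller representation with near-identical analytical results) but operate on attributes and aggregates rather than tuples.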



End
