Chapter 3 Data Preparation
Outline
• Data cleansing
• Data integration
• Data reduction
• Data transformation
Data Collection for Mining
• Data mining requires collecting a large amount of data (available in
data warehouses or databases) to achieve the intended objective.
– Data mining starts by understanding the business or problem domain in
order to gain business knowledge.
– Business knowledge guides the process towards useful results and
makes it possible to recognize which results are useful.
– Based on this business knowledge, data related to the business problem
are identified in the database/data warehouse for mining.
• Before feeding data to a data mining algorithm, we have to verify the
quality of the data.
9/25/2020
Data Mining
Data Quality Measures
Well-accepted multidimensional measures of data quality are the
following:
Accuracy (free from errors and outliers)
Completeness (no missing attributes and values)
Consistency (no inconsistent values and attributes)
Timeliness (the data are current enough for the purpose for which
they are required)
Believability (acceptability)
Interpretability (easy to understand)
Most real-world data are of poor quality; that is:
Incomplete, Inconsistent, Noisy, Invalid, Redundant, …
Data is often of low quality
▪ Collecting the required data is challenging
▫ In addition to being heterogeneous and distributed, real-world data
is low in quality.
▪ Why?
▫ You didn’t collect it yourself
▫ It was probably created for some other use, and then you came
along wanting to integrate it
▫ People make mistakes (typos)
▫ People are too busy (“this is good enough”) to organize data
systematically and carefully in structured formats
Types of problems with data
Case study: Government Agency Data
▪ What we want:
• How do we prepare enough complete, good-quality data for mining?
Producing good-quality data requires passing it through
different data preprocessing tasks
Major Tasks in Data Preprocessing
• Data cleansing: to get rid of bad data
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of data from multiple sources, such as databases, data
warehouses, or files
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume, yet produces almost the same results.
– Dimensionality reduction
– Numerosity/size reduction
– Data compression
• Data transformation
– Normalization
– Discretization and/or Concept hierarchy generation
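The two transformation steps named above can be sketched as follows. This is a minimal illustration, assuming min-max scaling for normalization and equal-width binning for discretization; the function names and the sample ages are hypothetical and not taken from the course material.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they fall in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values equal: map everything to new_min
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, 30, 40, 60]               # hypothetical sample data
normalized = min_max_normalize(ages)  # values rescaled into [0, 1]
bins = equal_width_bins(ages, 2)      # two intervals: [20, 40) and [40, 60]
```

Other normalization schemes (for example z-score) and other binning schemes (for example equal-frequency) follow the same pattern with a different formula.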
Data Cleaning: Redundancy
• Duplicate or redundant data is a data problem that
requires data cleaning
▪ What’s wrong here?
• The dataset may contain only aggregate data. E.g.: a traffic police
car accident report
– this many accidents occurred on this day in this sub-city
No. of Accidents   Date       Address
3                  05/05/13   Finfinne, Oromia
2                  02/05/13   Sumale
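A minimal sketch of how duplicate records might be detected, assuming that two records count as duplicates when their key fields match after trimming whitespace and ignoring case. The function name and the customer records are hypothetical.

```python
def find_duplicates(records, key_fields):
    """Return (unique_records, duplicates), comparing on normalized key fields."""
    seen = set()
    unique, duplicates = [], []
    for rec in records:
        # Normalize case and surrounding whitespace so trivial variants match.
        key = tuple(str(rec[f]).strip().lower() for f in key_fields)
        if key in seen:
            duplicates.append(rec)
        else:
            seen.add(key)
            unique.append(rec)
    return unique, duplicates


# Hypothetical customer records; the second row repeats the first.
customers = [
    {"name": "Abebe Kebede", "city": "Adama"},
    {"name": "abebe kebede ", "city": "Adama"},
    {"name": "Sara Ali", "city": "Jimma"},
]
unique, duplicates = find_duplicates(customers, ["name", "city"])
```

Exact matching after normalization only catches simple repeats; near-duplicates with spelling differences need approximate matching, which this sketch does not attempt.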
Data Cleaning: Missing Data
• Data is not always available; some records lack attribute values. E.g.,
Occupation=“ ”
• Many tuples have no recorded value for several attributes
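A minimal sketch of how missing values such as Occupation=“ ” might be counted and filled. Filling with the most frequent known value is only one possible strategy and is an assumption here; the helper names and the records are hypothetical.

```python
from collections import Counter


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def missing_counts(records, fields):
    """Count how many records have no usable value for each field."""
    return {f: sum(1 for r in records if is_missing(r.get(f))) for f in fields}


def fill_with_mode(records, field):
    """Replace missing values of `field` with the most frequent known value."""
    known = [r[field] for r in records if not is_missing(r.get(field))]
    if not known:
        return records  # nothing to learn a fill value from
    mode = Counter(known).most_common(1)[0][0]
    return [dict(r, **{field: mode}) if is_missing(r.get(field)) else r
            for r in records]


# Hypothetical tuples; two of them have no recorded occupation.
people = [
    {"name": "A", "occupation": "teacher"},
    {"name": "B", "occupation": " "},
    {"name": "C", "occupation": "teacher"},
    {"name": "D", "occupation": "nurse"},
    {"name": "E", "occupation": None},
]
missing = missing_counts(people, ["occupation"])
filled = fill_with_mode(people, "occupation")
```

Counting first shows how severe the problem is for each attribute before any fill strategy is chosen.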
Data Cleaning: How to catch Noisy Data
▪ Manually check all data: tedious and often infeasible
▪ Sort data by frequency
▫ ‘green’ is more frequent than ‘rgeen’
▫ Works well for categorical data
▪ Use, say, numerical constraints to catch corrupt data
▫ Weight can’t be negative
▫ People can’t have more than 2 parents
▫ Salary can’t be less than Birr 300
▪ Use statistical techniques to catch corrupt data
▫ Check for outliers (the case of the 8-meter man)
▫ Check for correlated outliers using n-gram (“pregnant male”)
▫ People can be male
▫ People can be pregnant
▫ People can’t be male AND pregnant
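The frequency check and the constraint checks listed above can be sketched as follows. The rules come from the slide; the field names, the threshold for "rare", and the sample records are assumptions made for illustration, and the statistical outlier check is not covered here.

```python
from collections import Counter


def rare_values(values, max_count=1):
    """Return categorical values seen at most max_count times (possible typos)."""
    counts = Counter(values)
    return sorted(v for v, c in counts.items() if c <= max_count)


def violates_constraints(person):
    """Return the list of rule violations found in one record."""
    problems = []
    if person["weight_kg"] < 0:
        problems.append("negative weight")
    if person["salary_birr"] < 300:
        problems.append("salary below Birr 300")
    # Each value is fine alone; only the combination is impossible.
    if person["sex"] == "male" and person["pregnant"]:
        problems.append("male AND pregnant")
    return problems


colors = ["green", "green", "rgeen", "red", "red", "green"]
suspect_colors = rare_values(colors)

# Hypothetical records: the first is clean, the others break rules.
people = [
    {"weight_kg": 70, "salary_birr": 5000, "sex": "female", "pregnant": True},
    {"weight_kg": -5, "salary_birr": 5000, "sex": "male", "pregnant": False},
    {"weight_kg": 80, "salary_birr": 200, "sex": "male", "pregnant": True},
]
report = [violates_constraints(p) for p in people]
```

The last rule shows why correlated checks matter: a record can pass every single-attribute test and still be corrupt.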
Data Integration
▪ Data integration combines data from multiple sources (databases,
data warehouses, files, and sometimes non-electronic sources)
into a coherent store
▪ Because of the use of different sources, data that is fine on its own
may become problematic when we want to integrate it.
▪ Some of the issues are:
▫ Different formats and structures
▫ Conflicting and redundant data
▫ Data at different levels
Data Integration: Formats
Not everyone uses the same format. Do you agree?
Schema integration: e.g., A.cust-id ≡ B.cust-# (the same attribute under different names)
Integrate metadata from different sources
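A minimal sketch of schema integration, assuming each source comes with a hand-written mapping from its local attribute names to a shared schema. The mapping, the shared name `customer_id`, and the sample records are hypothetical; only the attribute names `cust-id` and `cust-#` come from the slide.

```python
# Hypothetical metadata: per-source mapping from local names to a shared schema.
SCHEMA_MAP = {
    "A": {"cust-id": "customer_id", "name": "name"},
    "B": {"cust-#": "customer_id", "city": "city"},
}


def to_shared_schema(record, source):
    """Rename a record's fields according to the mapping for its source."""
    mapping = SCHEMA_MAP[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def integrate(records_a, records_b):
    """Merge records from A and B that share the same customer_id."""
    merged = {}
    for source, records in (("A", records_a), ("B", records_b)):
        for rec in records:
            row = to_shared_schema(rec, source)
            merged.setdefault(row["customer_id"], {}).update(row)
    return merged


source_a = [{"cust-id": 101, "name": "Sara"}]
source_b = [{"cust-#": 101, "city": "Jimma"}]
combined = integrate(source_a, source_b)
```

In this sketch a later source silently overwrites an earlier one when both supply the same attribute; a real integration step would need an explicit rule for such conflicts.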
Data Reduction Strategies
▪ Data reduction: Obtain a reduced representation of the data set
that is much smaller in volume yet produces the same (or almost
the same) analytical results
▪ Why data reduction? A database/data warehouse may store terabytes
of data. Complex data analysis may take a very long time to run on
the complete data set.
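One simple form of numerosity reduction is random sampling. The sketch below, which assumes simple random sampling without replacement, compares the mean of a 1% sample with the mean of the full data set; the data, the sampling fraction, and the function names are hypothetical.

```python
import random


def simple_random_sample(data, fraction, seed=0):
    """Draw a sample without replacement holding about `fraction` of the data."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    k = max(1, round(len(data) * fraction))
    return rng.sample(data, k)


def mean(values):
    return sum(values) / len(values)


population = [i % 100 for i in range(100_000)]  # hypothetical attribute values
sample = simple_random_sample(population, 0.01)

full_mean = mean(population)   # computed on all 100,000 values
sample_mean = mean(sample)     # computed on 1,000 values only
```

The sample is 100 times smaller, while its mean typically stays close to the full mean, which is the sense in which reduced data can produce almost the same analytical result.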
Data Mining Main Tasks
DM Task: Predictive Modeling
▪ A predictive model makes a prediction/forecast about data values
using known results found in historical data.
▪ Prediction methods use existing variables to predict unknown or
future values of other variables.
▪ Predict one variable Y given a set of other variables X. Here X could
be an n-dimensional vector.
▪ In effect, this is function approximation: learning the
relationship between Y and X.
▪ There are many algorithms for predictive modeling in statistics and
machine learning, including classification, regression, etc.
Prediction Problems: Classification vs. Numeric Prediction
▪ Classification
▫ Predicts categorical class labels (discrete or nominal)
▫ Constructs a model from the training set and the values (class
labels) of a classifying attribute, then uses it to classify new data
▪ Numeric Prediction
▫ Models continuous-valued functions, i.e., predicts unknown or
missing values
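The difference between the two prediction problems can be illustrated with one simple method applied in both ways. The sketch below uses k-nearest neighbours, which is chosen here only for brevity and is not named in the slides; the training data and function names are hypothetical.

```python
from collections import Counter


def nearest(train, x, k):
    """Return the k training pairs whose input is closest to x."""
    return sorted(train, key=lambda pair: abs(pair[0] - x))[:k]


def classify(train, x, k=3):
    """Classification: predict the majority class label among the k neighbours."""
    labels = [label for _, label in nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]


def predict_number(train, x, k=3):
    """Numeric prediction: predict the mean target value of the k neighbours."""
    targets = [y for _, y in nearest(train, x, k)]
    return sum(targets) / len(targets)


# Hypothetical training sets keyed by income (thousand Birr).
credit_class = [(2, "deny"), (3, "deny"), (4, "deny"),
                (10, "approve"), (11, "approve"), (12, "approve")]
loan_amount = [(2, 5.0), (3, 6.0), (4, 7.0),
               (10, 20.0), (11, 22.0), (12, 24.0)]

label = classify(credit_class, 11)         # a discrete class label
amount = predict_number(loan_amount, 11)   # a continuous value
```

Both functions look up the same neighbours; only the final step differs, a vote for a label versus an average of numbers.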
Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior
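As a minimal sketch of "predict future values", the snippet below forecasts the next observation as the mean of the most recent ones. A moving average is only one very simple baseline and is an assumption here; the prices and the function name are hypothetical.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    recent = series[-window:]
    return sum(recent) / window


closing_prices = [10.0, 11.0, 12.0, 13.0, 14.0]  # hypothetical daily prices
next_price = moving_average_forecast(closing_prices)
```

Because it averages past values, this baseline lags behind a steady trend, which is one reason more elaborate time series models exist.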
Models and Patterns
▪ Model = abstract representation of the given training data
e.g., a very simple linear model structure
Y = aX + b
▫ a & b are parameters determined from the data
▫ Y = aX + b is the model structure
▫ Y = 0.9X + 0.3 is a particular model
▪ Pattern represents “local structure” in a dataset
▫ E.g., if X > x then Y > y with probability p
▪ Example: Given a finite sample of <x, f(x)> pairs, create a model that
holds for future values.
To guess the true function f, find some pattern (called a
hypothesis) in the training examples, and assume that the
pattern will hold for future examples too.
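To make "parameters determined from the data" concrete, the sketch below estimates a and b for the model structure Y = aX + b by ordinary least squares. The fitting method is an assumption (the slide does not say how the parameters are found), and the sample points are hypothetical, generated from the slide's particular model Y = 0.9X + 0.3 without noise.

```python
def fit_line(xs, ys):
    """Estimate a and b in Y = aX + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


# Hypothetical finite sample of <x, f(x)> pairs.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.9 * x + 0.3 for x in xs]

a, b = fit_line(xs, ys)
prediction = a * 4.0 + b  # apply the fitted model to an unseen X
```

The fitted pair (a, b) is the hypothesis; using it at X = 4, a value not in the sample, is exactly the assumption that the pattern holds for future examples.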
Predictive Modeling: Customer Scoring
▪ Customer data
▫ History of transactions with the bank
▫ Other credit data (obtained from Experian, etc)
▫ Demographic data on the customer or where they live
▪ Techniques
▫ Binary classification: logistic regression, decision trees, etc
▫ Many, many applications of this nature
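A minimal sketch of binary classification for customer scoring, using logistic regression trained by plain batch gradient descent. The two features, the labels, the learning rate, and the number of epochs are all hypothetical choices for illustration; a real scoring model would use far more data and a tested library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def score(weights, bias, x):
    """Return the estimated probability that the customer is a good risk."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical customers: [years with bank, missed payments]; label 1 = good risk.
X = [[1, 4], [2, 3], [1, 5], [6, 0], [7, 1], [8, 0]]
y = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic(X, y)
good = score(weights, bias, [9, 0])  # long history, no missed payments
bad = score(weights, bias, [1, 6])   # short history, many missed payments
```

The output is a probability between 0 and 1, so the bank can rank customers by score or apply a cut-off to obtain a yes/no decision.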
DM Task: Descriptive Modeling
• The goal is to build a “descriptive” model that models the underlying
observations
Example of Descriptive Modeling
Pattern (Association Rule) Discovery
Example of Pattern Discovery
▪ Example in retail: from customer transactions to consumer behavior:
▫ People who bought “Da Vinci Code” also bought “The Five People
You Meet in Heaven” (www.amazon.com)
▪ Example: football player behavior
▫ If player A is in the game, player B’s scoring rate increases from
25% chance per game to 95% chance per game
▪ What about the following?
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBB
DBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBBBBCCAC
DABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBA
ACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABACBDDDCBAD
CBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDD
DADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADA
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDAD
ABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCB
AAADCADDADAABBACCBB
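For a raw symbol sequence like the one above, patterns are hard to see by eye, but a program can at least count which short substrings occur most often. The sketch below counts overlapping substrings of length k in the first line of the sequence; the choice of k = 3 and the function name are assumptions, and this is only a first step, not a full pattern-discovery algorithm.

```python
from collections import Counter


def kgram_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# First line of the sequence shown above.
sequence = "ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBB"

counts = kgram_counts(sequence, 3)
top_five = counts.most_common(5)  # the five most frequent 3-symbol substrings
```

Substrings that appear far more often than chance would suggest are candidates for a pattern worth examining further.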
Basic Data Mining algorithms
Question?