
Chapter 2
Data Preparation

Outline
• Data cleansing
• Data integration
• Data reduction
• Data transformation
Data Collection for Mining
• Data mining requires collecting a great amount of data (available in
data warehouses or databases) to achieve the intended objective.
– Data mining starts by understanding the business or problem domain
in order to gain business knowledge.
– Business knowledge guides the process towards useful results, and
enables the recognition of those results that are useful.
– Based on the business knowledge, data related to the business problem
are identified from the database/data warehouse for mining.
• Before feeding data to a DM algorithm, we have to ensure the quality
of the data.
9/25/2020
Data Mining
Data Quality Measures
Well-accepted multidimensional data quality measures include the
following:
Accuracy (free from errors and outliers)
Completeness (no missing attributes and values)
Consistency (no inconsistent values and attributes)
Timeliness (appropriateness of the data for the purpose it is
required)
Believability (acceptability)
Interpretability (easy to understand)
Most real-world data is of poor quality; that is: incomplete,
inconsistent, noisy, invalid, redundant, …
Data is often of low quality
▪ Collecting the required data is challenging
▫ In addition to being heterogeneous and distributed, real-world
data is low in quality.

▪ Why?
▫ You didn’t collect it yourself
▫ It probably was created for some other use, and then you came
along wanting to integrate it
▫ People make mistakes (typos)
▫ People are busy (“this is good enough”) and do not systematically
organize data using structured formats

Types of problems with data

• Some data have problems on their own that need to be cleaned:


– Outliers: misleading data values that do not fit most of the data/facts
– Missing data: attribute values might be absent and need to be
replaced with estimates
– Irrelevant data: attributes in the database that might not be of
interest to the DM task at hand
– Noisy data: attribute values that might be invalid or incorrect, e.g.
typographical errors
– Inconsistent data, duplicate data, etc.
• Other data are problematic only when we integrate them
– Everyone had their own way of structuring and formatting data,
based on what was convenient for them
– How do we integrate data organized in different formats, following
different conventions?

Case study: Government Agency Data
▪ What we want:

ID  Name                     City     State
1   Transportation ministry  Gambela  Gambela
2   Ministry of finance      Gambela  Gambela
3   Health office            Gambela  Gambela

• How do we prepare enough complete, good-quality data for mining?
 Coming up with good-quality data requires passing through
different data preprocessing tasks

Major Tasks in Data Preprocessing
• Data cleansing: to get rid of bad data
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of data from multiple sources, such as databases, data
warehouses, or files
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume, yet produces almost the same results.
 Dimensionality reduction
 Numerosity/size reduction
 Data compression
• Data transformation
– Normalization
– Discretization and/or Concept hierarchy generation
Data Cleaning: Redundancy
• Duplicate or redundant data is a data problem that requires
data cleaning
▪ What’s wrong here?

How to clean it: manually or automatically?
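Automatic cleaning of exact duplicates is straightforward; a minimal sketch, using made-up agency records (not from the slides):

```python
# Minimal sketch: detect and drop exact duplicate records.
# The records below are hypothetical illustrations.
records = [
    (1, "Transportation ministry", "Gambela", "Gambela"),
    (2, "Ministry of finance", "Gambela", "Gambela"),
    (1, "Transportation ministry", "Gambela", "Gambela"),  # exact duplicate
]

# dict.fromkeys keeps the first occurrence of each record, in order
deduped = list(dict.fromkeys(records))
print(len(deduped))  # 2
```

Near-duplicates (same entity, slightly different spelling) are harder and usually still need manual review or fuzzy matching.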


Data Cleaning: Incomplete Data

• The dataset may lack certain attributes of interest


– Is it enough to have patients’ demographic profiles and region
addresses to predict the vulnerability (or exposure) of a given
region to a Malaria outbreak?

• The dataset may contain only aggregate data. E.g.: a traffic police
car accident report
– “this many accidents occurred on this day in this sub-city”

No. of Accidents  Date      Address
3                 05/05/13  Finfinne, Oromia
2                 02/05/13  Sumale

Data Cleaning: Missing Data
• Data is not always available; attribute values may be lacking. E.g.,
Occupation=“ ”
 many tuples have no recorded value for several attributes

What’s wrong here? A missing required field
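A quick scan for missing required fields can be sketched as follows; the rows are hypothetical and both empty strings and absent values are treated as missing:

```python
# Minimal sketch: count records with a missing required field.
# The rows are hypothetical illustrations.
rows = [
    {"name": "Abebe", "occupation": "teacher"},
    {"name": "Sara",  "occupation": ""},     # empty string
    {"name": "Chala", "occupation": None},   # no value recorded
]

# Treat both None and "" as missing
missing = sum(1 for r in rows if not r.get("occupation"))
print(missing)  # 2
```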


Data Cleaning: Missing Data

• Missing data may be due to


– values inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding, or not considered
important at the time of entry
– history or changes of the data not registered

• How to handle missing data? Missing data may need to be inferred
– Ignore the missing value: not effective when the percentage of
missing values per attribute varies considerably
– Fill in the missing value manually: tedious + infeasible?
– Fill in automatically
• calculate the most probable value, say, using the
Expectation-Maximization (EM) algorithm
Predict missing value using EM
▪ Solves estimation with incomplete data.
▫ Obtain initial estimates for the parameters using the mean value.
▫ Use the estimates to calculate values for the missing data, &
▫ The process continues iteratively until convergence ((μi - μi+1) ≤ Ө).
▪ E.g.: of six data items, the known values are {1, 5, 10, 4}; estimate
the two missing data items.
▫ Let the EM converge if two estimates differ by at most 0.05, & let
our initial guess of the two missing values = 3.
• The algorithm stops since the last two estimates are only 0.05 apart.
• Thus, our estimate for the two items is 4.97.
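The iteration above can be reproduced in a few lines; the known values {1, 5, 10, 4}, the initial guess 3, and the 0.05 threshold come from the example:

```python
def em_estimate(known, n_missing, guess, tol):
    """Iteratively re-estimate the missing values as the mean of the
    completed data set until two successive estimates differ by <= tol."""
    while True:
        mean = (sum(known) + n_missing * guess) / (len(known) + n_missing)
        if abs(mean - guess) <= tol:
            return mean
        guess = mean

estimate = em_estimate([1, 5, 10, 4], n_missing=2, guess=3.0, tol=0.05)
print(round(estimate, 2))  # 4.98 (the slide's 4.97, up to rounding)
```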
Data Cleaning: Noisy Data
• Noisy: containing noise, errors, or outliers
– e.g., Salary=“−10” (an error)
▪ Typographical errors are errors that corrupt data
▫ Let say ‘green’ is written as ‘rgeen’
• Incorrect attribute values may be due to
– faulty data collection instruments (e.g.: OCR)
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention

Data Cleaning: How to catch Noisy Data
▪ Manually check all data: tedious + infeasible?
▪ Sort data by frequency
▫ ‘green’ is more frequent than ‘rgeen’
▫ Works well for categorical data
▪ Use, say, numerical constraints to catch corrupt data
▫ Weight can’t be negative
▫ People can’t have more than 2 parents
▫ Salary can’t be less than Birr 300
▪ Use statistical techniques to Catch Corrupt Data
▫ Check for outliers (the case of the 8-meter man)
▫ Check for correlated outliers using n-gram (“pregnant male”)
▫ People can be male
▫ People can be pregnant
▫ People can’t be male AND pregnant
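The frequency and constraint checks above can be sketched as follows; the sample values are made up for illustration:

```python
from collections import Counter

# Frequency sort: a value seen only once among many repeats is a likely typo
colors = ["green", "green", "green", "blue", "blue", "rgeen"]
counts = Counter(colors)
suspects = [v for v, c in counts.items() if c == 1]
print(suspects)  # ['rgeen']

# Numeric constraints: weight can't be negative, salary can't be < Birr 300
weights = [70, -5, 80, 65]
salaries = [800, 250, 1500]
bad_weights = [w for w in weights if w < 0]
bad_salaries = [s for s in salaries if s < 300]
print(bad_weights, bad_salaries)  # [-5] [250]
```

Frequency sorting works well for categorical data; constraint checks need domain knowledge to choose sensible bounds.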
Data Integration
▪ Data integration combines data from multiple sources (database,
data warehouse, files & sometimes from non-electronic sources)
into a coherent store
▪ Because of the use of different sources, data that is fine on its own
may become problematic when we want to integrate it.
▪ Some of the issues are:
▫ Different formats and structures
▫ Conflicting and redundant data
▫ Data at different levels

Data Integration: Formats
Not everyone uses the same format. Do you agree?
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources

Dates are especially problematic:


12/19/97
19/12/97
19/12/1997
19-12-97
Dec 19, 1997
19 December 1997
19th Dec. 1997

Are you frequently writing money as:


Birr 200, Br. 200, 200 Birr, …
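During integration, the date spellings listed above must be normalized to one format. A minimal sketch, assuming day-first dates; the ambiguous month-first form "12/19/97" and ordinal forms like "19th Dec. 1997" would need extra handling:

```python
from datetime import datetime

# Formats tried in order; two-digit-year patterns come first so that
# "97" is not accepted as the literal year 97 by %Y.
FORMATS = ["%d/%m/%y", "%d/%m/%Y", "%d-%m-%y", "%b %d, %Y", "%d %B %Y"]

def normalize_date(text):
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text}")

for raw in ["19/12/97", "19/12/1997", "19-12-97", "Dec 19, 1997",
            "19 December 1997"]:
    print(raw, "->", normalize_date(raw))  # all -> 1997-12-19
```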
Data Integration: Inconsistent
▪ Inconsistent data: containing discrepancies in codes or names, often
due to a lack of standardization or naming conventions. e.g.,
▫ Age=“26” vs. Birthday=“03/07/1986”
▫ Some use “1,2,3” for rating; others “A, B, C”
▪ Discrepancy between duplicate records
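The Age-vs-Birthday discrepancy above can be checked mechanically; a sketch assuming a day/month/year birthday format and using the deck's own date as the reference day:

```python
from datetime import date

def age_on(birthday, today):
    """Age implied by a day/month/year birthday string on a given date."""
    day, month, year = (int(p) for p in birthday.split("/"))
    born = date(year, month, day)
    return today.year - born.year - ((today.month, today.day)
                                     < (born.month, born.day))

today = date(2020, 9, 25)
print(age_on("03/07/1986", today))  # 34, so a recorded Age="26" conflicts
```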

Data Reduction Strategies
▪ Data reduction: Obtain a reduced representation of the data set
that is much smaller in volume but yet produces the same (or almost
the same) analytical results
▪ Why data reduction? A database/data warehouse may store terabytes
of data. Complex data analysis may take a very long time to run on
the complete data set.

▪ Data reduction strategies:


▫ Dimensionality reduction
– Select the best attributes or remove unimportant attributes
▫ Numerosity reduction
– Reduce data volume by choosing alternative, smaller forms of
data representation
▫ Data compression
– A technology that reduces the size of large files so that the
smaller files take less memory space and are faster to transfer
over a network or the Internet
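One toy form of dimensionality reduction is dropping uninformative attributes: a column whose values never vary carries no information for mining. The attribute names and values below are made up:

```python
def variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

data = {
    "age":          [20, 35, 50, 41],
    "country_code": [1, 1, 1, 1],          # constant: no information
    "income":       [300, 800, 1500, 3000],
}

# Keep only attributes whose values actually vary
reduced = {name: col for name, col in data.items() if variance(col) > 0}
print(sorted(reduced))  # ['age', 'income']
```

Real attribute selection usually ranks features by relevance to the mining task rather than variance alone; this only illustrates the idea.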
Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be
identified with one of the new values
• Methods for data transformation
– Normalization: Scaled to fall within a smaller, specified
range of values
• min-max normalization
• z-score normalization

– Discretization: Reduce data size by dividing the range of a


continuous attribute into intervals. Interval labels can then be
used to replace actual data values
▫ Discretization can be performed recursively on an attribute
using methods such as:
– Binning: divide values into intervals
– Concept hierarchy climbing: organizes concepts (i.e.,
attribute values) hierarchically
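Min-max normalization, z-score normalization, and equal-width binning can be sketched as follows; the salary values are made up:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: scale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Z-score normalization: zero mean, unit (population) std dev."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def equal_width_bins(values, k):
    """Binning: map each value to one of k equal-width intervals 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

salaries = [300, 800, 1500, 3000]
print(min_max(salaries))              # scaled into [0, 1]
print(equal_width_bins(salaries, 3))  # interval labels replace raw values
```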
Data sets preparation for learning
A standard machine learning technique is to divide the dataset into a
training set and a test set.
Training dataset is used for learning the parameters of the model
in order to produce hypotheses.
A training set is a set of problem instances (described as a set of
properties and their values), together with a classification of
the instance.

Test dataset, which is never seen during the hypothesis forming


stage, is used to get a final, unbiased estimate of how well the
model works.
Test set evaluates the accuracy of the model/hypothesis in
predicting the categorization of unseen examples.
A set of instances and their classifications used to test the
accuracy of a learned hypothesis.
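The split described above can be sketched without any library; the 70/30 ratio and seed are arbitrary choices for illustration:

```python
import random

def train_test_split(instances, test_ratio=0.3, seed=42):
    """Shuffle (to avoid order bias) and hold out a test set that is
    never seen during training."""
    instances = list(instances)
    random.Random(seed).shuffle(instances)
    cut = int(len(instances) * (1 - test_ratio))
    return instances[:cut], instances[cut:]

data = list(range(10))  # ten labeled problem instances (hypothetical)
train, test = train_test_split(data, test_ratio=0.3)
print(len(train), len(test))  # 7 3
```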
Classification: Train, Validation, Test split

Data Mining Main Tasks

DM Task: Predictive Modeling
A predictive model makes a prediction/forecast about values of data
using known results found from different historical data
Prediction Methods use existing variables to predict unknown or
future values of other variables.
Predict one variable Y given a set of other variables X. Here X could
be an n-dimensional vector
In effect this is function approximation through learning the
relationship between Y and X
Many, many algorithms for predictive modeling in statistics and
machine learning, including
Classification, regression, etc.

Often the emphasis is on predictive accuracy, less emphasis on


understanding the model

Prediction Problems: Classification vs.
Numeric Prediction
Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data

Numeric Prediction
models continuous-valued functions, i.e., predicts unknown or
missing values
Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior

Models and Patterns
▪ Model = abstract representation of given training data
e.g., very simple linear model structure
Y=aX+b
▫ a & b are parameters determined from the data
▫ Y = aX + b is the model structure
▫ Y = 0.9X + 0.3 is a particular model
▪ Pattern represents “local structure” in a dataset
▫E.g., if X>x then Y >y with probability p
▪ Example: Given a finite sample of <x, f(x)> pairs, can we create a
model that holds for future values?
 To guess the true function f, find some pattern (called a
hypothesis) in the training examples, and assume that the
pattern will hold for future examples too.
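The particular model Y = 0.9X + 0.3 above can be recovered from data by least squares; the sample points below are generated from that very line, so the fit returns a ≈ 0.9, b ≈ 0.3:

```python
def fit_line(xs, ys):
    """Least-squares estimates of the parameters a and b in Y = aX + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [0, 1, 2, 3]
ys = [0.3, 1.2, 2.1, 3.0]   # points on Y = 0.9X + 0.3
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # 0.9 0.3
```

With noisy real data the fitted parameters only approximate the true relationship, which is the "function approximation" view of prediction.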
Predictive Modeling: Customer Scoring

▪ Example: a bank has a database of 1 million past customers, 10% of


whom took out mortgages

▪ Use machine learning to rank new customers as a function of


p(mortgage | customer data)

▪ Customer data
▫ History of transactions with the bank
▫ Other credit data (obtained from Experian, etc)
▫ Demographic data on the customer or where they live

▪ Techniques
▫ Binary classification: logistic regression, decision trees, etc
▫ Many, many applications of this nature

DM Task: Descriptive Modeling
• Goal is to build a “descriptive” model that models the underlying
observations
– e.g., a model that could simulate the data if needed

[Figure: scatter plot of Red Blood Cell Volume vs. Red Blood Cell
Hemoglobin Concentration, EM iteration 25]

▪ A descriptive model identifies patterns or relationships in data
▫ Unlike the predictive model, a descriptive model serves as a way
to explore the properties of the data examined, not to predict
new properties
• Description methods find human-interpretable patterns that describe
and find natural groupings of the data.
• Methods used in descriptive modeling are: clustering,
summarization, association rule discovery, etc.

Example of Descriptive Modeling

▪ goal: learn directed relationships among p variables


▪ techniques: directed (causal) graphs
▪ challenge: distinguishing between correlation and causation
▫ Example: Do yellow fingers cause lung cancer?

Pattern (Association Rule) Discovery

▪ Goal is to discover interesting “local” patterns (sequential patterns)


in the data rather than to characterize the data globally
▫ Also called link analysis (uncovers relationships among data)

▪ Given market basket data we might discover that


▫ If customers buy wine and bread then they buy cheese with
probability 0.9

▪ Methods used in pattern discovery include:


▫ Association rules, Sequence discovery, etc.
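The "probability 0.9" in the wine-and-bread example is the rule's confidence: among baskets containing the antecedent, the fraction that also contain the consequent. A sketch on a made-up market-basket data set:

```python
# Confidence of the rule {wine, bread} -> {cheese} on hypothetical baskets
transactions = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "cheese"},
]

antecedent, consequent = {"wine", "bread"}, {"cheese"}
covered = [t for t in transactions if antecedent <= t]
confidence = sum(1 for t in covered if consequent <= t) / len(covered)
print(round(confidence, 2))  # 0.67: 2 of 3 wine-and-bread baskets add cheese
```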

Example of Pattern Discovery
▪ Example in retail: Customer transactions to consumer behavior:
▫ People who bought “Da Vinci Code” also bought “The Five People
You Meet in Heaven” (www.amazon.com)
▪ Example: football player behavior
▫ If player A is in the game, player B’s scoring rate increases from
25% chance per game to 95% chance per game
▪ What about the following?
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBB
DBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBBBBCCAC
DABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBA
ACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABACBDDDCBAD
CBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDD
DADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADA
ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDAD
ABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCB
AAADCADDADAABBACCBB
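One way to look for local structure in a symbol sequence like the one above is to count n-gram frequencies and see which short patterns occur unusually often; a sketch using just the first line of the sequence:

```python
from collections import Counter

def ngram_counts(sequence, n):
    """Count every length-n substring (n-gram) in the sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

# First line of the sequence shown above
seq = "ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBB"
bigrams = ngram_counts(seq, 2)
print(bigrams.most_common(3))  # the most frequent pairs hint at local patterns
```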
Basic Data Mining algorithms

▪ Classification: also called supervised learning, maps data into
predefined groups or classes to enhance the prediction process

▪ Clustering: also called unsupervised learning, groups similar data
together into clusters.
▫ is used to find appropriate groupings of elements for a set of data.
▫ Unlike classification, clustering is a kind of undirected knowledge
discovery or unsupervised learning; i.e., there is no target field & the
relationship among the data is identified by a bottom-up approach.
▪ Association Rule: is also known as market-basket analysis
▫ It discovers interesting associations between attributes contained in a
database.
▫ Based on the frequency of co-occurrence of items in events, an
association rule tells us: if item X is part of an event, what is the
likelihood that item Y is also part of the event?
Supervised vs. Unsupervised Learning

▪ Supervised learning (classification)


▫ Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
▫ New data is classified based on the training set

▪ Unsupervised learning (clustering)


▫ The class labels of the training data are unknown
▫ Given a set of measurements, observations, etc. with the aim
of establishing the existence of classes or clusters in the data

Questions?
