Machine Learning
Chapter Two:
Data Preprocessing
1. Overview of data preprocessing
 Machine learning requires collecting a large amount of data to
achieve the intended objective.
 Real-world data generally comes in a format that cannot be used
directly by machine learning models.
 Before feeding data to an ML model, we have to ensure the quality
of the data.
 Data preprocessing is the process of preparing raw data and
making it suitable for a machine learning model.
 It is a crucial step in creating a machine learning model.
 It increases the accuracy and efficiency of a machine learning
model.
Data Quality
 Well-accepted multidimensional measures of data quality
include the following:
 Accuracy (free from errors and outliers)
 Completeness (no missing attributes and values)
 Consistency (no inconsistent values and attributes)
 Timeliness (appropriateness of the data for the purpose it is
required)
 Believability (acceptability)
 Interpretability (easy to understand)
Why Data Preprocessing?
 Most real-world data is of poor quality
(incomplete, inconsistent, noisy, invalid, redundant, …)
 incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 redundant: including everything, some of which is
irrelevant to our task.
No quality data, no quality results!
Data is often of low quality
 Collecting the required data is challenging
 Why?
 You didn’t collect it yourself
 It probably was created for some other use, and then you came
along wanting to integrate it.
 People make mistakes (typos)
 Data collection instruments used may be faulty.
 Everyone had their own way of structuring and formatting data,
based on what was convenient for them.
 Users may purposely submit incorrect data values for
mandatory fields when they do not wish to submit personal
information.
2. Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
 Data integration
 Integration of data from multiple data sources
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Data discretization (for numerical data) and Concept hierarchy generation
Forms of data preprocessing
2.1. Data Cleaning
 Data cleaning attempts to:
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Incomplete (Missing) Data:
 Data is not always available
 many tuples have no recorded value for several attributes,
such as customer income in sales data.
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of
entry
 history or changes of the data were not registered.
How to Handle Missing Value?
 Ignore the tuple:
 usually done when the class label is missing (in
classification).
 not effective unless the tuple contains several attributes with
missing values.
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with:
 a global constant: e.g., “unknown” (which may act as a new class?!)
 a measure of central tendency for the attribute (e.g., the
mean or median)
 e.g., if the average customer income is $28,000, use this
value as the replacement.
 the most probable value:
 determined with regression, inference-based methods such as the
Bayesian formula, or a decision tree. (most popular)
How to Handle Missing Data?
Age   Income   Religion    Gender
23    24,200   Muslim      M
39    ?        Christian   F
45    45,390   ?           F
Fill missing values using aggregate functions (e.g., average) or probabilistic
estimates on the global value distribution:
 E.g., put the average income here, or the most probable income given
that the person is 39 years old.
 E.g., put the most frequent religion here.
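A minimal sketch of these filling strategies using pandas (the columns mirror the toy table above; note that with ties, mode() picks a value arbitrarily):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age": [23, 39, 45],
    "Income": [24_200, np.nan, 45_390],
    "Religion": ["Muslim", "Christian", np.nan],
})

# Option 1: ignore (drop) tuples that contain missing values
dropped = df.dropna()

# Option 2: fill a numeric attribute with a measure of central tendency
df["Income"] = df["Income"].fillna(df["Income"].mean())

# Option 3: fill a categorical attribute with its most frequent value
df["Religion"] = df["Religion"].fillna(df["Religion"].mode()[0])
```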
Noisy Data
 Noise is a random error or variance in a measured
variable.
 Incorrect attribute values may be due to
 faulty data collection instruments (e.g., OCR)
 data entry problems, e.g., ‘green’ written as ‘rgeen’
 data transmission problems
 technology limitations
 inconsistency in naming conventions
How to Handle Noisy Data?
Manually check all data: tedious + infeasible?
Sort data by frequency:
‘green’ is more frequent than ‘rgeen’
Works well for categorical data
 Use numerical constraints to catch corrupt data:
 Weight can’t be negative
 People can’t have more than 2 parents
 Salary can’t be less than Birr 300
Check for outliers (the case of the 8-meter man)
Check for correlated outliers using n-grams (“pregnant
male”):
People can be male
People can be pregnant
People can’t be male AND pregnant
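As a sketch, such numerical constraints can be expressed as simple boolean filters in pandas (the column names and thresholds are illustrative):

```python
import pandas as pd

records = pd.DataFrame({
    "weight_kg": [72.5, -4.0, 88.1],
    "salary_birr": [5_000, 150, 9_800],
})

# Flag rows that violate the domain constraints above
violations = records[
    (records["weight_kg"] < 0)        # weight can't be negative
    | (records["salary_birr"] < 300)  # salary can't be less than Birr 300
]
print(violations)
```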
2.2. Data Integration
 Data integration combines data from multiple sources
into a coherent store
 Because different sources are used, data that is
fine on its own may become problematic when we want
to integrate it.
 Some of the issues are:
Different formats and structures
Conflicting and redundant data
Data at different levels
Data Integration: Formats
 Not everyone uses the same format. Do you agree?
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Dates are especially problematic:
 12/19/97
 19/12/97
 19/12/1997
 19-12-97
 Dec 19, 1997
 19 December 1997
 19th Dec. 1997
 Are you frequently writing money as:
 Birr 200, Br. 200, 200 Birr, …
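One hedged way to reconcile such date formats is to try a list of known formats explicitly; a sketch with pandas (the format list below is an assumption about which formats occur):

```python
import pandas as pd

raw_dates = ["12/19/97", "19-12-97", "Dec 19, 1997", "19 December 1997"]

# Candidate formats to try in order; pd.NaT marks anything unparseable
formats = ["%m/%d/%y", "%d-%m-%y", "%b %d, %Y", "%d %B %Y"]

def parse_date(s):
    for fmt in formats:
        try:
            return pd.to_datetime(s, format=fmt)
        except ValueError:
            continue
    return pd.NaT

print([parse_date(s) for s in raw_dates])  # all parse to 1997-12-19
```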
Data Integration: Inconsistent
Inconsistent data contains discrepancies in codes or
names, often stemming from a lack of standardization in
naming conventions, e.g.,
Age=“26” vs. Birthday=“03/07/1986”
Some use “1,2,3” for ratings; others “A, B, C”
Data Integration: Conflicting Data
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
American vs. British units
 weight measurement: kg or pound
 height measurement: meter or inch
2.3.Data Reduction Strategies
Data reduction: obtain a reduced representation of the data set that
is much smaller in volume yet produces the same (or almost the
same) analytical results.
Data reduction strategies:
Dimensionality reduction
 Select best attributes or remove unimportant attributes
Numerosity reduction
 Reduce data volume by choosing alternative, smaller forms of
data representation
Data compression
Data Reduction: Dimensionality Reduction
 Dimensionality reduction
Helps eliminate irrelevant attributes and reduce noise: attributes
that contain no information useful for model development.
E.g., is a student's ID relevant for predicting the student's GPA?
Helps avoid redundant attributes: attributes that duplicate
information contained in one or more other attributes.
E.g., the purchase price of a product & the amount of sales tax paid
Reduces the time and space required in model development
Allows easier visualization
 Method: attribute subset selection
One method of reducing the dimensionality of data is selecting the
best attributes, as sketched below.
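A small sketch of attribute subset selection with scikit-learn's SelectKBest (the data is synthetic and the ANOVA scoring function is one reasonable choice, not the only one):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # 10 attributes, mostly irrelevant
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # label depends on attributes 0 and 3

# Keep the 2 attributes most associated with the class label
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the selected attributes
```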
Data Reduction: Numerosity Reduction
 Different methods can be used, including Clustering and
sampling
 Clustering
 Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
 There are many choices of clustering definitions and clustering
algorithms
 Sampling
 obtaining a small sample s to represent the whole data set N
 Key principle: choose a representative subset of the data using a
suitable sampling technique.
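A brief sketch of both ideas, assuming numeric data in a NumPy array (cluster count and sample size are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 3))

# Numerosity reduction by clustering: store only cluster representations
kmeans = KMeans(n_clusters=50, n_init=10).fit(data)
centroids = kmeans.cluster_centers_   # 50 centroids instead of 10,000 rows

# Numerosity reduction by sampling: keep a small random subset s of N
sample = data[rng.choice(len(data), size=500, replace=False)]
```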
2.4. Data Transformation
 A function that maps the entire set of values of a given
attribute to a new set of replacement values, such that
each old value can be identified with one of the new
values.
 Methods for data transformation
 Normalization: Scaled to fall within a smaller, specified range of
values
• min-max normalization
• z-score normalization
• decimal scaling
 Discretization: Reduce data size by dividing the range of a
continuous attribute into intervals.
– Discretization can be performed recursively on an attribute
using methods such as
 Binning: divide values into intervals
 Concept hierarchy climbing: organizes concepts (i.e., attribute
values) hierarchically
Data Transformation: Normalization
 Min-max normalization:
$v' = \dfrac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
 Z-score normalization:
$v' = \dfrac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$
 Normalization by decimal scaling:
$v' = \dfrac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
Example:
 Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively.
We would like to map income to the range [0.0, 1.0].
 Suppose that the mean and standard deviation of the
values for the attribute income are $54,000 and $16,000,
respectively.
 Suppose that the recorded values of A range from –986 to
917.
Normalization
 Min-max normalization:
– Ex. The income range $12,000 to $98,000 is normalized to
[0.0, 1.0]. Then $73,600 is mapped to
$\dfrac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$
 Z-score normalization (μ: mean, σ: standard deviation): $v' = \dfrac{v - \mu_A}{\sigma_A}$
– Ex. Let μ = 54,000, σ = 16,000. Then
$\dfrac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
 Decimal scaling: suppose that the recorded values of A range from −986 to
917. To normalize by decimal scaling, we therefore divide each value by 1,000
(i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
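The three normalizations above can be reproduced in a few lines of NumPy; a minimal sketch using the worked numbers from this slide:

```python
import numpy as np

income = np.array([12_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0
print(minmax[1])             # 0.716...

# Z-score normalization with the stated mean and standard deviation
zscore = (income - 54_000) / 16_000
print(zscore[1])             # 1.225

# Decimal scaling for values ranging over [-986, 917]: j = 3
values = np.array([-986.0, 917.0])
print(values / 10**3)        # [-0.986  0.917]
```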
Discretization and Concept Hierarchy
 Discretization
 reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals.
 Interval labels can then be used to replace actual data values
 Example:
 Binning methods – equal-width, equal-frequency
Binning
 Attribute values (for one attribute e.g., age):
 0, 4, 12, 16, 16, 18, 24, 26, 28
 Equi-width binning – for a bin width of e.g., 10:
 Bin 1: 0, 4 [−∞, 10) bin
 Bin 2: 12, 16, 16, 18 [10, 20) bin
 Bin 3: 24, 26, 28 [20, +∞) bin
 −∞ denotes negative infinity, +∞ positive infinity
 Equi-frequency binning – for a bin density of e.g., 3:
 Bin 1: 0, 4, 12 [−∞, 14) bin
 Bin 2: 16, 16, 18 [14, 21) bin
 Bin 3: 24, 26, 28 [21, +∞] bin
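A sketch of both binning schemes with pandas: cut for equal-width intervals, qcut for equal-frequency (qcut places its own boundaries at the data's quantiles, so they will differ slightly from the hand-picked cut points above):

```python
import pandas as pd

age = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equal-width style binning with explicit edges; right=False gives [a, b) bins
equal_width = pd.cut(age, bins=[float("-inf"), 10, 20, float("inf")],
                     right=False)

# Equal-frequency binning: 3 bins with (roughly) equal counts
equal_freq = pd.qcut(age, q=3)

print(equal_width.value_counts(sort=False))  # counts 2, 4, 3
print(equal_freq.value_counts(sort=False))   # counts 3, 3, 3
```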
Concept Hierarchy Generation
Concept hierarchy:
organizes concepts (i.e., attribute values)
hierarchically.
Concept hierarchy formation:
 Recursively reduce the data by collecting
and replacing low level concepts (such as
numeric values for age) by higher level
concepts (such as child, youth, adult, or
senior)
Concept hierarchies can be explicitly
specified by domain experts, e.g.:
country > region or state > city > sub-city > kebele
 A hierarchy can also be formed
automatically by analyzing the number
of distinct values, e.g., for the set of
attributes {kebele, city, state, country}.
 For numeric data, use discretization
methods, as sketched below.
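A minimal sketch of climbing a numeric attribute up a concept hierarchy with pandas (the age boundaries separating child, youth, adult, and senior are assumptions for illustration):

```python
import pandas as pd

age = pd.Series([3, 15, 28, 47, 70])

# Replace low-level numeric values with higher-level concepts
concepts = pd.cut(
    age,
    bins=[0, 13, 25, 60, 120],
    labels=["child", "youth", "adult", "senior"],
)
print(concepts.tolist())  # ['child', 'youth', 'adult', 'adult', 'senior']
```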
3. Dataset
 A dataset is a collection of data
objects and their attributes
 An attribute is a property or
characteristic of an object
 Examples: eye color of a person,
temperature, etc.
 Attribute is also known as variable,
field, characteristic, dimension, or
feature
 A collection of attributes
describe an object
 Object is also known as record,
point, case, sample, entity, or
instance
Tid   Refund   Marital Status   Taxable Income   Cheat
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single            70K             No
 4    Yes      Married          120K             No
 5    No       Divorced          95K             Yes
 6    No       Married           60K             No
 7    Yes      Divorced         220K             No
 8    No       Single            85K             Yes
 9    No       Married           75K             No
10    No       Single            90K             Yes
(Columns correspond to attributes; rows correspond to objects.)
Types of Attributes
 The type of an attribute is determined by the set of possible values
the attribute can have: nominal, binary, ordinal, or numeric.
 There are different types of attributes:
Nominal: means “relating to names”.
 The values of a nominal attribute are symbols or names of
things.
 Nominal attributes are also referred to as categorical.
 Examples: hair color (black, brown, blond, etc.), marital
status (single, married, divorced, widowed), occupation,
etc.
Ordinal:
 an attribute with possible values that have a meaningful order
or ranking among them.
 Examples: rankings (e.g., grades), height {tall, medium, short}
Types of Attributes..
Binary:
 a nominal attribute with only two categories or
states: 0 (absent) or 1 (present), or Boolean (true or false).
 Example: smoker (0 = non-smoker, 1 = smoker)
Interval-scaled (numeric attributes):
 measured on a scale of equal-size units.
 allow us to compare and quantify the difference between values.
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio-scaled (numeric attributes):
 a value can be interpreted as a multiple (or ratio) of another value.
 Examples: temperature in Kelvin, length, time, counts.
Dataset preparation for learning
A standard machine learning technique is to divide the dataset into a
training set and a test set.
 The training dataset is used for model development.
 The test dataset is never seen during the model development stage and is
used to evaluate the accuracy of the model.
 There are various ways to separate the data into training
and test sets:
 The holdout method
 Cross-validation
 The bootstrap
The holdout method
 In this method, the given data are randomly partitioned
into two independent sets, a training set and a test set.
 Usually: one third for testing, the rest for training.
 For small or “unbalanced” datasets, samples might not
be representative:
 few or no instances of some classes.
 Stratified sampling: an advanced version that balances the
data:
 makes sure that each class is represented with approximately
equal proportions in both subsets (see the sketch below).
 Random subsampling: a variation of the holdout method in
which the holdout method is repeated k times.
 The overall accuracy estimate is taken as the average of the
accuracies obtained from each iteration.
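A sketch of the stratified holdout split with scikit-learn (the data is synthetic; the one-third test fraction follows the slide):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 2, size=300)   # binary class labels

# Holdout: one third for testing, stratified so class proportions match
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)
print(len(X_train), len(X_test))   # 200 100
```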
Cross-validation
 Cross-validation works as follows:
 First step: the data is randomly split into k subsets of equal
size.
 A partition of a set is a collection of subsets for which the
intersection of any pair of sets is empty; that is, no element of
one subset is an element of another subset in the partition.
 Second step: each subset in turn is used for testing and the
remainder for training.
This is called k-fold cross-validation.
 Often the subsets are stratified before the cross-validation is
performed.
 The error estimates are averaged to yield an overall error
estimate.
Cross-validation example:
— Break up data into groups of the same size
— Hold aside one group for testing and use the rest to build model
— Repeat
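A sketch of stratified k-fold cross-validation with scikit-learn (the logistic-regression model and k = 5 are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = (X[:, 0] > 0).astype(int)

scores = []
# Each fold is held aside once for testing; the rest builds the model
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(np.mean(scores))   # averaged overall estimate
```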
Bootstrap
 The bootstrap method samples the given training tuples uniformly
with replacement:
 the same tuple may be selected more than once.
 A commonly used one is the .632 bootstrap
 Suppose we are given a data set of d tuples. The data set is
sampled d times, with replacement, resulting in a bootstrap sample
or training set of d samples.
 The data tuples that did not make it into the training set end up
forming the test set.
 on average, 63.2% of the original data tuples will end up in the
bootstrap sample, and the remaining 36.8% will form the test set
(hence, the name, .632 bootstrap)
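A sketch of one bootstrap round in NumPy, illustrating why roughly 63.2% of the tuples land in the training sample (the dataset here is just an index array):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
indices = np.arange(d)

# Sample d times with replacement to form the bootstrap training set
train_idx = rng.choice(indices, size=d, replace=True)

# Tuples never drawn form the test set
test_idx = np.setdiff1d(indices, train_idx)

print(len(np.unique(train_idx)) / d)  # ≈ 0.632
print(len(test_idx) / d)              # ≈ 0.368
```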
Assignment
 Explain PCA (Principal Component Analysis):
 How it works
 Its advantages and disadvantages