
BIG DATA ANALYTICS

Lecture 4 --- Week 4


Content

 Handling missing and noisy data

 Smoothing techniques (Binning method, Clustering, Combined computer and human inspection, Regression, Use Concept hierarchies)

 Inconsistent Data

 Data Reduction Strategies

 Data Cube Aggregation

 Dimensionality Reduction
Data Cleaning

 Data cleaning tasks

 Fill in missing values

 Identify outliers and smooth out noisy data

 Correct inconsistent data


How to Handle Missing Data?

 Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing
values per attribute varies considerably.

 Fill in the missing value manually: tedious and often infeasible for large data sets

 Use a global constant to fill in the missing value: e.g., “unknown”; this
effectively creates a new class

 Use the attribute mean to fill in the missing value

 Use the attribute mean for all samples belonging to the same class to fill in
the missing value: smarter

 Use the most probable value to fill in the missing value: inference-based
such as Bayesian formula or decision tree
How to Handle Missing Data?

Age Income Religion Gender


23 24,200 Muslim M
39 ? Christian F
45 45,390 ? F

Fill missing values using aggregate functions (e.g., the average) or probabilistic
estimates based on the global value distribution
E.g., put the average income here, or put the most probable income given
that the person is 39 years old
E.g., put the most frequent religion here
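Below is a minimal pandas sketch of these imputation strategies for the example table; the code and the class-conditional variant are illustrative, not part of the lecture.

```python
import pandas as pd
import numpy as np

# Toy data mirroring the table above
df = pd.DataFrame({
    "age":      [23, 39, 45],
    "income":   [24200, np.nan, 45390],
    "religion": ["Muslim", "Christian", np.nan],
    "gender":   ["M", "F", "F"],
})

# Numeric attribute: fill with the attribute mean
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical attribute: fill with the most frequent value (mode)
df["religion"] = df["religion"].fillna(df["religion"].mode()[0])

# Class-conditional variant (mean income within each gender group, as a stand-in class label)
# df["income"] = df.groupby("gender")["income"].transform(lambda s: s.fillna(s.mean()))
```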
Noisy Data

 Noise: random error or variance in a measured variable


 Incorrect attribute values may exist due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems that require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
Smoothing techniques
 Binning method:
 first sort data and partition into (equi-depth) bins
 then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 computer detects suspicious values, which are then checked by
humans
 Regression
 smooth by fitting the data to a regression function (see the sketch after this list)
 Use Concept hierarchies
 use concept hierarchies, e.g., price value -> “expensive”
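As a rough illustration of the regression option above, a least-squares line can replace noisy values with fitted ones; the data here is synthetic.

```python
import numpy as np

# Synthetic noisy measurements of y as a function of x
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])

# Fit y ~ a*x + b, then smooth by replacing each y with its fitted value
a, b = np.polyfit(x, y, deg=1)
y_smoothed = a * x + b
print(y_smoothed)
```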
Simple Discretization Methods:
Binning
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
 The most straightforward
 But outliers may dominate presentation
 Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing
approximately the same number of samples
 Good data scaling – good handling of skewed data
Simple Discretization Methods: Binning

Example: binning customer ages (histogram of the number of values per age)

Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
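A small pandas sketch contrasting the two schemes on synthetic ages (not the lecture's data):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(0, 80, size=1000))

# Equal-width: 8 intervals of equal width over the age range
equal_width = pd.cut(ages, bins=8)

# Equal-depth (equal-frequency): 8 intervals with roughly equal counts
equal_depth = pd.qcut(ages, q=8)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```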
Smoothing using Binning Methods

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: [4,15],[21,25],[26,34]
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
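A NumPy sketch reproducing this worked example (rounding the bin means is an assumption made to match the slide's integer values):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)                     # equi-depth bins of 4 sorted values

# Smoothing by bin means: replace each value with its bin's (rounded) mean
by_means = np.repeat(np.round(bins.mean(axis=1)).astype(int), 4).reshape(3, 4)

# Smoothing by bin boundaries: snap each value to the nearer of its bin's min/max
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)    # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```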
Inconsistent Data

 Inconsistent data are handled by:

 Manual correction (expensive and tedious)

 Use routines designed to detect inconsistencies and manually correct them, e.g.,
a routine may check global constraints (such as age > 10) or functional
dependencies

 Other inconsistencies (e.g., different names used for the same attribute) can be
corrected during the data integration process
Data Integration

 Data integration:
 combines data from multiple sources into a coherent store
 Schema integration
 integrate metadata from different sources
 metadata: data about the data (i.e., data descriptors)
 Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-# (see the sketch after this list)
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from different
sources are different (e.g., J.D.Smith and John Smith may refer to
the same person)
 possible reasons: different representations, different scales, e.g.,
metric vs. British units (inches vs. cm)
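A hedged pandas sketch of resolving such an entity-identification mismatch by aligning key names before merging; the tables and columns are invented for illustration.

```python
import pandas as pd

# Two hypothetical sources that name the customer key differently
a = pd.DataFrame({"cust-id": [1, 2, 3], "city": ["Lahore", "Karachi", "Multan"]})
b = pd.DataFrame({"cust-#": [2, 3, 4], "annual_spend": [500, 750, 300]})

# Schema integration: treat A.cust-id and B.cust-# as the same entity key
b = b.rename(columns={"cust-#": "cust-id"})
merged = a.merge(b, on="cust-id", how="outer")
print(merged)
```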
Handling Redundant Data in Data Integration

 Redundant data often occur when integrating multiple databases
 The same attribute may have different names in different databases
 One attribute may be a “derived” attribute in another table, e.g.,
annual revenue

 Redundant attributes may be detected by correlation analysis (see the
sketch below)
 Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
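A minimal sketch of redundancy detection via correlation analysis; the attributes are synthetic, and annual_revenue is derived from monthly_revenue, so the two correlate perfectly.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
monthly = rng.uniform(1000, 5000, size=100)
df = pd.DataFrame({
    "monthly_revenue": monthly,
    "annual_revenue": monthly * 12,              # derived, hence redundant
    "num_employees": rng.integers(1, 50, size=100),
})

# Attribute pairs with |correlation| close to 1 are candidates for removal
print(df.corr().round(2))
```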
Data Transformation

 Smoothing: remove noise from data


 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling

 Attribute/feature construction
 New attributes constructed from the given ones
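For the attribute/feature construction item, a minimal illustration with made-up attributes:

```python
import pandas as pd

sales = pd.DataFrame({"revenue": [1200, 800, 1500], "cost": [700, 650, 900]})

# Construct new attributes from the given ones
sales["profit"] = sales["revenue"] - sales["cost"]
sales["margin"] = sales["profit"] / sales["revenue"]
print(sales)
```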
Normalization: Why normalization?

 Speeds up some learning techniques (e.g., neural networks)

 Helps prevent attributes with large ranges from outweighing ones with small
ranges

 Example:

 income has range 3000-200000

 age has range 10-80

 gender has domain M/F


Data Transformation: Normalization

 min-max normalization
v' = ((v − minA) / (maxA − minA)) * (new_maxA − new_minA) + new_minA
 e.g., convert age=30 to range 0-1, when min=10, max=80:
new_age = (30−10)/(80−10) = 2/7 ≈ 0.29
 z-score normalization
v' = (v − meanA) / stand_devA
 normalization by decimal scaling
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
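A NumPy sketch of the three normalizations (the helper names are my own, not from the lecture):

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Rescale values into [new_min, new_max]."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Center on the mean and scale by the standard deviation."""
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    j = 0
    while np.abs(v / 10 ** j).max() >= 1:
        j += 1
    return v / 10 ** j

ages = np.array([10.0, 30.0, 55.0, 80.0])
print(min_max(ages))          # 30 maps to (30-10)/(80-10) ≈ 0.286
print(z_score(ages))
print(decimal_scaling(ages))  # divides by 100
```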
Data Reduction Strategies

 Warehouse may store terabytes of data: complex data
analysis/mining may take a very long time to run on the
complete data set
 Data reduction
 Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same)
analytical results
 Data reduction strategies
 Data cube aggregation
 Dimensionality reduction
 Data compression
 Numerosity reduction
 Discretization and concept hierarchy generation
Data Cube Aggregation

 The lowest level of a data cube


 the aggregated data for an individual entity of interest
 e.g., a customer in a phone calling data warehouse.

 Multiple levels of aggregation in data cubes


 Further reduce the size of data to deal with

 Reference appropriate levels


 Use the smallest representation which is enough to solve the task

 Queries regarding aggregated information should be
answered using the data cube, when possible
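A rough pandas analogue of moving between aggregation levels; the call-record columns are invented for illustration.

```python
import pandas as pd

calls = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c2"],
    "year":     [2023, 2024, 2023, 2023, 2024],
    "minutes":  [30, 45, 10, 20, 60],
})

# Lowest cube level: aggregate per customer per year
per_customer_year = calls.groupby(["customer", "year"])["minutes"].sum()

# Higher level: aggregate per year only (smaller; often enough to answer the query)
per_year = calls.groupby("year")["minutes"].sum()
print(per_customer_year, per_year, sep="\n\n")
```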
Dimensionality Reduction

 Feature selection (i.e., attribute subset selection):


 Select a minimum set of features such that the probability
distribution of different classes given the values for those features
is as close as possible to the original distribution given the values
of all features
 reduces the number of attributes in the discovered patterns, making them easier to understand
 Heuristic methods (due to exponential # of choices):
 step-wise forward selection (sketched below)
 step-wise backward elimination
 combining forward selection and backward elimination
 decision-tree induction
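A hedged sketch of greedy step-wise forward selection; the scoring function is supplied by the caller (e.g., cross-validated accuracy), and the toy usage at the end is purely illustrative.

```python
def forward_selection(features, score, max_features=None):
    """Greedy step-wise forward selection.

    `score(subset)` is any caller-supplied quality measure for a feature subset.
    """
    selected, remaining = [], list(features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining feature and keep the best-scoring candidate
        candidate, cand_score = max(
            ((f, score(selected + [f])) for f in remaining), key=lambda t: t[1]
        )
        if cand_score <= best_score:   # no improvement: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = cand_score
    return selected

# Toy usage with a made-up scoring function that rewards weight and penalizes size
weights = {"age": 0.9, "income": 0.7, "gender": 0.1}
print(forward_selection(weights, score=lambda s: sum(weights[f] for f in s) - 0.2 * len(s)))
```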
Numerosity Reduction: Reduce the
volume of data
 Parametric methods
 Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
(except possible outliers)
 Log-linear models: obtain a value at a point in m-D space as a
product over appropriate marginal subspaces

 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
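An illustrative sketch of both families on synthetic data; in the parametric case only the two fitted coefficients need to be stored in place of the raw points.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 10_000)
y = 3.0 * x + 5.0 + rng.normal(scale=0.5, size=x.size)

# Parametric: keep only the model parameters (slope, intercept), discard the data
slope, intercept = np.polyfit(x, y, deg=1)

# Non-parametric: keep a random sample and/or a histogram summary instead
sample_idx = rng.choice(x.size, size=100, replace=False)
hist_counts, hist_edges = np.histogram(y, bins=20)

print(round(slope, 2), round(intercept, 2), sample_idx.size, hist_counts.sum())
```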
Discretization
 Three types of attributes:

 Nominal — values from an unordered set

 Ordinal — values from an ordered set

 Continuous — real numbers

 Discretization:

 divide the range of a continuous attribute into intervals

 why?

 Some classification algorithms only accept categorical attributes.

 Reduce data size by discretization

 Prepare for further analysis


Discretization and Concept hierarchy

 Discretization
 reduce the number of values for a given continuous attribute
by dividing the range of the attribute into intervals. Interval
labels can then be used to replace actual data values.

 Concept hierarchies
 reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
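A small pandas sketch of climbing one level of such a hierarchy for age; the cut points and labels are assumptions, not from the lecture.

```python
import pandas as pd

ages = pd.Series([23, 39, 45, 61, 17, 70])

# Replace numeric ages with higher-level concepts
age_concepts = pd.cut(
    ages,
    bins=[0, 30, 60, 120],
    labels=["young", "middle-aged", "senior"],
)
print(age_concepts)
```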
