
BIG DATA ANALYTICS

Lecture 4 --- Week 4


Content

 Handling missing and noisy data

 Smoothing techniques (Binning method, Clustering, Combined computer and human inspection, Regression, Use Concept hierarchies)

 Inconsistent Data

 Data Reduction Strategies

 Data Cube Aggregation

 Dimensionality Reduction
Data Cleaning

 Data cleaning tasks

 Fill in missing values

 Identify outliers and smooth out noisy data

 Correct inconsistent data


How to Handle Missing Data?

 Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing
values per attribute varies considerably.

 Fill in the missing value manually: tedious and often infeasible for large data sets

 Use a global constant to fill in the missing value: e.g., “unknown”; this
effectively creates a new class

 Use the attribute mean to fill in the missing value

 Use the attribute mean for all samples belonging to the same class to fill in
the missing value: smarter

 Use the most probable value to fill in the missing value: inference-based
such as Bayesian formula or decision tree
How to Handle Missing Data?

Age Income Religion Gender


23 24,200 Muslim M
39 ? Christian F
45 45,390 ? F

Fill missing values using aggregate functions (e.g., the average) or probabilistic
estimates based on the global value distribution
E.g., put the average income here, or put the most probable income given
that the person is 39 years old
E.g., put the most frequent religion here
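Below is a minimal pandas sketch of these imputation strategies for the example table; the code and the class-conditional variant are illustrative, not part of the lecture.

```python
import pandas as pd
import numpy as np

# Toy data mirroring the table above
df = pd.DataFrame({
    "age":      [23, 39, 45],
    "income":   [24200, np.nan, 45390],
    "religion": ["Muslim", "Christian", np.nan],
    "gender":   ["M", "F", "F"],
})

# Numeric attribute: fill with the attribute mean
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical attribute: fill with the most frequent value (mode)
df["religion"] = df["religion"].fillna(df["religion"].mode()[0])

# Class-conditional variant (mean income within each gender group, as a stand-in class label)
# df["income"] = df.groupby("gender")["income"].transform(lambda s: s.fillna(s.mean()))
```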
Noisy Data

 Noise: random error or variance in a measured variable


 Incorrect attribute values may exist due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems that require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
Smoothing techniques
 Binning method:
 first sort data and partition into (equi-depth) bins
 then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 computer detects suspicious values, which are then checked by
humans
 Regression
 smooth by fitting the data to a regression function (see the sketch after this list)
 Use Concept hierarchies
 use concept hierarchies, e.g., price value -> “expensive”
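As a rough illustration of the regression option above, a least-squares line can replace noisy values with fitted ones; the data here is synthetic.

```python
import numpy as np

# Synthetic noisy measurements of y as a function of x
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])

# Fit y ~ a*x + b, then smooth by replacing each y with its fitted value
a, b = np.polyfit(x, y, deg=1)
y_smoothed = a * x + b
print(y_smoothed)
```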
Simple Discretization Methods:
Binning
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
 The most straightforward
 But outliers may dominate presentation
 Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing
approximately the same number of samples
 Good data scaling – good handling of skewed data
Simple Discretization Methods: Binning

Example: binning customer ages (histogram of the number of values per age)

Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
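A small pandas sketch contrasting the two schemes on synthetic ages (not the lecture's data):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(0, 80, size=1000))

# Equal-width: 8 intervals of equal width over the age range
equal_width = pd.cut(ages, bins=8)

# Equal-depth (equal-frequency): 8 intervals with roughly equal counts
equal_depth = pd.qcut(ages, q=8)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```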
Smoothing using Binning Methods

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: [4,15],[21,25],[26,34]
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
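A NumPy sketch reproducing this worked example (rounding the bin means is an assumption made to match the slide's integer values):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)                     # equi-depth bins of 4 sorted values

# Smoothing by bin means: replace each value with its bin's (rounded) mean
by_means = np.repeat(np.round(bins.mean(axis=1)).astype(int), 4).reshape(3, 4)

# Smoothing by bin boundaries: snap each value to the nearer of its bin's min/max
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)    # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```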
Inconsistent Data

 Inconsistent data are handled by:

 Manual correction (expensive and tedious)

 Use routines designed to detect inconsistencies and manually correct them, e.g.,
a routine may check global constraints (such as age > 10) or functional
dependencies

 Other inconsistencies (e.g., different names used for the same attribute) can be
corrected during the data integration process
Data Integration

 Data integration:
 combines data from multiple sources into a coherent store
 Schema integration
 integrate metadata from different sources
 metadata: data about the data (i.e., data descriptors)
 Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-# (see the sketch after this list)
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from different
sources are different (e.g., J.D.Smith and John Smith may refer to
the same person)
 possible reasons: different representations, different scales, e.g.,
metric vs. British units (inches vs. cm)
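A hedged pandas sketch of resolving such an entity-identification mismatch by aligning key names before merging; the tables and columns are invented for illustration.

```python
import pandas as pd

# Two hypothetical sources that name the customer key differently
a = pd.DataFrame({"cust-id": [1, 2, 3], "city": ["Lahore", "Karachi", "Multan"]})
b = pd.DataFrame({"cust-#": [2, 3, 4], "annual_spend": [500, 750, 300]})

# Schema integration: treat A.cust-id and B.cust-# as the same entity key
b = b.rename(columns={"cust-#": "cust-id"})
merged = a.merge(b, on="cust-id", how="outer")
print(merged)
```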
Handling Redundant Data in Data Integration

 Redundant data often occur when integrating multiple databases
 The same attribute may have different names in different databases
 One attribute may be a “derived” attribute in another table, e.g.,
annual revenue

 Redundant attributes may be detected by correlation analysis (see the
sketch below)
 Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
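A minimal sketch of redundancy detection via correlation analysis; the attributes are synthetic, and annual_revenue is derived from monthly_revenue, so the two correlate perfectly.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
monthly = rng.uniform(1000, 5000, size=100)
df = pd.DataFrame({
    "monthly_revenue": monthly,
    "annual_revenue": monthly * 12,              # derived, hence redundant
    "num_employees": rng.integers(1, 50, size=100),
})

# Attribute pairs with |correlation| close to 1 are candidates for removal
print(df.corr().round(2))
```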
Data Transformation

 Smoothing: remove noise from data


 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling

 Attribute/feature construction
 New attributes constructed from the given ones
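For the attribute/feature construction item, a minimal illustration with made-up attributes:

```python
import pandas as pd

sales = pd.DataFrame({"revenue": [1200, 800, 1500], "cost": [700, 650, 900]})

# Construct new attributes from the given ones
sales["profit"] = sales["revenue"] - sales["cost"]
sales["margin"] = sales["profit"] / sales["revenue"]
print(sales)
```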
Normalization: Why normalization?

 Speeds up some learning techniques (e.g., neural networks)

 Helps prevent attributes with large ranges from outweighing ones with small
ranges

 Example:

 income has range 3000-200000

 age has range 10-80

 gender has domain M/F


Data Transformation: Normalization

 min-max normalization
v' = ((v − minA) / (maxA − minA)) * (new_maxA − new_minA) + new_minA
 e.g., convert age=30 to range 0-1, when min=10, max=80:
new_age = (30−10)/(80−10) = 2/7 ≈ 0.29
 z-score normalization
v' = (v − meanA) / stand_devA
 normalization by decimal scaling
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
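A NumPy sketch of the three normalizations (the helper names are my own, not from the lecture):

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Rescale values into [new_min, new_max]."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Center on the mean and scale by the standard deviation."""
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    j = 0
    while np.abs(v / 10 ** j).max() >= 1:
        j += 1
    return v / 10 ** j

ages = np.array([10.0, 30.0, 55.0, 80.0])
print(min_max(ages))          # 30 maps to (30-10)/(80-10) ≈ 0.286
print(z_score(ages))
print(decimal_scaling(ages))  # divides by 100
```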
Data Reduction Strategies

 Warehouse may store terabytes of data: complex data
analysis/mining may take a very long time to run on the
complete data set
 Data reduction
 Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same)
analytical results
 Data reduction strategies
 Data cube aggregation
 Dimensionality reduction
 Data compression
 Numerosity reduction
 Discretization and concept hierarchy generation
Data Cube Aggregation

 The lowest level of a data cube


 the aggregated data for an individual entity of interest
 e.g., a customer in a phone calling data warehouse.

 Multiple levels of aggregation in data cubes


 Further reduce the size of data to deal with

 Reference appropriate levels


 Use the smallest representation which is enough to solve the task

 Queries regarding aggregated information should be
answered using the data cube, when possible
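A rough pandas analogue of moving between aggregation levels; the call-record columns are invented for illustration.

```python
import pandas as pd

calls = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c2"],
    "year":     [2023, 2024, 2023, 2023, 2024],
    "minutes":  [30, 45, 10, 20, 60],
})

# Lowest cube level: aggregate per customer per year
per_customer_year = calls.groupby(["customer", "year"])["minutes"].sum()

# Higher level: aggregate per year only (smaller; often enough to answer the query)
per_year = calls.groupby("year")["minutes"].sum()
print(per_customer_year, per_year, sep="\n\n")
```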
Dimensionality Reduction

 Feature selection (i.e., attribute subset selection):


 Select a minimum set of features such that the probability
distribution of different classes given the values for those features
is as close as possible to the original distribution given the values
of all features
 reduces the number of attributes in the discovered patterns, making them easier to understand
 Heuristic methods (due to exponential # of choices):
 step-wise forward selection (sketched below)
 step-wise backward elimination
 combining forward selection and backward elimination
 decision-tree induction
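A hedged sketch of greedy step-wise forward selection; the scoring function is supplied by the caller (e.g., cross-validated accuracy), and the toy usage at the end is purely illustrative.

```python
def forward_selection(features, score, max_features=None):
    """Greedy step-wise forward selection.

    `score(subset)` is any caller-supplied quality measure for a feature subset.
    """
    selected, remaining = [], list(features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining feature and keep the best-scoring candidate
        candidate, cand_score = max(
            ((f, score(selected + [f])) for f in remaining), key=lambda t: t[1]
        )
        if cand_score <= best_score:   # no improvement: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = cand_score
    return selected

# Toy usage with a made-up scoring function that rewards weight and penalizes size
weights = {"age": 0.9, "income": 0.7, "gender": 0.1}
print(forward_selection(weights, score=lambda s: sum(weights[f] for f in s) - 0.2 * len(s)))
```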
Numerosity Reduction: Reduce the
volume of data
 Parametric methods
 Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
(except possible outliers)
 Log-linear models: obtain a value at a point in m-D space as a
product over appropriate marginal subspaces

 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
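An illustrative sketch of both families on synthetic data; in the parametric case only the two fitted coefficients need to be stored in place of the raw points.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 10_000)
y = 3.0 * x + 5.0 + rng.normal(scale=0.5, size=x.size)

# Parametric: keep only the model parameters (slope, intercept), discard the data
slope, intercept = np.polyfit(x, y, deg=1)

# Non-parametric: keep a random sample and/or a histogram summary instead
sample_idx = rng.choice(x.size, size=100, replace=False)
hist_counts, hist_edges = np.histogram(y, bins=20)

print(round(slope, 2), round(intercept, 2), sample_idx.size, hist_counts.sum())
```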
Discretization
 Three types of attributes:

 Nominal — values from an unordered set

 Ordinal — values from an ordered set

 Continuous — real numbers

 Discretization:

 divide the range of a continuous attribute into intervals

 why?

 Some classification algorithms only accept categorical attributes.

 Reduce data size by discretization

 Prepare for further analysis


Discretization and Concept hierarchy

 Discretization
 reduce the number of values for a given continuous attribute
by dividing the range of the attribute into intervals. Interval
labels can then be used to replace actual data values.

 Concept hierarchies
 reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
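A small pandas sketch of climbing one level of such a hierarchy for age; the cut points and labels are assumptions, not from the lecture.

```python
import pandas as pd

ages = pd.Series([23, 39, 45, 61, 17, 70])

# Replace numeric ages with higher-level concepts
age_concepts = pd.cut(
    ages,
    bins=[0, 30, 60, 120],
    labels=["young", "middle-aged", "senior"],
)
print(age_concepts)
```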
