Lect2 - Data Preprocessing

5/26/2020

Data Preprocessing

What is Data?

• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – An attribute is also known as a field or feature
  – Examples: eye color or age of a person
• A collection of attributes describes an object, entity, or instance


Attribute Values

• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  – The same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    Example: attribute values for ID and age are integers
    But the properties of the attribute values can differ: ID has no limit, while age has a maximum and minimum value

Types of Attributes

• There are different types of attributes
  – Nominal: categories, states
    Examples: ID numbers, eye color, zip codes
  – Binary: a nominal attribute with only 2 states (0 or 1)
    Example: gender
  – Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
    Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
  – Interval: measured on a scale of equal-sized units; values have order
    Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio: we can speak of values as being an order of magnitude larger than the unit of measurement
    Examples: temperature in Kelvin, length, time, counts
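The interval/ratio distinction can be illustrated in a few lines of Python: a statement like "twice as hot" is only meaningful on a ratio scale such as Kelvin, not on an interval scale such as Celsius (the temperatures below are made up for illustration):

```python
# Celsius is an interval scale (arbitrary zero); Kelvin is a ratio scale.
c1, c2 = 10.0, 20.0
k1, k2 = c1 + 273.15, c2 + 273.15

print(c2 / c1)  # 2.0 -- but 20 C is not "twice as hot" as 10 C
print(k2 / k1)  # ~1.035 -- the physically meaningful ratio
```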



Discrete and Continuous Attributes


• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.


Basic Statistical Descriptions of Data

• Motivation
  – To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  – median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  – Data dispersion: analyzed with multiple granularities of precision
• Dispersion analysis on computed measures
  – Folding measures into numerical dimensions

Summary Statistics: Measuring the Central Tendency

• Summary statistics are numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
  – Examples: location - mean; spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data
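The single-pass claim can be sketched with Welford's online algorithm (a standard method not named in the slides), which updates the mean and variance one value at a time without storing the data:

```python
# Welford's online algorithm: mean and variance in a single pass.
def running_mean_variance(values):
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # accumulates sum of squared deviations
    variance = m2 / n if n > 0 else 0.0  # population variance
    return mean, variance

mean, var = running_mean_variance([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, var)  # 5.0 4.0
```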


Frequency and Mode

• The frequency of an attribute value is the percentage of time the value occurs in the data set
  – For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data

Percentiles
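Frequency and mode of a categorical attribute can be computed with `collections.Counter` (the eye-color data below is made up):

```python
# Frequency and mode of a categorical attribute.
from collections import Counter

eye_color = ["brown", "blue", "brown", "green", "brown", "blue"]
counts = Counter(eye_color)

mode, mode_count = counts.most_common(1)[0]
frequency = mode_count / len(eye_color)  # fraction of records with the mode

print(mode)       # brown
print(frequency)  # 0.5
```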

Measures of Location: Mean and Median

• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.

Arithmetic Mean

• The mean of a set of N values x1, x2, ..., xN is their sum divided by N:
  mean = (x1 + x2 + ... + xN) / N
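The outlier sensitivity of the mean, and the robustness of the median and trimmed mean, can be seen on a small made-up data set:

```python
# The mean is sensitive to outliers; the median and trimmed mean are not.
from statistics import mean, median

values = [3, 4, 5, 5, 6, 7, 100]  # 100 is an outlier

def trimmed_mean(xs, trim=1):
    """Mean after dropping the `trim` smallest and largest values."""
    xs = sorted(xs)[trim:len(xs) - trim]
    return mean(xs)

print(mean(values))          # ~18.57 -- dragged up by the outlier
print(median(values))        # 5
print(trimmed_mean(values))  # 5.4
```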



Median

• If N is odd, then the median is the middle value of the ordered set.
• If N is even, then the median is not unique; it is the two middlemost values and any value in between.
  – If X is a numeric attribute, by convention the median is taken as the average of the two middlemost values.
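Python's `statistics.median` follows exactly this convention, which can be checked on small odd- and even-sized sets:

```python
# For odd N the median is the middle value; for even N it is, by
# convention, the average of the two middlemost values.
from statistics import median

odd_set = [7, 1, 3, 5, 9]   # N = 5, ordered: 1 3 5 7 9
even_set = [7, 1, 3, 5]     # N = 4, ordered: 1 3 5 7

print(median(odd_set))   # 5   -- the middle value
print(median(even_set))  # 4.0 -- average of 3 and 5
```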

Measures of Spread: Range and Variance

• The range is the difference between the largest and smallest values:
  range = max(x) - min(x)

Variance and Standard Deviation

• The variance is the average squared deviation from the mean:
  s^2 = (1/N) * sum of (xi - mean)^2
• The standard deviation s is the square root of the variance and is measured in the same units as the data.
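These spread measures can be checked with the `statistics` module (the data set below is made up):

```python
# Range, population variance, and standard deviation of a small data set.
from statistics import pvariance, pstdev

x = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(x) - min(x)
var = pvariance(x)   # (1/N) * sum of squared deviations from the mean
std = pstdev(x)      # square root of the variance

print(data_range, var, std)
```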



Types of data sets

• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data

Examples of data quality problems

• Noise: refers to modification of original values
• Outliers: data objects that are considerably different from most of the other data objects in the data set
• Missing values
  – Reasons for missing values:
     Information is not collected (e.g., people decline to give their age and weight)
     Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
  – Handling missing values:
     Eliminate data objects
     Estimate missing values
     Ignore the missing value during analysis
     Replace with all possible values (weighted by their probabilities)
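The first two handling strategies can be sketched on made-up records, with `None` marking a missing value:

```python
# Two strategies for missing values: eliminate the data object, or
# estimate (impute) the value with the mean of the observed values.
records = [
    {"age": 25, "weight": 70},
    {"age": None, "weight": 80},   # person declined to give their age
    {"age": 40, "weight": None},
]

# Strategy 1: eliminate data objects with any missing value
complete = [r for r in records if None not in r.values()]

# Strategy 2: estimate the missing age with the mean of observed ages
ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(ages) / len(ages)
imputed = [dict(r, age=r["age"] if r["age"] is not None else mean_age)
           for r in records]

print(len(complete))      # 1
print(imputed[1]["age"])  # 32.5
```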

Why Data Preprocessing?

• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“ ”
  – noisy: containing errors or outliers
    • e.g., Salary=“-10”
  – inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42”, Birthday=“03/07/1997”
    • e.g., rating was “1, 2, 3”, now rating is “A, B, C”
    • e.g., discrepancy between duplicate records
  – redundant: including everything, some of which is irrelevant to our task

Why Preprocessing? Data Can Be Incomplete!

• Attributes of interest are not available (e.g., customer information for sales transaction data)
• Data were not considered important at the time of the transactions, so they were not recorded.
• Data were not recorded because of misunderstanding or malfunctions.
• Data may have been recorded and later deleted.
• Missing/unknown values for some data


Fingerprint Recognition Case

• Fingerprint identification at the gym. How?

Feature Extraction in Fingerprint Recognition

• “It is not the points, but what is in between the points that matters.” (Edward German)
• Identifying/extracting a good feature set is the most challenging part of data mining.
• Example feature vector: 10.2, 0.23, 0.34, 0.34, 20, ...

Forms of Data Preprocessing

• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Why Data Preprocessing?

• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – DM needs consistent integration of quality data


Forms of Data Preprocessing

• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Exploratory data analysis

What is Data Exploration?

• A preliminary exploration of the data to better understand its characteristics.
• Key motivations include
  – Helping to select the right tool for preprocessing or analysis
  – Making use of humans' abilities to recognize patterns
    • People can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)
  – Created by statistician John Tukey
  – Seminal book is Exploratory Data Analysis by Tukey
  – A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook

Exploratory Data Analysis Techniques

 Summary statistics
 Visualization
 Feature selection (big topic)
 Dimension reduction (big topic)

Aggregation

• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  – Data reduction: reduce the number of attributes or objects
  – Change of scale: cities aggregated into regions, states, countries, etc.
  – More “stable” data: aggregated data tends to have less variability
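A minimal aggregation sketch, collapsing made-up city-level records into region-level totals (a change of scale from city to region):

```python
# Aggregation: group city-level sales records into region-level totals.
from collections import defaultdict

sales = [
    ("Addis Ababa", "East", 120),
    ("Adama", "East", 80),
    ("Bahir Dar", "North", 60),
    ("Gondar", "North", 90),
]

by_region = defaultdict(int)
for city, region, amount in sales:
    by_region[region] += amount   # change of scale: city -> region

print(dict(by_region))  # {'East': 200, 'North': 150}
```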


Sampling

• Sampling is the main technique employed for data selection.
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
• The key principle for effective sampling is the following:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative.
  – A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling

• Simple random sampling
  – There is an equal probability of selecting any particular item
• Sampling without replacement
  – As each item is selected, it is removed from the population
• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample.
  – The same object can be picked more than once
• Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition
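The sampling types above map directly onto Python's `random` module; the two-stratum population here is made up:

```python
# Sampling without replacement (random.sample), with replacement
# (random.choices), and a basic stratified sample.
import random

random.seed(0)
population = list(range(100))

without_repl = random.sample(population, 10)   # no duplicates possible
with_repl = random.choices(population, k=10)   # duplicates possible

# Stratified sampling: partition by a label, then sample each stratum.
strata = {"a": list(range(50)), "b": list(range(50, 100))}
stratified = [x for items in strata.values() for x in random.sample(items, 5)]

print(len(set(without_repl)))  # 10 -- all distinct
print(len(stratified))         # 10 -- 5 from each stratum
```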

Dimensionality Reduction: Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Purpose of dimensionality reduction:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise
• Techniques of dimensionality reduction:
  – Principal Component Analysis (PCA)
  – Singular Value Decomposition (SVD)
  – Others: supervised and non-linear techniques
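A minimal PCA sketch via NumPy's SVD, assuming NumPy is available (the random data here is made up; the slides name the technique but give no code):

```python
# PCA via SVD: project mean-centered data onto its top principal components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 objects, 5 attributes

Xc = X - X.mean(axis=0)                # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                  # keep the top-2 components
X_reduced = Xc @ Vt[:k].T              # 100 objects, now 2 attributes

print(X.shape, "->", X_reduced.shape)  # (100, 5) -> (100, 2)
```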


Feature Subset Selection

• Another way to reduce the dimensionality of data
• Redundant features
  – Duplicate much or all of the information contained in one or more other attributes
  – Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
  – Contain no information that is useful for the data mining task at hand
  – Example: students' ID is often irrelevant to the task of predicting students' GPA

Feature Selection and Correlation Matrix
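Redundancy shows up as high correlation: in the (hypothetical) example below, sales tax is a fixed fraction of price, so the two attributes are perfectly correlated and one of them carries no extra information:

```python
# Pearson correlation between a feature and a redundant copy of it.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

price = [10.0, 25.0, 40.0, 55.0]
tax = [p * 0.15 for p in price]   # 15% sales tax, fully determined by price

print(pearson(price, tax))  # ~1.0 -- one of the two features is redundant
```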

Feature Subset Selection: Techniques

 Brute-force approach:
   Try all possible feature subsets as input to the data mining algorithm
 Embedded approaches:
   Feature selection occurs naturally as part of the data mining algorithm
 Filter approaches:
   Features are selected before the data mining algorithm is run
 Wrapper approaches:
   Use the data mining algorithm as a black box to find the best subset of attributes

Feature Creation

• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
• Three general methodologies:
  1. Feature extraction (domain-specific)
  2. Mapping data to a new space
  3. Feature construction (combining features)
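The brute-force/wrapper idea can be sketched in a few lines: enumerate every subset and keep the one the algorithm scores best. The scoring function below is a made-up stand-in; a real wrapper would train and evaluate a model on each subset:

```python
# Brute-force feature subset selection: score every subset, keep the best.
# Feasible only for small feature counts (2^n subsets).
from itertools import combinations

features = ["age", "income", "zip", "id"]

def score(subset):
    # Stand-in for running the data mining algorithm on this subset;
    # the penalty term discourages needlessly large subsets.
    useful = {"age": 0.4, "income": 0.5, "zip": 0.1, "id": 0.0}
    return sum(useful[f] for f in subset) - 0.05 * len(subset)

best = max(
    (tuple(s) for r in range(1, len(features) + 1)
     for s in combinations(features, r)),
    key=score,
)
print(best)  # ('age', 'income', 'zip')
```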


DM Assignment-I

• Compare and contrast DM and RDBMS:
  – Describe the basic differences and similarities;
  – Describe the pros and cons (merits and demerits).
• A summarized report of about two pages (font: Times New Roman 12, 1.5 line spacing) should be submitted on May 28, 2020. Use aastukk@gmail.com to submit your assignment before the due date.
