Lect2 - Data Preprocessing

5/26/2020

Data Preprocessing

What is Data?

• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – An attribute is also known as a field or feature
  – Examples: eye color or age of a person
• A collection of attributes describes an object, entity, or instance


Attribute Values

• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  – The same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    Example: attribute values for ID and age are integers
    But the properties of the attribute values can differ: ID has no limit, while age has a maximum and minimum value

Types of Attributes

• There are different types of attributes
  – Nominal: categories, states
    Examples: ID numbers, eye color, zip codes
  – Binary: a nominal attribute with only 2 states (0 or 1)
    Example: gender
  – Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
    Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
  – Interval: measured on a scale of equal-sized units; values have order
    Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio: we can speak of values as being an order of magnitude larger than the unit of measurement
    Examples: temperature in Kelvin, length, time, counts
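The interval/ratio distinction can be illustrated in a few lines of Python: a statement like "twice as hot" is only meaningful on a ratio scale such as Kelvin, not on an interval scale such as Celsius (the temperatures below are made up for illustration):

```python
# Celsius is an interval scale (arbitrary zero); Kelvin is a ratio scale.
c1, c2 = 10.0, 20.0
k1, k2 = c1 + 273.15, c2 + 273.15

print(c2 / c1)  # 2.0 -- but 20 C is not "twice as hot" as 10 C
print(k2 / k1)  # ~1.035 -- the physically meaningful ratio
```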



Discrete and Continuous Attributes


• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.


Basic Statistical Descriptions of Data

• Motivation
  – To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  – median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  – Data dispersion: analyzed with multiple granularities of precision
• Dispersion analysis on computed measures
  – Folding measures into numerical dimensions

Summary Statistics: Measuring the Central Tendency

• Summary statistics are numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
  – Examples: location - mean; spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data
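The single-pass claim can be sketched with Welford's online algorithm (a standard method not named in the slides), which updates the mean and variance one value at a time without storing the data:

```python
# Welford's online algorithm: mean and variance in a single pass.
def running_mean_variance(values):
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # accumulates sum of squared deviations
    variance = m2 / n if n > 0 else 0.0  # population variance
    return mean, variance

mean, var = running_mean_variance([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, var)  # 5.0 4.0
```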


Frequency and Mode

• The frequency of an attribute value is the percentage of time the value occurs in the data set
  – For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data

Percentiles
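Frequency and mode of a categorical attribute can be computed with `collections.Counter` (the eye-color data below is made up):

```python
# Frequency and mode of a categorical attribute.
from collections import Counter

eye_color = ["brown", "blue", "brown", "green", "brown", "blue"]
counts = Counter(eye_color)

mode, mode_count = counts.most_common(1)[0]
frequency = mode_count / len(eye_color)  # fraction of records with the mode

print(mode)       # brown
print(frequency)  # 0.5
```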

Measures of Location: Mean and Median

• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.

Arithmetic Mean

• The mean of a set of N values x1, x2, ..., xN is their sum divided by N:
  mean = (x1 + x2 + ... + xN) / N
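The outlier sensitivity of the mean, and the robustness of the median and trimmed mean, can be seen on a small made-up data set:

```python
# The mean is sensitive to outliers; the median and trimmed mean are not.
from statistics import mean, median

values = [3, 4, 5, 5, 6, 7, 100]  # 100 is an outlier

def trimmed_mean(xs, trim=1):
    """Mean after dropping the `trim` smallest and largest values."""
    xs = sorted(xs)[trim:len(xs) - trim]
    return mean(xs)

print(mean(values))          # ~18.57 -- dragged up by the outlier
print(median(values))        # 5
print(trimmed_mean(values))  # 5.4
```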



Median

• If N is odd, then the median is the middle value of the ordered set.
• If N is even, then the median is not unique; it is the two middlemost values and any value in between.
  – If X is a numeric attribute, by convention the median is taken as the average of the two middlemost values.
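Python's `statistics.median` follows exactly this convention, which can be checked on small odd- and even-sized sets:

```python
# For odd N the median is the middle value; for even N it is, by
# convention, the average of the two middlemost values.
from statistics import median

odd_set = [7, 1, 3, 5, 9]   # N = 5, ordered: 1 3 5 7 9
even_set = [7, 1, 3, 5]     # N = 4, ordered: 1 3 5 7

print(median(odd_set))   # 5   -- the middle value
print(median(even_set))  # 4.0 -- average of 3 and 5
```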

Measures of Spread: Range and Variance

• The range is the difference between the largest and smallest values:
  range = max(x) - min(x)

Variance and Standard Deviation

• The variance is the average squared deviation from the mean:
  s^2 = (1/N) * sum of (xi - mean)^2
• The standard deviation s is the square root of the variance and is measured in the same units as the data.
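These spread measures can be checked with the `statistics` module (the data set below is made up):

```python
# Range, population variance, and standard deviation of a small data set.
from statistics import pvariance, pstdev

x = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(x) - min(x)
var = pvariance(x)   # (1/N) * sum of squared deviations from the mean
std = pstdev(x)      # square root of the variance

print(data_range, var, std)
```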



Types of data sets

• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data

Examples of data quality problems

• Noise: refers to modification of original values
• Outliers: data objects that are considerably different from most of the other data objects in the data set
• Missing values
  – Reasons for missing values:
     Information is not collected (e.g., people decline to give their age and weight)
     Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
  – Handling missing values:
     Eliminate data objects
     Estimate missing values
     Ignore the missing value during analysis
     Replace with all possible values (weighted by their probabilities)
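The first two handling strategies can be sketched on made-up records, with `None` marking a missing value:

```python
# Two strategies for missing values: eliminate the data object, or
# estimate (impute) the value with the mean of the observed values.
records = [
    {"age": 25, "weight": 70},
    {"age": None, "weight": 80},   # person declined to give their age
    {"age": 40, "weight": None},
]

# Strategy 1: eliminate data objects with any missing value
complete = [r for r in records if None not in r.values()]

# Strategy 2: estimate the missing age with the mean of observed ages
ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(ages) / len(ages)
imputed = [dict(r, age=r["age"] if r["age"] is not None else mean_age)
           for r in records]

print(len(complete))      # 1
print(imputed[1]["age"])  # 32.5
```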

Why Data Preprocessing?

• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“ ”
  – noisy: containing errors or outliers
    • e.g., Salary=“-10”
  – inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42”, Birthday=“03/07/1997”
    • e.g., rating was “1, 2, 3”, now rating is “A, B, C”
    • e.g., discrepancy between duplicate records
  – redundant: including everything, some of which is irrelevant to our task

Why Preprocessing? Data Can Be Incomplete!

• Attributes of interest are not available (e.g., customer information for sales transaction data)
• Data were not considered important at the time of the transactions, so they were not recorded.
• Data were not recorded because of misunderstanding or malfunctions.
• Data may have been recorded and later deleted.
• Missing/unknown values for some data


Fingerprint Recognition Case

• Fingerprint identification at the gym. How?

Feature Extraction in Fingerprint Recognition

• “It is not the points, but what is in between the points that matters.” (Edward German)
• Identifying/extracting a good feature set is the most challenging part of data mining.
• Example feature vector: 10.2, 0.23, 0.34, 0.34, 20, ...

Forms of Data Preprocessing

• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Why Data Preprocessing?

• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – DM needs consistent integration of quality data


Forms of Data Preprocessing

• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Exploratory data analysis

What is Data Exploration?

• A preliminary exploration of the data to better understand its characteristics.
• Key motivations include
  – Helping to select the right tool for preprocessing or analysis
  – Making use of humans' abilities to recognize patterns
    • People can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)
  – Created by statistician John Tukey
  – Seminal book is Exploratory Data Analysis by Tukey
  – A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook

Exploratory Data Analysis Techniques

 Summary statistics
 Visualization
 Feature selection (big topic)
 Dimension reduction (big topic)

Aggregation

• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  – Data reduction: reduce the number of attributes or objects
  – Change of scale: cities aggregated into regions, states, countries, etc.
  – More “stable” data: aggregated data tends to have less variability
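A minimal aggregation sketch, collapsing made-up city-level records into region-level totals (a change of scale from city to region):

```python
# Aggregation: group city-level sales records into region-level totals.
from collections import defaultdict

sales = [
    ("Addis Ababa", "East", 120),
    ("Adama", "East", 80),
    ("Bahir Dar", "North", 60),
    ("Gondar", "North", 90),
]

by_region = defaultdict(int)
for city, region, amount in sales:
    by_region[region] += amount   # change of scale: city -> region

print(dict(by_region))  # {'East': 200, 'North': 150}
```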


Sampling

• Sampling is the main technique employed for data selection.
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
• The key principle for effective sampling is the following:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative.
  – A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling

• Simple random sampling
  – There is an equal probability of selecting any particular item
• Sampling without replacement
  – As each item is selected, it is removed from the population
• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample.
  – The same object can be picked more than once
• Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition
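The sampling types above map directly onto Python's `random` module; the two-stratum population here is made up:

```python
# Sampling without replacement (random.sample), with replacement
# (random.choices), and a basic stratified sample.
import random

random.seed(0)
population = list(range(100))

without_repl = random.sample(population, 10)   # no duplicates possible
with_repl = random.choices(population, k=10)   # duplicates possible

# Stratified sampling: partition by a label, then sample each stratum.
strata = {"a": list(range(50)), "b": list(range(50, 100))}
stratified = [x for items in strata.values() for x in random.sample(items, 5)]

print(len(set(without_repl)))  # 10 -- all distinct
print(len(stratified))         # 10 -- 5 from each stratum
```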

Dimensionality Reduction: Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Purpose of dimensionality reduction:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise
• Techniques of dimensionality reduction:
  – Principal Component Analysis (PCA)
  – Singular Value Decomposition (SVD)
  – Others: supervised and non-linear techniques
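A minimal PCA sketch via NumPy's SVD, assuming NumPy is available (the random data here is made up; the slides name the technique but give no code):

```python
# PCA via SVD: project mean-centered data onto its top principal components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 objects, 5 attributes

Xc = X - X.mean(axis=0)                # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                  # keep the top-2 components
X_reduced = Xc @ Vt[:k].T              # 100 objects, now 2 attributes

print(X.shape, "->", X_reduced.shape)  # (100, 5) -> (100, 2)
```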


Feature Subset Selection

• Another way to reduce the dimensionality of data
• Redundant features
  – Duplicate much or all of the information contained in one or more other attributes
  – Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
  – Contain no information that is useful for the data mining task at hand
  – Example: students' ID is often irrelevant to the task of predicting students' GPA

Feature Selection and Correlation Matrix
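Redundancy shows up as high correlation: in the (hypothetical) example below, sales tax is a fixed fraction of price, so the two attributes are perfectly correlated and one of them carries no extra information:

```python
# Pearson correlation between a feature and a redundant copy of it.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

price = [10.0, 25.0, 40.0, 55.0]
tax = [p * 0.15 for p in price]   # 15% sales tax, fully determined by price

print(pearson(price, tax))  # ~1.0 -- one of the two features is redundant
```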

Feature Subset Selection: Techniques

 Brute-force approach:
   Try all possible feature subsets as input to the data mining algorithm
 Embedded approaches:
   Feature selection occurs naturally as part of the data mining algorithm
 Filter approaches:
   Features are selected before the data mining algorithm is run
 Wrapper approaches:
   Use the data mining algorithm as a black box to find the best subset of attributes

Feature Creation

• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
• Three general methodologies:
  1. Feature extraction (domain-specific)
  2. Mapping data to a new space
  3. Feature construction (combining features)
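The brute-force/wrapper idea can be sketched in a few lines: enumerate every subset and keep the one the algorithm scores best. The scoring function below is a made-up stand-in; a real wrapper would train and evaluate a model on each subset:

```python
# Brute-force feature subset selection: score every subset, keep the best.
# Feasible only for small feature counts (2^n subsets).
from itertools import combinations

features = ["age", "income", "zip", "id"]

def score(subset):
    # Stand-in for running the data mining algorithm on this subset;
    # the penalty term discourages needlessly large subsets.
    useful = {"age": 0.4, "income": 0.5, "zip": 0.1, "id": 0.0}
    return sum(useful[f] for f in subset) - 0.05 * len(subset)

best = max(
    (tuple(s) for r in range(1, len(features) + 1)
     for s in combinations(features, r)),
    key=score,
)
print(best)  # ('age', 'income', 'zip')
```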


DM Assignment-I

• Compare and contrast DM and RDBMS:
  – Describe the basic differences and similarities;
  – Describe the pros and cons (merits and demerits).
• A summarized report of about two pages (font: Times New Roman 12, 1.5 line spacing) should be submitted on May 28, 2020. Use aastukk@gmail.com to submit your assignment before the due date.
