Class-4-Data Preprocessing
Yashvardhan Sharma
30-Jan-24 CS F415 3
Data Mining and Data Warehousing
• Data Warehouse: a centralized data repository which can be queried
for business benefit.
• Data Warehousing makes it possible to
• extract archived operational data
• overcome inconsistencies between different legacy data formats
• integrate data throughout an enterprise, regardless of location, format, or
communication requirements
• incorporate additional or expert information
• OLAP: On-line Analytical Processing
• Multi-Dimensional Data Model (Data Cube)
• Operations:
• Roll-up
• Drill-down
• Slice and dice
• Rotate
An OLAM Architecture
[Figure: layered OLAM architecture]
• Layer 4: User interface (user GUI API); mining queries go in, mining results come out
• Layer 3: OLAP/OLAM (OLAM engine and OLAP engine)
• Layer 2: MDDB (with metadata)
Example of DBMS, OLAP and Data Mining: Weather Data
DBMS:
Day outlook temperature humidity windy play
1 sunny 85 85 false no
2 sunny 80 90 true no
3 overcast 83 86 false yes
4 rainy 70 96 false yes
5 rainy 68 80 false yes
6 rainy 65 70 true no
7 overcast 64 65 true yes
8 sunny 72 95 false no
9 sunny 69 70 false yes
10 rainy 75 80 false yes
11 sunny 75 70 true yes
12 overcast 72 90 true yes
13 overcast 81 75 false yes
Example of DBMS, OLAP and Data Mining: Weather Data
OLAP:
• Using OLAP we can create a Multidimensional Model of our data (Data
Cube).
• For example, using the dimensions time, outlook and play, we can create
the following model.
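As a rough sketch of what such a model holds, the snippet below counts days per (outlook, play) cell from the weather table above. Rolling up the time dimension and slicing on one outlook value are shown as plain functions; the names `FACTS`, `roll_up_time` and `slice_outlook` are illustrative assumptions, not part of any OLAP product.

```python
from collections import Counter

# (day, outlook, play) facts copied from the weather table above.
FACTS = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"),
]


def roll_up_time(facts):
    """Roll-up: aggregate away the day, counting days per (outlook, play)."""
    return Counter((outlook, play) for _day, outlook, play in facts)


def slice_outlook(facts, value):
    """Slice: keep only the facts whose outlook equals `value`."""
    return [fact for fact in facts if fact[1] == value]


cube = roll_up_time(FACTS)
sunny_days = [day for day, _outlook, _play in slice_outlook(FACTS, "sunny")]

for cell, count in sorted(cube.items()):
    print(cell, count)
```

A real OLAP engine precomputes and indexes such aggregates; the Counter here only illustrates the idea of cells addressed by dimension values.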
Example of DBMS, OLAP and Data Mining: Weather Data
Data Mining:
• Using the ID3 algorithm we can produce the following
decision tree:
• outlook = sunny
• humidity = high: no
• humidity = normal: yes
• outlook = overcast: yes
• outlook = rainy
• windy = true: no
• windy = false: yes
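ID3 chooses the attribute to split on by information gain. The sketch below computes that gain for outlook and windy on the 13 rows listed in the weather table; it is only the attribute-selection step, not a full ID3 implementation, and the names `ROWS`, `entropy` and `information_gain` are illustrative.

```python
from collections import Counter
from math import log2

# (outlook, windy, play) for days 1-13 of the weather table.
ROWS = [
    ("sunny", False, "no"), ("sunny", True, "no"), ("overcast", False, "yes"),
    ("rainy", False, "yes"), ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"), ("sunny", False, "yes"),
    ("rainy", False, "yes"), ("sunny", True, "yes"), ("overcast", True, "yes"),
    ("overcast", False, "yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index, label_index=-1):
    """Entropy of the labels minus the weighted entropy after splitting."""
    labels = [row[label_index] for row in rows]
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row[label_index] for row in rows if row[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gain_outlook = information_gain(ROWS, 0)
gain_windy = information_gain(ROWS, 1)
print(f"gain(outlook)={gain_outlook:.3f}  gain(windy)={gain_windy:.3f}")
```

On these rows outlook has the larger gain, which is consistent with outlook appearing at the root of the tree.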
Major Issues in Data Warehousing and Mining
• Mining methodology and user interaction
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad-hoc data mining
• Expression and visualization of data mining results
• Handling noise and incomplete data
• Pattern evaluation: the interestingness problem
• Performance and scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining methods
Major Issues in Data Warehousing and Mining
• Issues relating to the diversity of data types
• Handling relational and complex types of data
• Mining information from heterogeneous databases and global information
systems (WWW)
• Issues related to applications and social impacts
• Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
• Integration of the discovered knowledge with existing knowledge: A knowledge
fusion problem
• Protection of data security, integrity, and privacy
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
• Required for both OLAP and Data Mining!
Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer information for
sales transaction data)
• Data were not considered important at the time of transactions, so they
were not recorded!
• Data not recorded because of misunderstanding or malfunctions
• Data may have been recorded and later deleted!
• Missing/unknown values for some data
Why can Data be Noisy/Inconsistent?
• Faulty instruments for data collection
• Human or computer errors
• Errors in data transmission
• Technology limitations (e.g., sensor data come at a faster rate
than they can be processed)
• Inconsistencies in naming conventions or data codes (e.g.,
2/5/2018 could be 2 May 2018 or 5 Feb 2018)
• Duplicate tuples (e.g., records that were received twice), which
should be removed
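The date example above can be reproduced with Python's standard `datetime` module: the same string parses to two different dates depending on which convention is assumed. This is a minimal illustration; the variable names are arbitrary.

```python
from datetime import datetime

raw = "2/5/2018"

# Day-first convention: 2 May 2018.
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

# Month-first convention: 5 February 2018.
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()

print(as_day_first.isoformat(), as_month_first.isoformat())
```

When integrating sources, the convention has to come from metadata about each source, since the string alone cannot settle it.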
What is Data?
• Collection of data objects and their attributes
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute
Measurement of Length
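• The way you measure an attribute may not match the attribute's
properties.
[Figure: the same five lengths mapped to two measurement scales, one with values 5, 7, 8, 10, 15 and one with values 1, 2, 3, 4, 5; the labels A and B appear in the figure.]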
Properties of Attribute Values
• The type of an attribute depends on which of the following
properties it possesses:
• Distinctness: = ≠
• Order: < >
• Addition: + -
• Multiplication: * /
Types of Attributes
• There are different types of attributes
• Nominal
• Examples: ID numbers, eye color, zip codes
• Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height in {tall, medium, short}
• Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio
• Examples: temperature in Kelvin, length, time, counts
Attribute Types: Description, Examples, Operations
Nominal
• Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
• Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
• Operations: mode, entropy, contingency correlation, χ² test
Ordinal
• Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
• Examples: hardness of minerals, {good, better, best}, grades, street numbers
• Operations: median, percentiles, rank correlation, run tests, sign tests
Interval
• Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
• Examples: calendar dates, temperature in Celsius or Fahrenheit
• Operations: mean, standard deviation, Pearson's correlation, t and F tests
Ratio
• Description: For ratio variables, both differences and ratios are meaningful. (*, /)
• Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
• Operations: geometric mean, harmonic mean, percent variation
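As a small illustration of the operations listed for each attribute type, the sketch below applies one representative statistic per type using Python's `statistics` module. The sample values are made up for the example.

```python
from statistics import geometric_mean, mean, median, mode

# Nominal: only equality is meaningful, so the mode is a valid summary.
eye_color = ["brown", "blue", "brown", "green"]
most_common_color = mode(eye_color)

# Ordinal: order is meaningful, so the median is valid (1=good, 2=better, 3=best).
quality = [1, 3, 2, 3, 2]
median_quality = median(quality)

# Interval: differences are meaningful, so the arithmetic mean is valid.
temps_celsius = [10.0, 20.0, 30.0]
mean_temp = mean(temps_celsius)

# Ratio: ratios are meaningful, so the geometric mean is valid.
lengths_m = [1.0, 4.0, 16.0]
geo_mean_length = geometric_mean(lengths_m)

print(most_common_color, median_quality, mean_temp, geo_mean_length)
```

Each statistic also remains valid for the types further down the list: a ratio attribute supports the mode, median and mean as well as the geometric mean.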
Attribute Level: Transformation, Comments
Ordinal
• Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
• Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval
• Transformation: new_value = a * old_value + b, where a and b are constants.
• Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
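The interval transformation new_value = a * old_value + b can be checked numerically with the Celsius-to-Fahrenheit conversion (a = 1.8, b = 32). The sketch below shows that differences scale by the constant a while ratios are not preserved, which is why ratios of Celsius or Fahrenheit temperatures are not meaningful. The two sample temperatures are arbitrary.

```python
def celsius_to_fahrenheit(celsius):
    """Interval-scale transformation: new_value = a * old_value + b."""
    return 1.8 * celsius + 32.0


c_low, c_high = 10.0, 20.0
f_low, f_high = celsius_to_fahrenheit(c_low), celsius_to_fahrenheit(c_high)

# Differences are preserved up to the constant factor a = 1.8.
difference_factor = (f_high - f_low) / (c_high - c_low)

# Ratios are not preserved: 20 C is not "twice as hot" as 10 C.
ratio_celsius = c_high / c_low
ratio_fahrenheit = f_high / f_low

print(difference_factor, ratio_celsius, ratio_fahrenheit)
```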
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and represented using a finite
number of digits.
• Continuous attributes are typically represented as floating-point variables.
Important Characteristics of Structured Data
• Dimensionality
• Curse of Dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
Types of data sets
• Record
• Data Matrix
• Document Data
• Transaction Data
• Graph
• World Wide Web
• Molecular Structures
• Ordered
• Spatial Data
• Temporal Data
• Sequential Data
• Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which
consists of a fixed set of attributes
[Table with columns: Tid, Refund, Marital Status, Taxable Income, Cheat]
Transaction Data
• A special type of record data, where
• each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased are
the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
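A minimal sketch of how the transaction table above might be held in memory: each transaction is a set of items keyed by its TID. The `support` helper, which returns the fraction of transactions containing a given itemset, is an illustrative addition rather than something defined on the slide.

```python
from collections import Counter

# Transactions copied from the table above: TID -> set of items.
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions in which each individual item occurs.
item_counts = Counter(item for items in TRANSACTIONS.values() for item in items)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for items in TRANSACTIONS.values() if wanted <= items)
    return hits / len(TRANSACTIONS)


print(item_counts["Milk"], support({"Diaper", "Milk"}))
```

Sets fit here because a transaction records which items were bought, not their order.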
Data Matrix
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a
multi-dimensional space, where each dimension represents a
distinct attribute
Document – term matrix
• Each document becomes a ‘term’ vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
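The construction described above, one count per vocabulary term, can be sketched in a few lines. The vocabulary and the sample text are made up for illustration, and tokenization is simply lowercasing and splitting on whitespace, which is an assumption rather than something the slide specifies.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return the number of occurrences of each vocabulary term in `text`."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
document = "Team play game team score play play"

vector = term_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Stacking one such vector per document gives the rows of a document-term matrix.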
Graph Data
• Examples: Generic graph and HTML Links
[Figure: generic graph with numbered nodes]
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Major Tasks in Data Preprocessing
outliers=exceptions!
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
Forms of data preprocessing
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
Data Cleaning
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”—
Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• data inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered
• Missing data may need to be inferred.
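One simple way to infer a missing numeric value is to replace it with the mean of the observed values of that attribute. The sketch below assumes missing entries are encoded as `None`; mean imputation is only one possible choice, picked here for illustration, and the income figures are made up.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical customer incomes with two missing entries.
incomes = [40_000, None, 55_000, None, 61_000]
filled = impute_mean(incomes)
print(filled)
```

Mean imputation keeps the attribute's mean unchanged but shrinks its variance, so it is best treated as a baseline.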
Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems that require data cleaning
• duplicate records
• incomplete data
• inconsistent data
Noise
• Noise refers to modification of original values
• Example: distortion of a person’s voice when talking on a poor phone connection
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
• Divides the range into N intervals of equal size: uniform grid
• If A and B are the lowest and highest values of the attribute, the width of
the intervals will be: W = (B - A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
• Divides the range into N intervals, each containing approximately same
number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
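The two partitioning schemes above can be sketched as follows. Equal-width uses W = (B - A)/N and places the maximum value in the last bin; equal-depth sorts the values and cuts them into N runs of nearly equal length. The function names and the sample price list are illustrative assumptions.

```python
def equal_width_bins(values, n):
    """Split values into n intervals of width W = (B - A) / n."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("equal-width binning needs at least two distinct values")
    width = (high - low) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value would land in bin n, so clamp it into the last bin.
        index = min(int((v - low) / width), n - 1)
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n):
    """Split sorted values into n bins holding (almost) the same number of items."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
```

On this list the equal-width bins hold 3, 3 and 6 values, while the equal-depth bins hold 4 each, which shows how uneven data affects the first scheme.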
Binning Methods for Data Smoothing
• Sorted data (e.g., by price)
• 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means (rounded to the nearest integer):
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
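The two smoothing rules can be sketched directly on the equi-depth bins above. Bin means are rounded to the nearest integer, and for boundary smoothing a value equally distant from both boundaries goes to the lower one; that tie rule is an assumption, since no tie occurs in this data.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an assumption; no ties occur here).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)
print(by_boundaries)
```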
Cluster Analysis
Regression
[Figure: data point (X1, Y1) and the fitted line y = x + 1; Y1’ is the value on the line at X1]
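Smoothing by regression replaces each observed value with the value predicted by a fitted function. The sketch below fits a straight line by ordinary least squares in plain Python and uses it to smooth the y values; the sample points are made up, so the fitted line is not the y = x + 1 of the figure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy observations scattered around a line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 2.9, 4.2, 4.8]

slope, intercept = fit_line(xs, ys)

# Smoothed values: each observed y is replaced by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept, smoothed)
```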
Outliers
• Outliers are data objects with characteristics that are
considerably different than most of the other data objects in
the data set
Duplicate Data
• Data set may include data objects that are duplicates, or
almost duplicates of one another
• Major issue when merging data from heterogeneous sources
• Examples:
• Same person with multiple email addresses
• Data cleaning
• Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation