Class-4-Data Preprocessing

This document provides an overview of the CS F415: Data Mining course. It discusses topics that will be covered such as data preprocessing, data mining versus database management systems, data warehousing and online analytical processing. Examples are given to illustrate the differences between DBMS, OLAP and data mining approaches. Major issues in data warehousing and mining are also outlined such as performance, data types, applications and privacy. The importance of data preprocessing is emphasized to obtain quality data and ensure quality mining results.

CS F415: Data Mining

Yashvardhan Sharma

30-Jan-24 CS F415


Today’s Outline
• Introduction - Overview
• Data Preprocessing
• Why preprocess the data?
• Types of Data Sets
• Data Quality
• Steps in Data Preprocessing



Data Mining vs. DBMS
• Example DBMS Reports
• Last month's sales for each service type
• Sales per service grouped by customer sex or age bracket
• List of customers who lapsed their policy

• Questions answered using Data Mining
• What characteristics do customers who lapse their policy have in
common, and how do they differ from customers who renew their
policy?
• Which motor insurance policyholders would be potential customers
for my House Content Insurance policy?

Data Mining and Data Warehousing
• Data Warehouse: a centralized data repository which can be queried
for business benefit.
• Data Warehousing makes it possible to
• extract archived operational data
• overcome inconsistencies between different legacy data formats
• integrate data throughout an enterprise, regardless of location, format, or
communication requirements
• incorporate additional or expert information
• OLAP: On-line Analytical Processing
• Multi-Dimensional Data Model (Data Cube)
• Operations:
• Roll-up
• Drill-down
• Slice and dice
• Rotate

An OLAM Architecture
[Figure: four-layer OLAM architecture]
• Layer 4 (User Interface): User GUI API; receives the mining query and returns the mining result
• Layer 3 (OLAP/OLAM): OLAM Engine and OLAP Engine, on top of the Data Cube API
• Layer 2 (MDDB): MDDB and Meta Data, fed through Filtering & Integration via the Database API
• Layer 1 (Data Repository): Databases and Data Warehouse, with data cleaning and data integration

DBMS, OLAP, and Data Mining

                  DBMS                      OLAP                         Data Mining
Task              Extraction of detailed    Summaries, trends and        Knowledge discovery of hidden
                  and summary data          forecasts                    patterns and insights
Type of result    Information               Analysis                     Insight and Prediction
Method            Deduction (Ask the        Multidimensional data        Induction (Build the model,
                  question, verify with     modeling, Aggregation,       apply it to new data, get
                  data)                     Statistics                   the result)
Example question  Who purchased mutual      What is the average income   Who will buy a mutual fund in
                  funds in the last 3       of mutual fund buyers by     the next 6 months and why?
                  years?                    region by year?

Example of DBMS, OLAP and Data Mining: Weather Data
DBMS:
Day  outlook   temperature  humidity  windy  play
1    sunny     85           85        false  no
2    sunny     80           90        true   no
3    overcast  83           86        false  yes
4    rainy     70           96        false  yes
5    rainy     68           80        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
10   rainy     75           80        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
14   rainy     71           91        true   no


Example of DBMS, OLAP and Data Mining: Weather Data
• By querying a DBMS containing the above table we may answer
questions like:
• What was the temperature on the sunny days?
{85, 80, 72, 69, 75}
• On which days was the humidity less than 75?
{6, 7, 9, 11}
• On which days was the temperature greater than 70?
{1, 2, 3, 8, 10, 11, 12, 13, 14}
• On which days was the temperature greater than 70 and the
humidity less than 75?
The intersection of the above two: {11}
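The queries above can be reproduced with a small sketch. It is only an illustration in plain Python, not how a DBMS would execute them; the tuple layout and the variable names are assumptions made for this example, while the data values come from the weather table.

```python
# Weather table from the slides.
# Assumed layout: (day, outlook, temperature, humidity, windy, play)
WEATHER = [
    (1, "sunny", 85, 85, False, "no"),
    (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"),
    (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),
    (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),
    (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),
    (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),
    (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy", 71, 91, True, "no"),
]

# Temperature on the sunny days (a selection followed by a projection).
sunny_temps = [t for _, o, t, _, _, _ in WEATHER if o == "sunny"]

# Days with humidity below 75, and days with temperature above 70.
low_humidity_days = {d for d, _, _, h, _, _ in WEATHER if h < 75}
warm_days = {d for d, _, t, _, _, _ in WEATHER if t > 70}

# Both conditions at once: the intersection of the two day sets.
warm_and_dry_days = warm_days & low_humidity_days

print(sunny_temps)
print(sorted(low_humidity_days))
print(sorted(warm_days))
print(sorted(warm_and_dry_days))
```

Each query only retrieves stored facts, which is the "deduction" style of question attributed to a DBMS in the comparison table.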

Example of DBMS, OLAP and Data Mining: Weather Data
OLAP:
• Using OLAP we can create a Multidimensional Model of our data (Data
Cube).
• For example, using the dimensions time, outlook and play we can create
the following model. Each cell gives the number of days with play = yes /
play = no; the top-left cell (9/5) is the overall total.

9/5     sunny  rainy  overcast
Week 1  0/2    2/1    2/0
Week 2  2/1    1/1    2/0
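As a rough sketch, the cube cells can be computed by grouping the weather records by week and outlook and counting play = yes and play = no. The assumption that days 1-7 form week 1 and days 8-14 form week 2, and the names `cube` and `roll_up_time`, are illustrative only.

```python
from collections import defaultdict

# (day, outlook, play) taken from the weather table.
DAYS = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]

# cube[(week, outlook)] = [yes_count, no_count]
cube = defaultdict(lambda: [0, 0])
for day, outlook, play in DAYS:
    week = (day - 1) // 7 + 1  # assumption: days 1-7 are week 1, days 8-14 week 2
    cube[(week, outlook)][0 if play == "yes" else 1] += 1

def roll_up_time(cells):
    """Roll up the time dimension: aggregate over weeks, keep outlook."""
    totals = defaultdict(lambda: [0, 0])
    for (_, outlook), (yes, no) in cells.items():
        totals[outlook][0] += yes
        totals[outlook][1] += no
    return dict(totals)

for week in (1, 2):
    row = ["/".join(map(str, cube[(week, o)])) for o in ("sunny", "rainy", "overcast")]
    print(f"Week {week}", *row)
print(roll_up_time(cube))
```

Rolling up over time collapses the two week rows into one summary per outlook, which is the kind of aggregation the Roll-up operation refers to.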

Example of DBMS, OLAP and Data Mining: Weather Data

Data Mining:
• Using the ID3 algorithm we can produce the following decision tree:

• outlook = sunny
  • humidity = high: no
  • humidity = normal: yes
• outlook = overcast: yes
• outlook = rainy
  • windy = true: no
  • windy = false: yes
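The tree above can be reproduced with a compact information-gain sketch of ID3. Two assumptions are not stated on the slides: humidity is binarized as "high" when it is above 75 and "normal" otherwise, and temperature is left out because ID3 in this simple form expects categorical attributes. The function names are illustrative.

```python
import math
from collections import Counter

# (outlook, humidity, windy, play) for days 1..14 of the weather table.
RAW = [
    ("sunny", 85, False, "no"), ("sunny", 90, True, "no"),
    ("overcast", 86, False, "yes"), ("rainy", 96, False, "yes"),
    ("rainy", 80, False, "yes"), ("rainy", 70, True, "no"),
    ("overcast", 65, True, "yes"), ("sunny", 95, False, "no"),
    ("sunny", 70, False, "yes"), ("rainy", 80, False, "yes"),
    ("sunny", 70, True, "yes"), ("overcast", 90, True, "yes"),
    ("overcast", 75, False, "yes"), ("rainy", 91, True, "no"),
]
# Assumed discretization: humidity > 75 is "high", otherwise "normal".
ROWS = [
    {"outlook": o, "humidity": "high" if h > 75 else "normal", "windy": w, "play": p}
    for o, h, w, p in RAW
]

def entropy(rows):
    """Shannon entropy of the class label (play) over the given rows."""
    counts = Counter(r["play"] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    """Information gain obtained by splitting the rows on one attribute."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attrs):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = {r["play"] for r in rows}
    if len(labels) == 1:
        return labels.pop()
    if not attrs:  # nothing left to split on: fall back to the majority class
        return Counter(r["play"] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest)
                   for v in {r[best] for r in rows}}}

tree = id3(ROWS, ["outlook", "humidity", "windy"])
print(tree)
```

With a different humidity threshold, or with temperature included, the induced tree could differ; the sketch only shows that the published tree is consistent with the table under the stated assumptions.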

Major Issues in Data Warehousing and Mining
• Mining methodology and user interaction
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad-hoc data mining
• Expression and visualization of data mining results
• Handling noise and incomplete data
• Pattern evaluation: the interestingness problem
• Performance and scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining methods

Major Issues in Data Warehousing and Mining
• Issues relating to the diversity of data types
• Handling relational and complex types of data
• Mining information from heterogeneous databases and global information
systems (WWW)
• Issues related to applications and social impacts
• Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
• Integration of the discovered knowledge with existing knowledge: A knowledge
fusion problem
• Protection of data security, integrity, and privacy

Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
• Required for both OLAP and Data Mining!

Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer information for
sales transaction data)
• Data were not considered important at the time of transactions, so they
were not recorded!
• Data not recorded because of misunderstanding or malfunctions
• Data may have been recorded and later deleted!
• Missing/unknown values for some data

Why can Data be Noisy/Inconsistent?
• Faulty instruments for data collection
• Human or computer errors
• Errors in data transmission
• Technology limitations (e.g., sensor data come at a faster rate
than they can be processed)
• Inconsistencies in naming conventions or data codes (e.g.,
2/5/2018 could be 2 May 2018 or 5 Feb 2018)
• Duplicate tuples (e.g., records that were received twice) should also be
removed
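The date example can be made concrete with a short sketch: the same string parses to two different dates depending on the convention assumed. The variable names are illustrative; `datetime.strptime` is the standard-library parser.

```python
from datetime import datetime

raw = "2/5/2018"

# Day-first convention: 2 May 2018.
day_first = datetime.strptime(raw, "%d/%m/%Y").date()

# Month-first convention: 5 Feb 2018.
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

print(day_first.isoformat())
print(month_first.isoformat())
print("ambiguous:", day_first != month_first)
```

Unless the source's convention is recorded, the two readings cannot be told apart from the value alone, which is why such inconsistencies have to be resolved during cleaning.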

What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
• Object is also known as record, point, case, sample, entity, or instance

In the table below, the columns are attributes and the rows are objects:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute

• Distinction between attributes and attribute values


• Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters

• Different attributes can be mapped to the same set of values


• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
• ID has no limit but age has a maximum and minimum value

Measurement of Length
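• The way you measure an attribute may not match the attribute's
properties.
[Figure: five line segments of increasing length measured on two scales, A and B; one scale assigns the values 5, 7, 8, 10, 15 and the other the values 1, 2, 3, 4, 5]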

Properties of Attribute Values
• The type of an attribute depends on which of the following
properties it possesses:
• Distinctness: = ≠
• Order: < >
• Addition: + -
• Multiplication: * /

• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties

Types of Attributes
• There are different types of attributes
• Nominal
• Examples: ID numbers, eye color, zip codes
• Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height in {tall, medium, short}
• Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio
• Examples: temperature in Kelvin, length, time, counts

Attribute Type  Description                                Examples                          Operations
Nominal         The values of a nominal attribute are      zip codes, employee ID numbers,   mode, entropy, contingency
                just different names, i.e., nominal        eye color, sex: {male, female}    correlation, χ² test
                attributes provide only enough
                information to distinguish one object
                from another. (=, ≠)
Ordinal         The values of an ordinal attribute         hardness of minerals,             median, percentiles, rank
                provide enough information to order        {good, better, best}, grades,     correlation, run tests,
                objects. (<, >)                            street numbers                    sign tests
Interval        For interval attributes, the differences   calendar dates, temperature in    mean, standard deviation,
                between values are meaningful, i.e., a     Celsius or Fahrenheit             Pearson's correlation,
                unit of measurement exists. (+, -)                                           t and F tests
Ratio           For ratio variables, both differences and  temperature in Kelvin, monetary   geometric mean, harmonic
                ratios are meaningful. (*, /)              quantities, counts, age, mass,    mean, percent variation
                                                           length, electrical current

Attribute Level  Transformation                               Comments
Nominal          Any permutation of values                    If all employee ID numbers were reassigned,
                                                              would it make any difference?
Ordinal          An order-preserving change of values, i.e.,  An attribute encompassing the notion of good,
                 new_value = f(old_value), where f is a       better, best can be represented equally well
                 monotonic function.                          by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval         new_value = a * old_value + b, where a and   Thus, the Fahrenheit and Celsius temperature
                 b are constants                              scales differ in terms of where their zero
                                                              value is and the size of a unit (degree).
Ratio            new_value = a * old_value                    Length can be measured in meters or feet.
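A small numeric sketch may help separate the interval and ratio rows. Under an interval transformation (Celsius to Fahrenheit) ratios of differences are preserved but ratios of values are not; under a ratio transformation (meters to feet) ratios of values are preserved. The sample temperatures and lengths are made up for illustration, and 3.28084 is an approximate conversion factor.

```python
import math

def celsius_to_fahrenheit(c):
    # Interval-scale transformation: new_value = a * old_value + b (a = 9/5, b = 32)
    return c * 9 / 5 + 32

def meters_to_feet(m):
    # Ratio-scale transformation: new_value = a * old_value (a is about 3.28084)
    return m * 3.28084

# Hypothetical sample temperatures in Celsius.
c_low, c_mid, c_high = 10, 20, 40
f_low, f_mid, f_high = (celsius_to_fahrenheit(c) for c in (c_low, c_mid, c_high))

# Ratios of values change under the interval transformation...
ratio_of_values_c = c_mid / c_low
ratio_of_values_f = f_mid / f_low

# ...but ratios of differences do not.
ratio_of_diffs_c = (c_high - c_mid) / (c_mid - c_low)
ratio_of_diffs_f = (f_high - f_mid) / (f_mid - f_low)

# For a ratio attribute such as length, ratios of values survive the change of unit.
length_ratio_m = 4 / 2
length_ratio_ft = meters_to_feet(4) / meters_to_feet(2)

print(ratio_of_values_c, ratio_of_values_f)
print(ratio_of_diffs_c, ratio_of_diffs_f)
print(length_ratio_m, length_ratio_ft)
```

This is why "20 degrees Celsius is twice as hot as 10 degrees Celsius" is not a meaningful statement, while "4 meters is twice as long as 2 meters" is.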
Discrete and Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and represented using a finite
number of digits.
• Continuous attributes are typically represented as floating-point variables.

Important Characteristics of Structured Data

• Dimensionality
• Curse of Dimensionality

• Sparsity
• Only presence counts

• Resolution
• Patterns depend on the scale

Types of data sets
• Record
  • Data Matrix
  • Document Data
  • Transaction Data
• Graph
  • World Wide Web
  • Molecular Structures
• Ordered
  • Spatial Data
  • Temporal Data
  • Sequential Data
  • Genetic Sequence Data

Record Data
• Data that consists of a collection of records, each of which
consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Transaction Data
• A special type of record data, where each record (transaction)
involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased are
the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
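One way to hold such data in code, sketched here as an assumption rather than a prescribed representation, is a mapping from transaction ID to a set of items. Sets fit because a transaction has no fixed attribute list, only the items it contains. The helper `transactions_containing` is a hypothetical name.

```python
from collections import Counter

# The five transactions from the table, keyed by TID.
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# In how many transactions does each item appear?
item_counts = Counter(item for items in TRANSACTIONS.values() for item in items)

def transactions_containing(itemset):
    """Return the TIDs of the transactions that contain every item in itemset."""
    return sorted(tid for tid, items in TRANSACTIONS.items() if itemset <= items)

print(item_counts["Milk"])
print(transactions_containing({"Diaper", "Milk"}))
```

Counting how often items occur together is the starting point for the kind of market-basket questions this data type is typically used for.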

Data Matrix
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a
multi-dimensional space, where each dimension represents a
distinct attribute

• Such a data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Document – term matrix
• Each document becomes a ‘term’ vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
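The construction of a term vector can be sketched as follows. The vocabulary and the sample text are hypothetical, and splitting on whitespace is a simplifying assumption; real text would need proper tokenization.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball"]
document = "Team play ball team coach play play"

vector = term_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Stacking one such vector per document, with a shared vocabulary, gives the rows of a document-term matrix like the one above.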
Graph Data
• Examples: Generic graph and HTML Links

[Figure: generic graph whose drawing carries the numeric labels 2, 5, 1, 2, 5]

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
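To show how HTML links become graph data, the sketch below collects the href targets of a page and turns them into directed edges from that page. It uses the standard-library `html.parser`; the page name `index.html`, the class `LinkCollector` and the function `edges_from` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href target of every <a> tag in the order encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def edges_from(page_name, html):
    """Return directed edges (page_name, target), one per link found in html."""
    parser = LinkCollector()
    parser.feed(html)
    return [(page_name, target) for target in parser.links]

# Two of the links from the slide; the containing page name is an assumption.
PAGE = """<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>"""

edges = edges_from("index.html", PAGE)
for edge in edges:
    print(edge)
```

Repeating this for every page yields an edge list, which is one common way to store a web graph.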

Chemical Data
• Benzene Molecule: C6H6

Ordered Data
• Sequences of transactions

[Figure: a sequence of transactions; each element of the sequence is a set of items/events]

Ordered Data
• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data
• Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]

Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers (outliers = exceptions!), and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data

Forms of data preprocessing

Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
• Noise and outliers
• Missing values
• Duplicate data

Data Cleaning
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”—
Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration

Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• data that were inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered
• Missing data may need to be inferred.

Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


• Eliminate Data Objects
• Estimate Missing Values
• Ignore the Missing Value During Analysis
• Replace with all possible values (weighted by their probabilities)

How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing
values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
• a global constant: e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based, such as a Bayesian formula or
decision tree
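Two of the automatic strategies above, the attribute mean and the class-conditional attribute mean, can be sketched as follows. The records are hypothetical (a class label and an income value, with None marking a missing value), and the function names are illustrative.

```python
from statistics import mean

# Hypothetical records: (class_label, income); None marks a missing value.
RECORDS = [
    ("yes", 40), ("yes", None), ("yes", 60),
    ("no", 100), ("no", None), ("no", 120),
]

def fill_with_attribute_mean(records):
    """Replace each missing value with the mean of all known values."""
    overall = mean(v for _, v in records if v is not None)
    return [(label, overall if v is None else v) for label, v in records]

def fill_with_class_mean(records):
    """Replace each missing value with the mean of known values in the same class."""
    class_means = {
        label: mean(v for l, v in records if l == label and v is not None)
        for label in {l for l, _ in records}
    }
    return [(label, class_means[label] if v is None else v) for label, v in records]

print(fill_with_attribute_mean(RECORDS))
print(fill_with_class_mean(RECORDS))
```

In this made-up data the overall mean sits between the two classes, while the class mean keeps each filled value close to its own group, which is the sense in which the slide calls that option "smarter". The sketch assumes every class has at least one known value.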

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions
• Other data problems that require data cleaning
• duplicate records
• incomplete data
• inconsistent data

Noise
• Noise refers to modification of original values
• Example: distortion of a person’s voice when talking on a poor phone

[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]


How to Handle Noisy Data?
• Binning method:
• first sort data and partition into (equi-depth) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible
outliers)
• Regression
• smooth by fitting the data into regression functions

Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
• Divides the range into N intervals of equal size: uniform grid
• If A and B are the lowest and highest values of the attribute, the width of
the intervals will be W = (B - A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
• Divides the range into N intervals, each containing approximately the same
number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
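Both partitioning schemes can be sketched in a few lines. The function names are illustrative, and two details are assumptions of this sketch: equal-width bins are half-open except the last, which also takes the maximum value, and equal-depth bins hand any leftover values to the earliest bins.

```python
def equal_width_bins(values, n):
    """Split values into n intervals of width W = (B - A) / n (assumes B > A)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in values:
        index = min(int((v - lo) / width), n - 1)  # the maximum goes in the last bin
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n):
    """Split the sorted values into n bins of (approximately) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
```

On this sample the equal-width bins hold 3, 3 and 6 values, while the equal-depth bins hold 4 each, which illustrates why equal-width partitioning copes less well with unevenly spread data.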

Binning Methods for Data Smoothing
• Sorted data (e.g., by price)
• 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means (rounded to the nearest integer):
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
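The two smoothing variants in this example can be checked with a short sketch. The function names are illustrative; rounding the bin mean to the nearest integer and sending ties to the lower boundary are assumptions of this sketch.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean (rounded to an integer)."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# The equi-depth bins from the example.
BINS = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

print(smooth_by_means(BINS))
print(smooth_by_boundaries(BINS))
```

The exact bin means are 9, 22.75 and 29.25, so the values 23 and 29 shown for bins 2 and 3 are rounded.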

Cluster Analysis

Regression
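[Figure: regression line y = x + 1 in the x-y plane; at x = X1 the observed value Y1 is shown together with the value Y1’ on the line]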
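Smoothing by regression can be sketched with an ordinary least-squares line fit: each observed value is replaced by the value the fitted line predicts. The sample points are hypothetical, chosen to scatter around y = x + 1, and the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy observations around the line y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 2.8, 4.1, 4.9, 6.0]

slope, intercept = fit_line(xs, ys)

# Smoothed data: every observed y is replaced by its fitted value.
smoothed = [slope * x + intercept for x in xs]

print(round(slope, 2), round(intercept, 2))
print([round(v, 2) for v in smoothed])
```

The fitted line comes out close to y = x + 1, and the smoothed values lie exactly on it, which removes the point-to-point jitter of the observations.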

Outliers
• Outliers are data objects with characteristics that are
considerably different than most of the other data objects in
the data set

Duplicate Data
• Data set may include data objects that are duplicates, or
almost duplicates of one another
• Major issue when merging data from heterogeneous sources

• Examples:
• Same person with multiple email addresses

• Data cleaning
• Process of dealing with duplicate data issues

Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
