Lecture 2

Data Mining: Concepts and Techniques

Dr. Atif Ali Mohamed

Assistant Professor … University of Science and Technology
Head of the ICT Department
Mobile: 0123393000 … 0912534290
Web site: www.dratifnimir.info
E-mail: [email protected]

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1


Data Mining: Data Sets

What is a data set?
Types of data sets
Data Quality
Data Preprocessing
What is a data set?
 A collection of data objects and their attributes
 An attribute is a property or characteristic of an object
 A collection of attributes describes an object
 An object is also known as a record, point, case, sample, entity, tuple, row, or instance

Example (objects are rows, attributes are columns):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
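The objects-and-attributes structure above maps directly onto ordinary data structures. A minimal sketch in Python (not from the slides; the first three rows of the table, with income stored as thousands):

```python
# Each dict is one data object (row); each key is an attribute (column).
records = [
    {"Tid": 1, "Refund": "Yes", "Marital Status": "Single",  "Taxable Income": 125, "Cheat": "No"},
    {"Tid": 2, "Refund": "No",  "Marital Status": "Married", "Taxable Income": 100, "Cheat": "No"},
    {"Tid": 3, "Refund": "No",  "Marital Status": "Single",  "Taxable Income": 70,  "Cheat": "No"},
]

# The fixed set of attributes shared by every object:
attributes = list(records[0].keys())
print(attributes)
# The value of one attribute for one object:
print(records[1]["Refund"])
```

Each record has the same fixed set of attributes, which is what makes this record data rather than, say, transaction data.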
Types of Attributes
 There are four main types of attributes:
Nominal
 Examples: ID numbers, eye color, zip codes
Ordinal
 Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
Interval
 Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio
 Examples: temperature in Kelvin, length, time, counts
For each attribute type: its description, example attributes, and the operations that are meaningful on it.

Nominal: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
 Examples: zip codes, employee ID numbers, eye color, sex: {male, female}

Ordinal: The values of an ordinal attribute provide enough information to order objects. (<, >)
 Examples: hardness of minerals, {good, better, best}, grades, street numbers

Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
 Examples: calendar dates, temperature in Celsius or Fahrenheit

Ratio: For ratio variables, both differences and ratios are meaningful. (*, /)
 Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Discrete and Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values
 Examples: zip codes, counts, or the set of words in a collection of
documents
 Often represented as integer variables.
 Note: binary attributes are a special case of discrete attributes

 Continuous Attribute
 Has real numbers as attribute values
 Examples: temperature, height, or weight.
 Practically, real values can only be measured and represented using a
finite number of digits.
 Continuous attributes are typically represented as floating-point
variables.
Types of data sets
Record
 Data Matrix
 Document Data
 Transaction Data
Graph
 World Wide Web
 Molecular Structures
Ordered
 Spatial Data
 Temporal Data
 Sequential Data
 Genetic Sequence Data

Important Characteristics of Structured Data:
– Dimensionality
 Curse of Dimensionality
– Sparsity
 Only presence counts
– Resolution
 Patterns depend on the scale
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
 If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
 Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
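The two objects above form a 2-by-5 matrix. A minimal sketch using nested lists (stdlib only, no numpy assumed):

```python
# m-by-n data matrix: m = 2 objects (rows), n = 5 attributes (columns).
columns = ["Projection of x Load", "Projection of y Load",
           "Distance", "Load", "Thickness"]
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]
m = len(matrix)        # one row per object
n = len(matrix[0])     # one column per attribute
print(m, n)
```

Because every cell is numeric and every row has the same columns, standard linear-algebra operations (distances, projections) apply directly, which is what distinguishes a data matrix from general record data.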
Document Data
Each document becomes a 'term' vector,
Each term is a component (attribute) of the vector,
The value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2      0        2
Document 2    0     7      0     2     1      0     0    3      0        0
Document 3    0     1      0     0     1      2     2    0      3        0
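Building term vectors amounts to counting words over a shared vocabulary. A small sketch with toy documents (my own example text; the counts do not match the table above):

```python
from collections import Counter

# Two toy documents; each becomes a vector over the combined vocabulary,
# where each component is how often that term occurs in the document.
docs = [
    "team coach play ball score game win lost timeout season game team",
    "coach game season win",
]
vocab = sorted(set(" ".join(docs).split()))
vectors = [[Counter(d.split())[t] for t in vocab] for d in docs]
print(vocab)
print(vectors[0])
```

Most entries of a real document-term matrix are zero, which is the "sparsity" characteristic listed earlier: only presence counts.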
Transaction Data
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute
a transaction, while the individual products that were
purchased are the items.

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
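Unlike record data, each transaction holds a variable-size set of items. A minimal sketch of the table above (the support count is my own illustration of a typical query on such data):

```python
# Each transaction is a set of items, keyed by transaction id (TID).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# A common question on transaction data: in how many baskets does an item appear?
support_milk = sum("Milk" in items for items in transactions.values())
print(support_milk)   # Milk appears in transactions 1, 3, 4, and 5
```

Sets are a natural fit because item order and duplicates within one basket do not matter for this representation.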
Graph Data
Examples: Generic graph and HTML links

[Figure: a generic graph with numbered nodes and edges, omitted]

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?

Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data
Noise
Noise refers to modification of original values
Examples: distortion of a person's voice when talking on a poor phone and "snow" on a television screen

[Figure: two sine waves, and the same two sine waves with noise added]
Outliers
Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set
Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

Handling missing values


Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by their
probabilities)
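Two of the strategies above, eliminating objects and estimating the missing value, can be sketched on a toy column (my own example; `None` marks a missing value, and mean imputation stands in for "estimate"):

```python
# A toy attribute with two missing values.
ages = [25, None, 40, 35, None, 30]

# Strategy 1: eliminate data objects that have a missing value.
complete = [a for a in ages if a is not None]

# Strategy 2: estimate the missing value, here with the mean of observed values.
mean_age = sum(complete) / len(complete)
imputed = [a if a is not None else mean_age for a in ages]

print(complete)
print(imputed)
```

Elimination is safe but discards information; imputation keeps every object at the cost of injecting an estimate, which is why the choice depends on how much data is missing and why.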
Duplicate Data
Data set may include data objects that are
duplicates, or almost duplicates of one another
Major issue when merging data from heterogeneous
sources

Examples:
Same person with multiple email addresses

Data cleaning
Process of dealing with duplicate data issues
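A tiny sketch of such cleaning for the "same person, multiple representations" case (hypothetical records and a normalized-email key, my own illustration):

```python
# Two near-duplicate records for the same person, differing only in
# email whitespace and letter case, plus one distinct person.
people = [
    {"name": "Ana Diaz", "email": "Ana.Diaz@example.com"},
    {"name": "Ana Diaz", "email": " ana.diaz@example.com"},
    {"name": "Bo Chen",  "email": "bo@example.com"},
]

# Deduplicate on a cleaned key: trimmed, lowercased email.
seen, deduped = set(), []
for p in people:
    key = p["email"].strip().lower()
    if key not in seen:
        seen.add(key)
        deduped.append(p)

print(len(deduped))
```

Real deduplication across heterogeneous sources is harder (fuzzy name matching, conflicting field values), but the pattern of normalizing to a comparison key is the common starting point.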
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into
a single attribute (or object)

Purpose
Data reduction
 Reduce the number of attributes or objects
Change of scale
 Cities aggregated into regions, states, countries, etc
More “stable” data
 Aggregated data tends to have less variability
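The "cities aggregated into regions" example can be sketched directly (hypothetical figures, my own illustration):

```python
# Daily city-level sales rolled up to one value per region:
# fewer objects (data reduction) at a coarser scale (change of scale).
sales = [
    ("North", "CityA", 120), ("North", "CityB", 80),
    ("South", "CityC", 200), ("South", "CityD", 100),
]

totals = {}
for region, _city, amount in sales:
    totals[region] = totals.get(region, 0) + amount

print(totals)
```

Four objects become two, and because each regional total averages out city-level fluctuations, the aggregated series is typically more stable, as the slide notes.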
Sampling
 Sampling is the main technique employed for data selection.
It is often used for both the preliminary investigation of the
data and the final data analysis.
 Statisticians sample because obtaining the entire set of data of
interest is too expensive or time consuming.
 Sampling is used in data mining because processing the entire
set of data of interest is too expensive or time consuming.
 The key principle for effective sampling is the following:
using a sample will work almost as well as using the entire
data set, if the sample is representative
 A sample is representative if it has approximately the same
property (of interest) as the original set of data
Types of Sampling
 Simple Random Sampling
 There is an equal probability of selecting any particular item

 Sampling without replacement


 As each item is selected, it is removed from the population

 Sampling with replacement


 Objects are not removed from the population as they are selected for
the sample.
 In sampling with replacement, the same object can be picked more
than once

 Stratified sampling
 Split the data into several partitions; then draw random samples
from each partition
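All three sampling schemes are one-liners with the standard library (my own sketch; the even/odd strata are an arbitrary partition for illustration):

```python
import random

random.seed(0)
population = list(range(100))

# Simple random sampling WITHOUT replacement:
# each item is removed from the population once selected, so no duplicates.
without = random.sample(population, 10)

# Sampling WITH replacement: the same object can be picked more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: split the data into partitions (strata),
# then draw a random sample from each partition.
strata = {"even": [x for x in population if x % 2 == 0],
          "odd":  [x for x in population if x % 2 == 1]}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

print(len(without), len(set(without)))   # without replacement: all distinct
print(len(stratified))                   # 5 per stratum
```

Stratified sampling guarantees each partition is represented, which matters when some strata are rare and simple random sampling might miss them.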
Curse of Dimensionality
When dimensionality increases, data becomes increasingly
sparse in the space that it occupies

Definitions of density and distance between points, which
are critical for clustering and outlier detection, become
less meaningful

Illustration (from the slide's figure): randomly generate 500
points, then compute the difference between the max and min
distance between any pair of points
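That experiment is easy to rerun. A rough recreation (my own code, with 100 points instead of 500 to keep the pairwise-distance computation fast):

```python
import math
import random

random.seed(1)

def spread(dim, n=100):
    """Relative gap between the largest and smallest pairwise distance
    among n random points in the dim-dimensional unit cube."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

low_dim, high_dim = spread(2), spread(100)
print(low_dim > high_dim)   # distances concentrate as dimensionality grows
```

In low dimensions the nearest and farthest pairs differ enormously; in high dimensions all pairwise distances crowd together, which is exactly why distance-based density becomes less meaningful.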
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data mining
algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce noise

Techniques
Principal Component Analysis (PCA)
Singular Value Decomposition
Others: supervised and non-linear techniques
Feature Subset Selection
 Another way to reduce dimensionality of data

 Redundant features
duplicate much or all of the information contained in one
or more other attributes
Example: purchase price of a product and the amount of
sales tax paid

 Irrelevant features
contain no information that is useful for the data mining
task at hand
Example: a student's ID is often irrelevant to the task of
predicting the student's GPA
Feature Subset Selection
Techniques:
Brute-force approach:
Try all possible feature subsets as input to data mining
algorithm
Embedded approaches:
 Feature selection occurs naturally as part of the data mining
algorithm
Filter approaches:
 Features are selected before data mining algorithm is run
Wrapper approaches:
 Use the data mining algorithm as a black box to find best
subset of attributes
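As one concrete instance of a filter approach (my own illustration, not from the slides), features can be scored before any mining algorithm runs, here by variance, so a constant, information-free column is dropped:

```python
def variance(xs):
    """Population variance of a sequence of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# rows = objects, columns = features; feature 1 is constant and so
# carries no information for any mining task.
data = [[1.0, 5.0, 0.1],
        [2.0, 5.0, 0.9],
        [3.0, 5.0, 0.5]]

cols = list(zip(*data))
kept = [j for j, col in enumerate(cols) if variance(col) > 1e-12]
print(kept)   # the constant column is filtered out
```

Filters like this are cheap because they never invoke the mining algorithm; wrapper approaches instead rerun the algorithm on candidate subsets and keep whichever subset scores best.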
Feature Creation
Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes

Three general methodologies:


Feature Extraction
 domain-specific
Mapping Data to New Space
Feature Construction
 combining features
End of the Lecture

Thanks
