Data Mining
The type of an attribute depends on which of the following properties of values it possesses:
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /
Given these properties, we can define four
types of attributes:
• Nominal attribute: Uses only distinctness.
Examples: ID numbers, eye color, pin codes
• Ordinal attribute: Uses distinctness & order.
Examples: Grades in {SC, FC, FCD}
Shirt sizes in {S, M, L, XL}
• Interval attribute: Uses distinctness, order & addition
Examples: calendar dates, temperatures in
Celsius or Fahrenheit.
• Ratio attribute: Uses all 4 properties
Examples: temperature in Kelvin, length, time,
counts
The types of attributes can also be described in terms of
transformations that do not change the meaning of the attribute.
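The four attribute types above form a hierarchy of supported operations. A minimal sketch (the dictionary and function names here are illustrative, not from any standard library):

```python
# Which value properties each attribute type supports,
# following the taxonomy above.
PROPERTIES = {
    "nominal":  {"distinctness"},
    "ordinal":  {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio":    {"distinctness", "order", "addition", "multiplication"},
}

def supports(attribute_type, prop):
    """Return True if the given attribute type supports the property."""
    return prop in PROPERTIES[attribute_type]
```

For example, `supports("ordinal", "order")` is true, while `supports("nominal", "order")` is false: it is meaningless to rank eye colors.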
DESCRIBING ATTRIBUTES BY THE NUMBER OF VALUES
Discrete
Has only a finite or countably infinite set of values.
Examples: pin codes, ID Numbers, or the set of words in a
collection of documents.
Often represented as integer variables.
Binary attributes are a special case of discrete attributes and
assume only 2 values.
E.g. true/false, yes/no, male/female or 0/1
Continuous
Has real numbers as attribute values.
Examples: temperature, height, or weight.
Typically represented as floating-point variables; in practice,
continuous attributes can be measured and represented only with
limited precision.
ASYMMETRIC ATTRIBUTES
Binary attributes where only non-zero values are important are
called asymmetric binary attributes.
Consider a data-set where each object is a student and each
attribute records whether or not a student took a particular
course at a university.
For a specific student, an attribute has a value of 1 if the student
took the course associated with that attribute and a value of 0
otherwise.
Because students take only a small fraction of all available
courses, most of the values in such a data-set would be 0.
Therefore, it is more meaningful and more efficient to focus on
the non-zero values.
This type of attribute is particularly important for association
analysis.
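The student/course example can be sketched as follows (course names are made up for illustration): instead of a dense 0/1 vector over all courses, we keep only the non-zero entries, i.e. the courses actually taken.

```python
# All courses offered (illustrative names).
all_courses = ["CS101", "CS102", "MATH201", "PHYS101", "HIST110"]

# Dense representation: one 0/1 attribute per course.
dense = [1, 0, 1, 0, 0]

# Asymmetric (sparse) representation: only the non-zero values,
# i.e. the set of courses the student took.
taken = {c for c, v in zip(all_courses, dense) if v == 1}

# Similarity between two students is then computed on the non-zero
# values only, e.g. the number of shared courses.
other = {"CS101", "MATH201", "PHYS101"}
shared = len(taken & other)
```

Note that counting shared 0s would make almost every pair of students look similar, which is exactly why only the non-zero values matter here.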
GENERAL CHARACTERISTICS OF DATA SETS
The following four characteristics apply to many data-sets:
Dimensionality
Dimensionality of a data-set is the number of attributes
that the objects in the data-set possess.
Data with a small number of dimensions tends to be
qualitatively different than moderate or high-
dimensional data.
The difficulties associated with analyzing high-
dimensional data are sometimes referred to as the
curse of dimensionality.
Because of this, an important motivation for preprocessing
data is dimensionality reduction.
Distribution
The distribution of a data set is the frequency of
occurrence of various values or sets of values for the
attributes comprising data objects
Sparsity
For some data-sets, such as those with asymmetric features,
most attributes of an object have a value of 0.
In practical terms, sparsity is an advantage
because usually only the non-zero values need
to be stored & manipulated.
This results in significant savings with respect to
computation-time and storage.
Some DM algorithms work well only for sparse
data.
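The storage and computation savings can be sketched with a sparse vector stored as a dict of index → non-zero value (a minimal hand-rolled version; real systems use dedicated sparse-matrix libraries):

```python
# A sparse vector keeps only its non-zero entries.
def sparse_dot(u, v):
    """Dot product of two sparse vectors, touching only non-zero entries."""
    if len(u) > len(v):
        u, v = v, u          # iterate over the smaller dict
    return sum(val * v[i] for i, val in u.items() if i in v)

# Two mostly-zero vectors of (say) length 1000: only 2 entries stored each.
a = {0: 2.0, 7: 1.0}
b = {7: 3.0, 42: 5.0}
result = sparse_dot(a, b)    # only index 7 contributes
```

Only the stored entries are ever visited, so the cost depends on the number of non-zeros, not on the nominal vector length.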
Resolution
It is frequently possible to obtain data at different
levels of resolution, and often the properties of the
data are different at different resolutions.
Ex: the surface of the earth seems very uneven at a
resolution of few meters, but is relatively smooth at a
resolution of tens of kilometers.
The patterns in the data also depend on the level of
resolution.
If the resolution is too fine, a pattern may not be
visible or may be buried in noise.
If the resolution is too coarse, the pattern may disappear.
TYPES OF DATA SETS
Record data
→ Transaction data (or market basket data)
→ Data matrix
→ Document data or Sparse data matrix
Graph data
→ Data with relationship among objects (World Wide
Web)
→ Data with objects that are Graphs (Molecular
Structures)
Ordered data
→ Sequential data (Temporal data)
→ Sequence data
→ Time series data
→ Spatial data
RECORD DATA
Data-set is a collection of records.
Each record consists of a fixed set of
attributes.
Every record has the same set of
attributes.
There is no explicit relationship among
records or attributes.
The data is usually stored either
→ in flat files or → in relational databases
TYPES OF RECORD DATA
Transaction (Market Basket Data)
Each transaction consists of a set of items.
Consider a grocery store.
The set of products purchased by a customer represents
a transaction while the individual products represent
items.
This type of data is called market basket data because
the items in each transaction are the products in a
person's "market basket."
Data can also be viewed as a set of records whose
fields are asymmetric attributes.
Data Matrix
An m*n matrix, where there are m rows,
one for each object, & n columns, one
for each attribute. This matrix is called a
data-matrix.
Since a data-matrix consists of numeric
attributes, standard matrix operations can
be applied to manipulate the data.
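As a minimal sketch of this (plain lists here, though in practice a numeric library would be used), a data matrix is m rows of objects over n attribute columns, and column-wise operations such as attribute means apply directly:

```python
# An m*n data matrix: one row per object, one column per attribute.
data = [
    [1.0, 2.0, 3.0],   # object 1
    [4.0, 5.0, 6.0],   # object 2
    [7.0, 8.0, 9.0],   # object 3
]

m = len(data)          # number of objects (rows)
n = len(data[0])       # number of attributes (columns)

# Mean of each attribute (column) across all objects.
col_means = [sum(row[j] for row in data) / m for j in range(n)]
```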
Sparse Data Matrix
This is a special case of a data-
matrix.
The attributes are of the same type
and are asymmetric i.e. only non-
zero values are important.
Document Data
A document can be represented as a
‘vector’,
where each term is an attribute of the
vector and
the value of each attribute is the number of
times the corresponding term occurs in the
document.
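This term-frequency representation can be built directly with the standard library; the example sentence is made up:

```python
from collections import Counter

# Each distinct term becomes an attribute whose value is the
# number of times the term occurs in the document.
document = "data mining turns data into knowledge"
term_vector = Counter(document.split())
```

Here `term_vector["data"]` is 2 and every other term has value 1; terms that do not occur have an implicit value of 0, which is why a document vector is a natural example of a sparse data matrix row.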
GRAPH BASED DATA
Sometimes, a graph can be a convenient and powerful
representation for data.
We consider 2 specific cases:
Data with Relationships among Objects
The relationships among objects frequently convey
important information.
In particular, the data-objects are mapped to nodes of the
graph,
while relationships among objects are captured by link
properties such as direction & weight.
For example, on the Web, the links to & from each page provide a
great deal of information about the relevance of a web-
page to a query, and thus must also be taken into
consideration.
Data with Objects that are Graphs
If the objects contain sub-objects that have
relationships, then such objects are
frequently represented as graphs.
For ex, the structure of chemical
compounds can be represented by a graph,
where nodes are atoms and
links between nodes are chemical bonds.
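The chemical-compound example can be sketched as an adjacency list (using a water molecule, H2O, for illustration):

```python
# A water molecule as a graph: atoms are nodes, bonds are edges.
molecule = {
    "O":  ["H1", "H2"],   # oxygen bonded to both hydrogens
    "H1": ["O"],
    "H2": ["O"],
}

num_atoms = len(molecule)
# In an undirected adjacency list each bond appears twice,
# once at each endpoint.
num_bonds = sum(len(neighbors) for neighbors in molecule.values()) // 2
```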
ORDERED DATA
Sequential Data (Temporal Data)
This can be thought of as an extension of record-
data, where each record has a time associated with
it.
A time can also be associated with each attribute.
For example, each record could be the purchase
history of a customer, with a listing of items
purchased at different times.
Using this information, it is possible to find patterns
such as "people who buy DVD
players tend to buy DVDs in the period immediately
following the purchase."
Sequence Data
This consists of a data-set that is a sequence of
individual entities, such as a sequence of words
or letters.
This is quite similar to sequential data, except
that there are no time stamps; instead,
there are positions in an ordered sequence.
For example, the genetic information of plants
and animals can be represented in the form of
sequences of nucleotides that are known as
genes.
Time Series Data
This is a special type of sequential data in
which a series of measurements are taken
over time.
For example, a financial data-set might
contain objects that are time series of the
daily prices of various stocks.
An important aspect of temporal-data is
temporal-autocorrelation i.e. if two
measurements are close in time, then the
values of those measurements are often very
similar.
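Temporal autocorrelation can be sketched as the lag-1 correlation of a series with a shifted copy of itself (the price series below is made up; a smoothly rising series has clearly positive autocorrelation):

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: covariance of consecutive values,
    normalized by the overall variance of the series."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t + 1] - mean)
              for t in range(n - 1))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

# Daily prices of a stock: adjacent values are very similar.
prices = [10.0, 10.5, 11.0, 11.4, 12.0, 12.3, 12.9]
r = lag1_autocorrelation(prices)   # clearly positive
```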
Spatial Data
Some objects have spatial attributes,
such as positions or areas.
An example is weather-data
(temperature, pressure) that is
collected for a variety of geographical
locations.
An important aspect of spatial-data is
spatial-autocorrelation i.e. objects that
are physically close tend to be similar
in other ways as well.
DATA PREPROCESSING
Data preprocessing is a broad area and consists of
a number of different strategies and techniques
that are interrelated in complex ways.
The main data preprocessing techniques are:
1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature subset selection
5. Feature creation
6. Discretization and binarization
7. Variable transformation
AGGREGATION
This refers to combining 2 or more attributes (or objects) into a
single attribute (or object).
For example, merging daily sales figures to obtain monthly sales
figures
Motivations for aggregation:
Data reduction: The smaller data-sets require
→ less memory → less processing time.
Because of this reduction, more expensive algorithms can be used.
Aggregation can act as a change of scale by providing a high-
level view of the data instead of a low-level view. E.g. Cities
aggregated into districts, states, countries, etc
The behavior of groups of objects is often more stable than that
of individual objects.
Disadvantage: The potential loss of interesting details.
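The daily-to-monthly example can be sketched with the standard library (dates and amounts are made up):

```python
from collections import defaultdict

# Daily sales records: (date, amount).
daily_sales = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-05", 200.0),
]

# Aggregate daily figures into monthly totals.
monthly_sales = defaultdict(float)
for date, amount in daily_sales:
    month = date[:7]              # "YYYY-MM"
    monthly_sales[month] += amount
```

The aggregated data is smaller and smoother, but the individual daily figures (the "interesting details") are gone, which is exactly the disadvantage noted above.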
SAMPLING
This is a method used for selecting a subset of the data-
objects to be analyzed.
This is often used for both
→ preliminary investigation of the data
→ final data analysis
Q: Why sampling?
Ans: Obtaining & processing the entire set of “data of
interest” is too expensive or time consuming.
Sampling can reduce the data-size to the point where better
but more expensive algorithms can be used.
Key principle for effective sampling: Using a sample will
work almost as well as using entire data-set, if the
sample is representative.
Sampling Methods
Simple Random Sampling
There is an equal probability of selecting any particular
object.
There are 2 variations on random sampling:
Sampling without Replacement
As each object is selected, it is removed from the
population.
Sampling with Replacement
Objects are not removed from the population as they are
selected for the sample.
The same object can be picked up more than once.
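Both variations are available directly in Python's standard library, as this sketch shows:

```python
import random

random.seed(0)                 # fixed seed so the sketch is repeatable
population = list(range(100))

# Without replacement: all 10 selected objects are distinct.
without_repl = random.sample(population, k=10)

# With replacement: the same object can be picked more than once.
with_repl = random.choices(population, k=10)
```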
When the population consists of different types (or numbers)
of objects, simple random
sampling can fail to adequately represent those types of objects
that are less frequent.
Stratified Sampling
This starts with pre-specified groups of
objects.
In the simplest version, equal numbers of
objects are drawn from each group even
though the groups are of different sizes.
In another variation, the number of objects
drawn from each group is proportional to
the size of that group
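The proportional variation can be sketched as follows (group names and sizes are made up):

```python
import random

random.seed(1)
# Pre-specified groups of different sizes.
groups = {
    "freshman": list(range(600)),        # 600 objects
    "senior":   list(range(600, 800)),   # 200 objects
}

total = sum(len(g) for g in groups.values())
sample_size = 40

# Draw from each group in proportion to its size:
# 600/800 * 40 = 30 freshmen, 200/800 * 40 = 10 seniors.
sample = []
for members in groups.values():
    k = round(sample_size * len(members) / total)
    sample.extend(random.sample(members, k))
```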
Progressive Sampling
If proper sample-size is difficult to determine
then progressive sampling can be used.
This method starts with a small sample, and
then increases the sample-size until a sample of
sufficient size has been obtained.
This method requires a way to evaluate the
sample to judge if it is large enough.
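A sketch of the progressive loop, assuming some evaluation function is available (the `good_enough` criterion below is a stand-in; a real application would use, e.g., a model-accuracy check):

```python
import random

random.seed(2)
population = list(range(10_000))

def good_enough(sample):
    # Illustrative criterion only: sample mean close to the
    # (here known) population mean of 4999.5.
    return abs(sum(sample) / len(sample) - 4999.5) < 100

# Start small, then double the sample size until it passes evaluation.
size = 100
sample = random.sample(population, size)
while not good_enough(sample) and size < len(population):
    size = min(size * 2, len(population))
    sample = random.sample(population, size)
```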
DIMENSIONALITY REDUCTION
Key benefit: many DM algorithms work better if
the dimensionality is lower.
Purpose
May help to eliminate irrelevant features or
reduce noise.
Can lead to a more understandable model
(which can be easily visualized).
Reduces the amount of time and memory required
by DM algorithms.
Helps avoid the curse of dimensionality.
The Curse of Dimensionality
Data-analysis becomes significantly harder as the
dimensionality of the data increases.
For classification, this can mean that there are not
enough data-objects to allow the creation of a
model that reliably assigns a class to all possible
objects.
For clustering, the definitions of density and the
distance between points (which are critical for
clustering) become less meaningful.
As a result, we get
→ reduced classification accuracy &
→ poor quality clusters.
FEATURE SUBSET SELECTION
Another way to reduce the dimensionality is to use only
a subset of the features.
Although it might seem that such an approach would lose
information, this is not the case if redundant and
irrelevant features are present.
Redundant features duplicate much or all of the
information contained in one or more other attributes.
For example: purchase price of a product and the
amount of sales tax paid.
Irrelevant features contain almost no useful
information for the DM task at hand.
For example: students' ID numbers are irrelevant to the
task of predicting students' grade point averages.
Techniques for Feature Selection
1. Embedded approaches: Feature selection
occurs naturally as part of DM algorithm.
Specifically, during the operation of the DM
algorithm, the algorithm itself decides which
attributes to use and which to ignore.
2. Filter approaches: Features are selected
before the DM algorithm is run.
3. Wrapper approaches: Use DM algorithm as a
black box to find best subset of attributes.
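A filter approach can be sketched with a simple score computed before any mining algorithm runs; here the score is variance (a near-constant feature carries little information). The data and threshold are illustrative:

```python
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Rows are objects, columns are features.
data = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.0, 0.1],
]

n_features = len(data[0])
columns = [[row[j] for row in data] for j in range(n_features)]

# Keep only the features whose variance exceeds a threshold;
# features 1 and 2 are (nearly) constant and are dropped.
selected = [j for j, col in enumerate(columns) if variance(col) > 0.01]
```

A wrapper approach would instead run the mining algorithm itself on each candidate subset and keep the subset that scores best, which is more accurate but far more expensive.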
An Architecture for Feature Subset
Selection
The feature selection process is
viewed as consisting of 4 parts:
A measure of evaluating a subset,
A search strategy that controls the
generation of a new subset of
features,
A stopping criterion and
A validation procedure.
DATA MINING APPLICATIONS
Prediction & Description
Data mining may be used to answer questions like
→ "would this customer buy a product" or
→ "is this customer likely to leave?"
DM techniques may also be used for sales forecasting and
analysis.
Relationship Marketing
Customers have a lifetime value, not just the value of a
single sale.
Data mining can help
→ in analyzing customer profiles and improving direct
marketing plans
→ in identifying critical issues that determine client loyalty
and
→ in improving customer retention
Customer Profiling
This is the process of using the relevant and available
information
→ to describe the characteristics of a group of customers
→ to identify their discriminators from ordinary consumers
and
→ to identify drivers for their purchasing decisions
This can help an enterprise identify its most valuable
customers
so that the enterprise may differentiate their needs and
values.
Outliers Identification & Detecting Fraud
For this, examples include:
→ identifying unusual expense claims by staff
→ identifying anomalies in expenditure between similar units
of an enterprise
→ identifying fraud involving credit cards
Customer Segmentation
This is a way to assess & view individuals in market
based on their status & needs.
Data mining may be used
→ to understand & predict customer behavior and
profitability
→ to develop new products & services and
→ to effectively market new offerings
Web site Design & Promotion
Web mining may be used to discover how users
navigate a web site and the results
can help in improving the site design.
Web mining may also be used in cross-selling by
suggesting to a web customer items that they may be
interested in.