Data Mining

Data mining is the process of automatically discovering useful patterns from large amounts of data. It involves techniques from machine learning, statistics, and database systems. The main data mining tasks are predictive modeling, cluster analysis, association analysis, and anomaly detection. These tasks are used to extract useful knowledge from data to solve problems in various application domains like marketing, fraud detection, science, and health care.


WHAT IS DATA MINING?

Data Mining is the process of automatically discovering useful information in large data repositories.
DM techniques
→ can be used to search large databases to find useful patterns that might otherwise remain unknown
→ provide capabilities to predict the outcome of future observations
Why do we need Data Mining?
Conventional database systems provide users with query & reporting tools.
To some extent, these query & reporting tools can assist in answering questions like "Where did the largest number of students come from last year?"
But these tools cannot provide any intelligence about why it happened.
Taking the example of a university database system:
The OLTP system will quickly be able to answer a query like "How many students are enrolled in the university?"
The OLAP system, using the data warehouse, will be able to show trends in students' enrollments (e.g. how many students prefer BCA).
Data mining will be able to answer where the university should market itself.
DATA MINING AND KNOWLEDGE DISCOVERY
Data Mining is an integral part of KDD (Knowledge
Discovery in Databases).
KDD is the overall process of converting raw data
into useful information (Figure: 1.1).
The input data is stored in various formats such as flat files, spreadsheets, or relational tables.

Purpose of preprocessing: to transform the raw input data into an appropriate format for subsequent analysis.

The steps involved in data preprocessing include
→ combining data from multiple sources,
→ cleaning data to remove noise & duplicate observations, and
→ selecting records & features that are relevant to the DM task at hand (see the sketch at the end of this section).
• Data preprocessing is perhaps the most time-consuming step in the overall knowledge discovery process.
• "Closing the loop" refers to the process of integrating DM results into decision support systems.
• Such integration requires a postprocessing step. This
step ensures that only valid and useful results are
incorporated into the decision support system.
• An example of postprocessing is visualization.
• Visualization can be used to explore data and DM
results from a variety of viewpoints.
• Statistical measures can also be applied during
postprocessing to eliminate bogus DM results.
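To make the preprocessing steps above concrete, here is a minimal sketch using pandas; the file names, column names, and the year filter are all hypothetical:

```python
import pandas as pd

# Combine data from multiple sources (hypothetical CSV files).
enrolment = pd.read_csv("enrolment.csv")
demographics = pd.read_csv("demographics.csv")
data = enrolment.merge(demographics, on="student_id")

# Clean data: remove duplicate observations and rows with missing values.
data = data.drop_duplicates().dropna()

# Select the records & features relevant to the DM task at hand.
recent = data[data["year"] == 2023]
features = recent[["gpa", "credits", "age"]]
```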
MOTIVATING CHALLENGES
Scalability

• Nowadays, data sets with sizes of terabytes or even petabytes are becoming common.
• DM algorithms must be scalable in order to handle
these massive data sets.
• Scalability may also require the implementation of
novel data structures to access individual records in
an efficient manner.
• Scalability can also be improved by developing
parallel & distributed algorithms.
High Dimensionality
• Traditional data-analysis techniques can only deal with low-dimensional data.
• Nowadays, data-sets with hundreds or
thousands of attributes are becoming
common.
• Data-sets with temporal or spatial
components also tend to have high
dimensionality.
• The computational complexity increases
rapidly as the dimensionality increases.
Heterogeneous and Complex Data
Traditional analysis methods can deal only with homogeneous types of attributes.
Recent years have also seen the emergence of
more complex data-objects.
DM techniques for complex objects should take
into consideration relationships in the data, such
as
→ temporal & spatial autocorrelation
Spatio-temporal autocorrelation analysis is an exploratory approach to recognizing
data distribution in space and time.
→ parent-child relationships between the
elements in semi-structured text & XML
documents
Data Ownership & Distribution

• Sometimes, the data is geographically distributed among resources belonging to multiple entities.
• Key challenges include:
⮚ How to reduce the amount of communication needed to perform the distributed computation,
⮚ How to effectively consolidate the DM results obtained from multiple sources, and
⮚ How to address data-security issues.
Non-Traditional Analysis
• The traditional statistical approach is based on a hypothesize-and-test paradigm.
• In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis.
• Current data-analysis tasks often require the generation and evaluation of thousands of hypotheses; consequently, the development of some DM techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation.
THE ORIGIN OF DATA MINING
• Data mining draws upon ideas from
→ sampling, estimation, and hypothesis testing from statistics
→ search algorithms, modeling techniques, and learning theories from AI, pattern recognition, and machine learning, as well as from database systems
• Traditional techniques may be unsuitable due to
→ enormity of the data
→ high dimensionality of the data
→ heterogeneous nature of the data
• Data mining has also been quick to adopt ideas from other areas, including
→ optimization → evolutionary computing → signal processing → information theory
• Database systems are needed to provide support for efficient storage, indexing, and query processing.
• Parallel computing and distributed technology help address the size and performance issues in data mining.
DATA MINING TASKS
DM tasks are generally divided into 2 major
categories.
Predictive Tasks
• The objective is to predict the value of a
particular attribute based on the values of
other attributes.
• The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.
Predicting the type of a flower (Iris data set)
Three species – Setosa, Versicolour, Virginica
→ Petal width low and petal length low implies Setosa
→ Petal width medium and petal length medium implies Versicolour
→ Petal width high and petal length high implies Virginica
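Rules like these can be learned automatically. A minimal sketch with scikit-learn's decision tree classifier on the built-in Iris data set (the depth limit is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Use only petal length and petal width (columns 2 and 3).
X = iris.data[:, 2:4]
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, iris.target)

# Print the learned rules; they roughly mirror the low/medium/high rules above.
print(export_text(clf, feature_names=["petal length", "petal width"]))
```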
Descriptive Tasks
• The objective is to derive patterns
(correlations, trends, clusters,
trajectories and anomalies) that
summarize the relationships in data.
• Descriptive DM tasks are often
exploratory in nature and frequently
require postprocessing techniques to
validate and explain the results.
Four of the Core Data Mining Tasks
Predictive Modeling
• This refers to the task of building
a model for the target variable as
a function of the explanatory
variable.
• The goal is to learn a model that
minimizes the error between the
predicted and true values of the
target variable.
Predictive modeling is a technique that uses mathematical and computational methods to predict an event or outcome. Examples include time-series regression models for predicting airline traffic volume, or predicting fuel efficiency based on a linear regression model of engine speed versus load.
There are 2 types of predictive modeling tasks:
i. Classification: used for discrete target variables.
Ex: Predicting whether a web user will make a purchase at an online bookstore is a classification task, because the target variable is binary-valued.
ii. Regression: used for continuous target variables.
Ex: Forecasting the future price of a stock is a regression task, because price is a continuous-valued attribute.
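As a sketch of a regression task, a linear model fit to a synthetic price series (real stock forecasting needs far more care; the data below is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic daily prices: an upward trend plus noise (illustration only).
days = np.arange(100).reshape(-1, 1)
prices = 50 + 0.3 * days.ravel() + np.random.default_rng(0).normal(0, 2, 100)

# Fit a linear trend and predict the (continuous-valued) price on a future day.
model = LinearRegression().fit(days, prices)
print(model.predict([[110]]))
```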
Cluster Analysis
This seeks to find groups of closely
related observations so that observations
that belong to the same cluster are more
similar to each other than observations
that belong to other clusters.
Clustering has been used
→ to group sets of related customers
→ to find areas of the ocean that have a
significant impact on the Earth's climate
The collection of news articles shown in the table can be grouped based on their respective topics.
Each article is represented as a set of word-frequency pairs (w: c), where w is the word and c is the number of times the word appears in the article.
There are two natural clusters in the data set:
→ the first cluster – the first four articles – news about the economy
→ the second cluster – the last four articles – news about health care
A good clustering algorithm should be able to identify these two clusters based on the similarity between the words that appear in the articles.
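A minimal sketch of this idea with scikit-learn, using hypothetical stand-in articles; the vectorizer builds the word-weight vectors and k-means groups them into two clusters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical articles: two about the economy, two about health care.
articles = [
    "stocks markets economy trade growth",
    "economy jobs inflation markets trade",
    "hospital patients health care doctors",
    "health insurance care treatment patients",
]

# Represent each article as a vector of word weights, then cluster.
X = TfidfVectorizer().fit_transform(articles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # articles on the same topic share a cluster label
```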
Association Analysis
This is used to discover patterns that describe
strongly associated features in the data.
The goal is to extract the most interesting patterns
in an efficient manner.
Useful applications include
→ finding groups of genes that have related
functionality or
→ identifying web pages that are accessed together
Ex: market basket analysis
Association analysis can be applied to find items
that are frequently bought together by customers
at the checkout counters of a grocery store.

We may discover the rule that {diapers} -> {Milk},


which suggests that customers who buy diapers
also tend to buy milk.

This type of rule can be used to identify potential


cross-selling opportunities among related items.
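A minimal pure-Python sketch of the idea, computing the support and confidence of the rule {diapers} -> {milk} over a hypothetical set of transactions:

```python
# Hypothetical market basket transactions.
transactions = [
    {"diapers", "milk", "bread"},
    {"diapers", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
]

# Support: fraction of all transactions that contain both items.
both = sum(1 for t in transactions if {"diapers", "milk"} <= t)
support = both / len(transactions)

# Confidence: of the transactions with diapers, the fraction that also have milk.
with_diapers = sum(1 for t in transactions if "diapers" in t)
confidence = both / with_diapers

print(f"support={support:.2f}, confidence={confidence:.2f}")
```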
Anomaly Detection
This is the task of identifying observations
whose characteristics are significantly different
from the rest of the data. Such observations
are known as anomalies or outliers.
The goal is
→ to discover the real anomalies and
→ to avoid falsely labeling normal objects as
anomalous.
Applications include the detection of fraud,
network intrusions, and unusual patterns of
disease.
Example(Credit Card Fraud Detection).
A credit card company records the transactions made
by every credit card holder, along with personal
information such as credit limit, age, annual income,
and address.
Since the number of fraudulent cases is relatively
small compared to the number of legitimate
transactions, anomaly detection techniques can be
applied to build a profile of legitimate transactions
for the users.
When a new transaction arrives, it is compared
against the profile of the user.
If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
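A minimal sketch of this profile idea: summarize a user's legitimate transaction amounts, then flag a new amount that deviates strongly. The history and the z-score threshold are hypothetical:

```python
import numpy as np

# Hypothetical history of one user's legitimate transaction amounts.
history = np.array([23.5, 41.0, 18.2, 35.9, 29.4, 44.1, 27.8])

# The profile: mean and standard deviation of past amounts.
mu, sigma = history.mean(), history.std()

def is_suspicious(amount, threshold=3.0):
    """Flag a transaction whose z-score exceeds the threshold."""
    return abs(amount - mu) / sigma > threshold

print(is_suspicious(31.0))   # False: consistent with the profile
print(is_suspicious(950.0))  # True: flagged as potentially fraudulent
```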
DATA

The type of data
Data sets differ in a number of ways – the attributes used to describe data objects can be of different types:
→ quantitative
→ qualitative
Quality of the data
Data is often far from perfect. Understanding and improving data quality improves the quality of the resulting analysis.
Often, the raw data must be processed in order to make it suitable for analysis.
WHAT IS A DATA OBJECT?
A data-set refers to a collection of data-objects
and their attributes.
Other names for a data-object are record,
transaction, vector, event, entity, sample or
observation.
Data-objects are described by a number of
attributes such as
→ mass of a physical object or
→ time at which an event occurred.
Other names for an attribute are dimension, variable, field, feature, or characteristic.
Example 2.2 (Student Information).
Often, a data-set is a file, in which the objects are records (or rows) in the file and each field (or column) corresponds to an attribute.
For example, Table 2.1 shows a data-set that consists of
student information.
Each row corresponds to a student and each column is an attribute that describes some aspect of a student, such as grade point average (GPA) or identification number (ID).
WHAT IS AN ATTRIBUTE?
An attribute is a characteristic of an object
that may vary, either
→ from one object to another or
→ from one time to another.
For example, eye color varies from person to
person.
Eye color is a symbolic attribute with a small
no. of possible values {brown, black, blue,
green}.
PROPERTIES OF ATTRIBUTE VALUES

The type of an attribute depends on which of the following properties it possesses:
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /
Given these properties, we can define four
types of attributes:
• Nominal attribute: Uses only distinctness.
Examples: ID numbers, eye color, pin codes
• Ordinal attribute: Uses distinctness & order.
Examples: Grades in {SC, FC, FCD}
Shirt sizes in {S, M, L, XL}
• Interval attribute: Uses distinctness, order & addition
Examples: calendar dates, temperatures in
Celsius or Fahrenheit.
• Ratio attribute: Uses all 4 properties
Examples: temperature in Kelvin, length, time,
counts
The types of attributes can also be described in terms of the transformations that do not change the meaning of the attribute.
DESCRIBING ATTRIBUTES BY THE NUMBER OF VALUES
Discrete
Has only a finite or countably infinite set of values.
Examples: pin codes, ID Numbers, or the set of words in a
collection of documents.
Often represented as integer variables.
Binary attributes are a special case of discrete attributes and
assume only 2 values.
E.g. true/false, yes/no, male/female or 0/1
Continuous
Has real numbers as attribute values.
Examples: temperature, height, or weight.
These attributes are typically represented as floating-point variables and can be measured and represented only with limited precision.
ASYMMETRIC ATTRIBUTES
Binary attributes where only non-zero values are important are
called asymmetric binary attributes.
Consider a data-set where each object is a student and each
attribute records whether or not a student took a particular
course at a university.
For a specific student, an attribute has a value of 1 if the student
took the course associated with that attribute and a value of 0
otherwise.
Because students take only a small fraction of all available
courses, most of the values in such a data-set would be 0.
Therefore, it is more meaningful and more efficient to focus on
the non-zero values.
This type of attribute is particularly important for association
analysis.
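In practice such data is stored sparsely, keeping only the non-zero entries. A minimal sketch with scipy (the student and course indices are hypothetical):

```python
from scipy.sparse import csr_matrix

# Rows are students, columns are courses; store only the 1s (courses taken).
rows = [0, 0, 1, 2]   # student indices
cols = [3, 7, 3, 1]   # course indices
vals = [1, 1, 1, 1]
enrolments = csr_matrix((vals, (rows, cols)), shape=(3, 10))

# Only 4 of the 30 cells are stored explicitly.
print(enrolments.nnz, "non-zero values out of", 3 * 10)
```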
GENERAL CHARACTERISTICS OF DATA SETS
Following 3 characteristics apply to many data-sets:
Dimensionality
Dimensionality of a data-set is no. of attributes that the
objects in the data-set possess.
Data with a small number of dimensions tends to be qualitatively different from moderate or high-dimensional data.
The difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality.
Because of this, an important motivation in
preprocessing data is dimensionality reduction.
Distribution
The distribution of a data set is the frequency of occurrence of the various values, or sets of values, for the attributes comprising the data objects.
Sparsity
For some data-sets with asymmetric features, most attributes of an object have values of 0.
In practical terms, sparsity is an advantage
because usually only the non-zero values need
to be stored & manipulated.
This results in significant savings with respect to
computation-time and storage.
Some DM algorithms work well only for sparse
data.
Resolution
It is frequently possible to obtain data at different
levels of resolution, and often the properties of the
data are different at different resolutions.
Ex: the surface of the earth seems very uneven at a
resolution of few meters, but is relatively smooth at a
resolution of tens of kilometers.
The patterns in the data also depend on the level of
resolution.
If the resolution is too fine, a pattern may not be
visible or may be buried in noise.
If the resolution is too coarse, the pattern may disappear.
TYPES OF DATA SETS
Record data
→ Transaction data (or market basket data)
→ Data matrix
→ Document data or Sparse data matrix
Graph data
→ Data with relationship among objects (World Wide
Web)
→ Data with objects that are Graphs (Molecular
Structures)
Ordered data
→ Sequential data (Temporal data)
→ Sequence data
→ Time series data
→ Spatial data
RECORD DATA
Data-set is a collection of records.
Each record consists of a fixed set of
attributes.
Every record has the same set of
attributes.
There is no explicit relationship among
records or attributes.
The data is usually stored either
→ in flat files or → in relational databases
TYPES OF RECORD DATA
Transaction (Market Basket Data)
Each transaction consists of a set of items.
Consider a grocery store.
The set of products purchased by a customer represents
a transaction while the individual products represent
items.
This type of data is called market basket data because
the items in each transaction are the products in a
person's "market basket."
Data can also be viewed as a set of records whose
fields are asymmetric attributes.
Data Matrix
An m*n matrix, where there are m rows,
one for each object, & n columns, one
for each attribute. This matrix is called a
data-matrix.
Since the data-matrix consists of numeric attributes, standard matrix operations can be applied to manipulate the data, as in the sketch below.
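A minimal numpy sketch of this (the values are hypothetical): column means, centering, and a matrix product all apply directly to the data matrix.

```python
import numpy as np

# m = 4 objects (rows), n = 3 numeric attributes (columns).
X = np.array([[1.0, 2.0, 0.5],
              [0.8, 1.9, 0.7],
              [3.1, 0.2, 2.2],
              [2.9, 0.1, 2.0]])

# Standard matrix operations: column means, centering, pairwise products.
centered = X - X.mean(axis=0)
similarity = centered @ centered.T
print(similarity.shape)  # (4, 4)
```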
Sparse Data Matrix
This is a special case of a data-matrix.
The attributes are of the same type and are asymmetric, i.e. only non-zero values are important.
Document Data
A document can be represented as a 'vector', where each term is an attribute of the vector and the value of each attribute is the no. of times the corresponding term occurs in the document.
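A minimal sketch of building such a term-count vector in Python:

```python
from collections import Counter

document = "dollar rises as the economy grows and the economy adds jobs"

# Each attribute is a term; its value is how often the term occurs.
vector = Counter(document.split())
print(vector["economy"])  # 2
```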
GRAPH BASED DATA
Sometimes, a graph can be a convenient and powerful
representation for data.
We consider 2 specific cases:
Data with Relationships among Objects
The relationships among objects frequently convey
important information.
In particular, the data-objects are mapped to nodes of the
graph,
while relationships among objects are captured by link
properties such as direction & weight.
For example, on the Web, the links to & from each page provide a great deal of information about the relevance of a web-page to a query, and thus must also be taken into consideration.
Data with Objects that are Graphs
If the objects contain sub-objects that have
relationships, then such objects are
frequently represented as graphs.
For ex, the structure of chemical
compounds can be represented by a graph,
where nodes are atoms and
links between nodes are chemical bonds.
ORDERED DATA
Sequential Data (Temporal Data)
This can be thought of as an extension of record-
data, where each record has a time associated with
it.
A time can also be associated with each attribute.
For example, each record could be the purchase
history of a customer, with a listing of items
purchased at different times.
Using this information, it is possible to find patterns
such as "people who buy DVD
players tend to buy DVDs in the period immediately
following the purchase."
Sequence Data
This consists of a data-set that is a sequence of
individual entities, such as a sequence of words
or letters.
This is quite similar to sequential data, except
that there are no time stamps; instead,
there are positions in an ordered sequence.
For example, the genetic information of plants
and animals can be represented in the form of
sequences of nucleotides that are known as
genes.
Time Series Data
This is a special type of sequential data in
which a series of measurements are taken
over time.
For example, a financial data-set might
contain objects that are time series of the
daily prices of various stocks.
An important aspect of temporal-data is
temporal-autocorrelation i.e. if two
measurements are close in time, then the
values of those measurements are often very
similar.
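Temporal autocorrelation can be measured directly. A minimal pandas sketch on a synthetic price series (illustration only):

```python
import numpy as np
import pandas as pd

# Synthetic daily prices: a slowly drifting series.
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 250).cumsum())

# Correlation between the series and itself shifted by one day.
print(prices.autocorr(lag=1))  # close to 1: nearby days have similar values
```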
Spatial Data
Some objects have spatial attributes,
such as positions or areas.
An example is weather-data (temperature, pressure) that is collected for a variety of geographical locations.
An important aspect of spatial-data is
spatial-autocorrelation i.e. objects that
are physically close tend to be similar
in other ways as well.
DATA PREPROCESSING
Data preprocessing is a broad area and consists of a number of different strategies and techniques that are interrelated in complex ways.
The different data preprocessing techniques are:
1. Aggregation 2. Sampling
3. Dimensionality reduction
4. Feature subset selection
5. Feature creation
6. Discretization and binarization
7. Variable transformation
AGGREGATION
This refers to combining 2 or more attributes (or objects) into a
single attribute (or object).
For example, merging daily sales figures to obtain monthly sales
figures
Motivations for aggregation:
Data reduction: The smaller data-sets require
→ less memory → less processing time.
Because of aggregation, more expensive algorithms can be used.
Aggregation can act as a change of scale by providing a high-level view of the data instead of a low-level view, e.g. cities aggregated into districts, states, countries, etc.
The behavior of groups of objects is often more stable than that
of individual objects.
Disadvantage: The potential loss of interesting details.
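The daily-to-monthly example as a minimal pandas sketch (the sales figures are synthetic):

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for one year.
days = pd.date_range("2023-01-01", periods=365, freq="D")
daily = pd.Series(np.random.default_rng(0).integers(50, 200, 365), index=days)

# Aggregation: combine daily figures into monthly totals (365 rows -> 12 rows).
monthly = daily.resample("MS").sum()
print(monthly.head())
```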
SAMPLING
This is a method used for selecting a subset of the data-
objects to be analyzed.
This is often used for both
→ preliminary investigation of the data
→ final data analysis
Q: Why sampling?
Ans: Obtaining & processing the entire set of “data of
interest” is too expensive or time consuming.
Sampling can reduce the data size to the point where a better but more expensive algorithm can be used.
Key principle for effective sampling: Using a sample will
work almost as well as using entire data-set, if the
sample is representative.
Sampling Methods
Simple Random Sampling
There is an equal probability of selecting any particular
object.
There are 2 variations on random sampling:
Sampling without Replacement
As each object is selected, it is removed from the
population.
Sampling with Replacement
Objects are not removed from the population as they are selected for the sample.
The same object can be picked more than once (both variations are sketched below).
When the population consists of different types (or numbers) of objects, simple random sampling can fail to adequately represent those types of objects that are less frequent.
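A minimal sketch of both variations using Python's random module:

```python
import random

population = list(range(100))
random.seed(0)

# Sampling without replacement: each object can be selected at most once.
without = random.sample(population, k=10)

# Sampling with replacement: the same object can be picked more than once.
with_repl = random.choices(population, k=10)

print(without)
print(with_repl)
```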
Stratified Sampling
This starts with pre-specified groups of
objects.
In the simplest version, equal numbers of
objects are drawn from each group even
though the
groups are of different sizes.
In another variation, the number of objects drawn from each group is proportional to the size of that group (sketched below).
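A minimal sketch of the proportional variation with pandas (the group labels and sizes are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A"] * 80 + ["B"] * 20,   # groups of different sizes
    "value": range(100),
})

# Draw from each group in proportion to its size (10% overall).
sample = df.groupby("group", group_keys=False).sample(frac=0.1, random_state=0)
print(sample["group"].value_counts())  # roughly 8 from A, 2 from B
```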
Progressive Sampling
If the proper sample size is difficult to determine, then progressive sampling can be used.
This method starts with a small sample, and
then increases the sample-size until a sample of
sufficient size has been obtained.
This method requires a way to evaluate the
sample to judge if it is large enough.
DIMENSIONALITY REDUCTION
Key benefit: many DM algorithms work better if
the dimensionality is lower.
Purpose
May help to eliminate irrelevant features or
reduce noise.
Can lead to a more understandable model
(which can be easily visualized).
Reduce amount of time and memory required
by DM algorithms.
Avoid curse of dimensionality.
The Curse of Dimensionality
Data-analysis becomes significantly harder as the
dimensionality of the data increases.
For classification, this can mean that there are not
enough data-objects to allow the creation of a
model that reliably assigns a class to all possible
objects.
For clustering, the definitions of density and the
distance between points (which are critical for
clustering) become less meaningful.
As a result, we get
→ reduced classification accuracy &
→ poor quality clusters.
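A minimal sketch of dimensionality reduction with principal component analysis (PCA) in scikit-learn, reducing the 4-dimensional Iris data to 2 dimensions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Project onto the 2 directions that capture the most variance.
X2 = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X2.shape)  # (150, 4) -> (150, 2)
```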
FEATURE SUBSET SELECTION
Another way to reduce the dimensionality is to use only
a subset of the features.
Although it might seem that such an approach would lose information, this is not the case if redundant or irrelevant features are present.
Redundant features duplicate much or all of the
information contained in one or more other attributes.
For example: purchase price of a product and the
amount of sales tax paid.
Irrelevant features contain almost no useful
information for the DM task at hand.
For example: students' ID numbers are irrelevant to the
task of predicting students' grade point averages.
Techniques for Feature Selection
1. Embedded approaches: Feature selection
occurs naturally as part of DM algorithm.
Specifically, during the operation of the DM
algorithm, the algorithm itself decides which
attributes to use and which to ignore.
2. Filter approaches: Features are selected before the DM algorithm is run.
3. Wrapper approaches: Use the DM algorithm as a black box to find the best subset of attributes (both are sketched below).
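Minimal sketches of the filter and wrapper approaches with scikit-learn (the choice of scoring function, model, and k=2 is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter approach: score each feature before any DM algorithm is run.
X_filter = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Wrapper approach: use a model as a black box to search for a good subset.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
X_wrap = rfe.fit_transform(X, y)

print(X_filter.shape, X_wrap.shape)  # both (150, 2)
```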
An Architecture for Feature Subset
Selection
The feature selection process is viewed as consisting of 4 parts:
→ a measure for evaluating a subset,
→ a search strategy that controls the generation of a new subset of features,
→ a stopping criterion, and
→ a validation procedure.
DATA MINING APPLICATIONS
Prediction & Description
Data mining may be used to answer questions like
→ "would this customer buy a product" or
→ "is this customer likely to leave?”
DM techniques may also be used for sales forecasting and
analysis.
Relationship Marketing
Customers have a lifetime value, not just the value of a
single sale.
Data mining can help
→ in analyzing customer profiles and improving direct
marketing plans
→ in identifying critical issues that determine client loyalty
and
→ in improving customer retention
Customer Profiling
This is the process of using the relevant and available
information
→ to describe the characteristics of a group of customers
→ to identify their discriminators from ordinary consumers
and
→ to identify drivers for their purchasing decisions
This can help an enterprise identify its most valuable
customers
so that the enterprise may differentiate their needs and
values.
Outliers Identification & Detecting Fraud
For this, examples include:
→ identifying unusual expense claims by staff
→ identifying anomalies in expenditure between similar units
of an enterprise
→ identifying fraud involving credit cards
Customer Segmentation
This is a way to assess & view individuals in market
based on their status & needs.
Data mining may be used
→ to understand & predict customer behavior and
profitability
→ to develop new products & services and
→ to effectively market new offerings
Web site Design & Promotion
Web mining may be used to discover how users
navigate a web site and the results
can help in improving the site design.
Web mining may also be used in cross-selling, by suggesting to a web customer items that they may be interested in.
