DM 1 PDF
Fundamentals
Why Data Mining?
Healthcare
Finance
Retail & E-Commerce
Learn about consumer preferences
Countless others!
History of Data Mining
data cleaning
data integration
data selection
data transformation
data mining
Data Mining
A typical data mining system may have the
following major components:
A database, data warehouse, or other
information repository, which consists of the set
of databases, data warehouses, spreadsheets etc.
A database or data warehouse server which
fetches the relevant data based on users’ data
mining requests.
A knowledge base that contains the domain
knowledge used to guide the search or to
evaluate the interestingness of resulting
patterns. For example, the knowledge base may
contain metadata which describes data from
multiple heterogeneous sources.
A data mining engine, which consists of a set
of functional modules for tasks such as
characterization, association, classification, cluster
analysis, and evolution and deviation analysis.
A pattern evaluation module that works in
tandem with the data mining modules by
employing interestingness measures to help
focus the search towards interesting
patterns.
A graphical user interface that allows the
user to interact with the data mining
system.
Data Mining
Data to be mined
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
• Object‐oriented and object‐relational databases
• Spatial databases
• Time‐series data and temporal data
• Text databases and multimedia databases
• Heterogeneous and legacy databases
• WWW
Data Mining functionalities
Data mining functionalities are used to specify
the kind of patterns to be found in data
mining tasks.
[Figure: data mining tasks are divided into predictive and descriptive.
Predictive tasks: classification, regression, time series analysis, prediction.]
Data Mining functionalities
[Figure: data mining tasks are divided into predictive and descriptive.
Descriptive tasks: association rules, summarization, clustering.]
Data Mining functionalities
Concept description: Characterization and
discrimination
Data can be associated with classes or concepts.
• Example: in the AllElectronics store, classes of items for sale
include computers and printers.
A description of a class or concept, called a
class/concept description, can be derived in two ways:
• data characterization, by summarizing the data of the
class under study (often called the target class)
• data discrimination, by comparing the target class
with one or a set of comparative classes (often called
the contrasting classes)
Data Mining functionalities
Data characterization is a summarization of
the general characteristics or features of a
target class of data.
Example: summarizing the characteristics of
customers who spend more than $1,000 a year. The
result could be a general profile of these customers,
such as being 40–50 years old, employed, and
having excellent credit ratings.
The output of data characterization can be
presented in various forms.
Examples include pie charts, bar charts, curves,
multidimensional data cubes, and multidimensional
tables, including crosstabs.
Data Mining functionalities
Data discrimination is a comparison of the
general features of target class data objects
with the general features of objects from one
or a set of contrasting classes.
The target and contrasting classes can be
specified by the user, and the corresponding
data objects retrieved through database
queries.
Example: two groups of customers, such as
those who shop for computer products regularly
versus those who rarely shop for such products
Data Mining functionalities
Association Rules – Association analysis tries to find
relationships between data items. It is also called
link analysis or affinity analysis.
The best-known application of this task is association rules,
a model that identifies specific types of data
associations.
• Example: buys(X, “computer”) ⇒ buys(X, “software”)
• Multidimensional association rule example:
age(X, “20...29”) ∧ income(X, “20K...29K”) ⇒ buys(X, “CD player”)
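A rule such as buys(X, “computer”) ⇒ buys(X, “software”) is usually judged by its support and confidence. These two measures are not named in the slides; they are the common choice for association rules. The sketch below computes them on made-up transactions. The function names and the data are illustrative assumptions.

```python
# Toy transactions; item names and contents are made up for illustration.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(rule_support, rule_confidence)
```

Here two of the four transactions contain both items, and two of the three computer buyers also bought software.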
Data Mining functionalities
Classification – Classification is the process of
finding a model (or function) that describes and
distinguishes data classes or concepts,
so that the model can be used to predict
the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is
known).
It maps data onto predefined groups or classes.
This is called supervised learning because the classes are
decided before examining the data.
Classes are decided based on the characteristics of data
already belonging to the class.
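A minimal sketch of this idea, assuming a nearest-centroid classifier (one simple technique among many, not one prescribed by the slides): the model is the average feature vector of each class in the training data, and an unlabeled object gets the class of the closest average. The attributes and data are hypothetical.

```python
# Hypothetical training data: (age, yearly_spend) -> known class label.
training = [
    ((25, 200), "low"),
    ((30, 300), "low"),
    ((45, 1500), "high"),
    ((50, 1800), "high"),
]

def fit_centroids(samples):
    """Average the feature vectors of each class; the result is the model."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist2(centroids[label]))

centroids = fit_centroids(training)
print(predict(centroids, (48, 1600)))  # class label for an unseen object
```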
Data Mining functionalities
Pattern recognition is a type of
classification, where a given pattern is
classified into one of several classes based on
its similarity with predefined patterns.
Data Mining functionalities
Regression – maps a data item to a real-valued
prediction variable. It assumes that the
target data fit some known type of function
and tries to find the function of that type
that best models the given data.
Error analysis is used to determine which
function is the best.
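As an illustration, the sketch below assumes the known type of function is a straight line, fits it by ordinary least squares, and uses the sum of squared errors to compare it with a constant model. The data are made up and the helper names are assumptions.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up data lying exactly on y = 1 + 2x

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sse(predict, xs, ys):
    """Sum of squared errors of a candidate function on the data."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))

intercept, slope = fit_line(xs, ys)
mean_y = sum(ys) / len(ys)

# Error analysis: the candidate with the smaller error models the data better.
line_error = sse(lambda x: intercept + slope * x, xs, ys)
constant_error = sse(lambda x: mean_y, xs, ys)
print(intercept, slope, line_error, constant_error)
```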
Data Mining functionalities
Prediction – In many DM applications
future data is predicted based on current or
past data.
Examples are:
flood prediction
speech recognition
machine learning
pattern recognition
Data Mining functionalities
Cluster Analysis:
Clustering analyzes data objects without
consulting a known class label.
The objects are clustered or grouped based on
the principle of maximizing the intraclass
similarity and minimizing the interclass
similarity.
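One common way to group objects without class labels is k-means. The slides do not name an algorithm, so this is an assumed example. The sketch works on plain numbers with fixed starting centers so that the result is reproducible.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign to nearest center, then recompute centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # made-up data with two obvious groups
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, clusters)
```

Objects inside each resulting group are close to one another, while the two groups are far apart.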
Data Mining functionalities
Outlier Analysis:
A database may contain data objects that do not
comply with the general behaviour or model of
the data. These data objects are outliers.
The analysis of outlier data is referred to as
outlier mining.
Example: outlier analysis may uncover
fraudulent usage of credit cards by detecting
purchases of extremely large amounts for a
given account number in comparison to regular
charges incurred by the same account.
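The credit card example can be sketched with a simple rule: flag any charge that is many times larger than the account's median charge. The factor of 5 and the charge amounts are arbitrary assumptions for illustration, not a recommended fraud threshold.

```python
from statistics import median

def flag_large_charges(charges, factor=5.0):
    """Return charges more than `factor` times the account's median charge."""
    typical = median(charges)
    return [c for c in charges if c > factor * typical]

# Made-up charges for a single account.
charges = [42.0, 55.0, 38.0, 61.0, 47.0, 2500.0]
suspicious = flag_large_charges(charges)
print(suspicious)
```

The median is used instead of the mean because a single extreme charge barely moves it.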
Interestingness in pattern
A data mining system has the potential to
generate thousands or even millions of
patterns. Are all of the patterns
interesting? No.
A pattern is interesting if it is
easily understood by humans
valid on new data with some degree of certainty
potentially useful
novel
A pattern is also interesting if it validates
a hypothesis that the user sought to confirm.
Interestingness in pattern
Can a data mining system generate all of the
interesting patterns?
It is often unrealistic and inefficient for data
mining systems to generate all of the possible
patterns.
Instead, user-provided constraints and interestingness
measures should be used to focus the search.
Interestingness in pattern
Can a data mining system generate only
interesting patterns?
Generating only interesting patterns is an
optimization problem in data mining.
It is highly desirable for data mining systems to
generate only interesting patterns.
Related technologies
Machine learning
is the field of study that gives computers the
capability to learn without being explicitly
programmed.
is an application of artificial intelligence (AI)
that gives systems the ability to
• learn automatically and
• improve from experience
The primary aim is to allow computers to learn
automatically, without human intervention or
assistance, and adjust their actions accordingly.
Related technologies
Statistics:
is a branch of mathematics working with data
collection, organization, analysis, interpretation
and presentation.
Statistics is a term used to summarize a process
that an analyst uses to characterize a data set.
Related technologies
Visualisation
is any technique for creating images, diagrams,
or animations to communicate a message.
It is the art of making data beautiful
Information science
is the study of processes for storing and retrieving
information.
is a field primarily concerned with the analysis,
collection, classification, manipulation, storage,
retrieval, movement, dissemination, and
protection of information.
Related technologies
Database technology
is a computer-based record-keeping system
which is used to record, maintain, and retrieve
data.
A database is an organized collection of interrelated
(persistent) data.
It facilitates the storage, retrieval, modification,
and deletion of data in conjunction with various
data-processing operations.
Related technologies
Other technologies: depending on the
requirements, technologies from other
domains can be incorporated.
neural networks
fuzzy logic
rough set theory,
inductive logic programming,
high-performance computing.
Classification of Data Mining
Systems
Classification according to the kinds of
databases mined
Database systems can be classified according to
different criteria (such as data models, or the types
of data or applications involved), each of which may
require its own data mining technique.
• For instance, if classifying according to data models, we
may have a relational, transactional, object-relational, or
data warehouse mining system.
If classifying according to the special types of data
handled, we may have a spatial, time-series, text,
stream data, or multimedia data mining system, or a
World Wide Web mining system.
Classification of Data Mining
Systems
Classification according to the kinds of
knowledge mined:
Data mining systems can be categorized
according to the kinds of knowledge they mine,
that is, based on data mining functionalities.
• Ex: characterization, discrimination, association and
correlation analysis, classification, prediction,
clustering, outlier analysis, and evolution analysis.
Classification of Data Mining
Systems
Classification according to the
granularity or levels of abstraction:
data mining systems can be distinguished based
on the granularity or levels of abstraction of the
knowledge mined.
• generalized knowledge - at a high level of abstraction
• primitive-level knowledge - at a raw data level
• Knowledge at multiple levels (considering several
levels of abstraction).
• An advanced data mining system should facilitate the
discovery of knowledge at multiple levels of
abstraction.
Classification of Data Mining
Systems
Data mining systems can also be categorized
as those that
mine data regularities (commonly occurring
patterns) versus
those that mine data irregularities (such as
exceptions, or outliers).
In general, concept description, association and
correlation analysis, classification, prediction, and
clustering mine data regularities, rejecting
outliers as noise.
Classification of Data Mining
Systems
Classification according to the kinds of
techniques utilized:
Data mining systems can be categorized
according to the underlying data mining
techniques employed.
• database-oriented or data warehouse-oriented
techniques,
• or machine learning, statistics, visualization, pattern
recognition, neural networks, and so on.
A sophisticated data mining system will often
adopt multiple data mining techniques
Classification of Data Mining
Systems
Classification according to the applications
adapted:
For example, data mining systems may be
tailored specifically for finance,
telecommunications, DNA, stock markets,
e-mail, and so on.
Different applications often require the
integration of application-specific methods.
Data Mining Task Primitives
A user wants to perform some form of data
analysis.
A data mining task can be specified in the
form of a data mining query.
A data mining query is defined in terms of
data mining task primitives.
These primitives allow the user to
interactively communicate with the data
mining system
Data Mining Task Primitives
The set of task-relevant data to be mined:
This specifies the portions of the database or the
set of data in which the user is interested.
• For example: the database attributes or data
warehouse dimensions of interest. It is also referred
to as the relevant attributes or dimensions.
The kind of knowledge to be mined:
This specifies the data mining functions to be
performed,
• Example characterization, discrimination, association
or correlation analysis, classification, prediction,
clustering, outlier analysis, or evolution analysis.
Data Mining Task Primitives
The background knowledge to be used in the
discovery process:
The domain knowledge is useful for guiding the
knowledge discovery process and for evaluating
the patterns found.
Concept hierarchies are a popular form of
background knowledge, which allow data to be
mined at multiple levels of abstraction.
Data Mining Task Primitives
The interestingness measures and
thresholds for pattern evaluation:
They may be used to guide the mining process
or,
• after discovery, to evaluate the discovered patterns.
The expected representation for visualizing
the discovered patterns:
This refers to the form in which discovered
patterns are to be displayed,
• Such as rules, tables, charts, graphs, decision trees,
and cubes.
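The five primitives can be pictured as the fields of a query object. The sketch below is only an illustration: the class name, field names, and example values are all hypothetical and do not correspond to any real data mining query language.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MiningQuery:
    """One field per task primitive; every name here is hypothetical."""
    task_relevant_data: List[str]   # attributes or dimensions of interest
    knowledge_kind: str             # e.g. "classification", "clustering"
    # Concept hierarchies, listed from low to high level of abstraction.
    background_knowledge: Dict[str, List[str]] = field(default_factory=dict)
    # Interestingness measure name -> threshold.
    interestingness: Dict[str, float] = field(default_factory=dict)
    presentation: str = "rules"     # rules, tables, charts, graphs, ...

query = MiningQuery(
    task_relevant_data=["age", "income", "item_purchased"],
    knowledge_kind="association",
    background_knowledge={"location": ["street", "city", "country"]},
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    presentation="table",
)
print(query)
```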
Major Issues in Data Mining
Human Interaction:
As data mining problems are not precisely
stated, interfaces may be needed with both
domain and technical experts.
• Technical experts are needed to formulate the queries
and assist in interpreting the results.
• Users are needed to identify training data and desired
results.
Major Issues in Data Mining
Overfitting:
When a model is generated that is associated with a
given database state, it is desirable that the model also
fit future database states.
Overfitting occurs when the model does not fit future
states.
There are two reasons for overfitting:
• It may be caused by assumptions that are made about the data, or
• because the training database is too small.
For example, a classification model for an employee database
may be developed to classify employees as short, medium, or tall.
If the training database is quite small, the model might
erroneously indicate that a short person is anyone under five feet
eight inches because there is only one entry in the training
database under five feet eight. In this case, many future
employees would be erroneously classified as short.
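The height example can be reproduced with a deliberately naive learner. Everything in this sketch is an assumption made for illustration: heights are in inches, the learner puts the "short" cutoff at the shortest non-short training example, and the "true" cutoff of 64 inches is invented.

```python
def learn_short_cutoff(training):
    """Naive rule: anyone below the shortest non-'short' example is 'short'."""
    return min(height for height, label in training if label != "short")

# Tiny training set: only one entry is under 5 ft 8 in (68 inches).
training = [(62, "short"), (68, "medium"), (70, "medium"), (75, "tall")]
cutoff = learn_short_cutoff(training)

TRUE_CUTOFF = 64  # assumed real boundary between short and medium

# Future employees who are not really short but fall under the learned cutoff.
future_heights = [63, 65, 66, 67, 69]
misclassified = [h for h in future_heights if TRUE_CUTOFF <= h < cutoff]
print(cutoff, misclassified)
```

With more training entries between 62 and 68 inches, the learned cutoff would move closer to the real boundary.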
Major Issues in Data Mining
Outliers:
There are often many data entries that do not fit
nicely into the derived model.
If a model is developed that includes these
outliers, then the model may not behave well for
data that are not outliers.
Interpretation of results:
The data mining output may require experts to
correctly interpret the results.
Major Issues in Data Mining
Visualization of results:
To easily view and understand the output of
data mining algorithms, visualization of the
results is helpful
Large datasets:
Most real datasets are massive, while many
algorithms are designed for small datasets.
The cost of many modeling applications grows exponentially
with the dataset size, so they are too inefficient
for larger datasets.
This is the scalability problem.
Major Issues in Data Mining
High dimensionality:
Not all attributes may be needed to solve a given
data mining problem.
• In fact, the use of some attributes may interfere with the
correct completion of a data mining task.
• Some other attributes may simply increase the overall
complexity and decrease the efficiency of an algorithm.
This problem is referred to as the dimensionality
curse, meaning that there are many attributes
involved and it is difficult to determine which ones
should be used.
One solution to this high dimensionality problem is
to reduce the number of attributes, which is known
as dimensionality reduction. However, determining
which attributes are not needed is not always easy to
do.
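One very simple form of dimensionality reduction is to drop attributes that never vary, since they cannot help distinguish one object from another. This filter is an assumed example, not a method the slides prescribe. The attribute names and rows are made up.

```python
from statistics import pvariance

def drop_low_variance(rows, names, threshold=0.0):
    """Keep only attributes whose population variance exceeds threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return [names[i] for i in keep], reduced

names = ["age", "country_code", "income"]
rows = [
    [25, 1, 30000],
    [40, 1, 52000],
    [33, 1, 41000],
]
kept_names, reduced = drop_low_variance(rows, names)
print(kept_names, reduced)
```

An attribute that varies can still be irrelevant to the task, so a filter like this only removes the easiest cases.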
Major Issues in Data Mining
Multimedia data:
Most previous data mining algorithms are targeted
to traditional data types (numeric, character, text,
etc.).
The use of multimedia data or GIS databases
complicates or invalidates many proposed
algorithms.
Missing data:
During the pre-processing phase of KDD, missing
data may be replaced with estimates.
This and other approaches to handling missing data
can lead to invalid results in the data mining step.
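A small sketch of one such estimate, assuming missing values are replaced by the mean of the observed values (the data are made up). It also shows one way the result can mislead: the filled-in data look less spread out than the observed data.

```python
from statistics import pvariance

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

incomes = [30.0, None, 50.0, None, 70.0]  # None marks a missing value
filled = impute_mean(incomes)

observed = [v for v in incomes if v is not None]
spread_observed = pvariance(observed)
spread_filled = pvariance(filled)  # smaller: imputed values sit at the mean
print(filled, spread_observed, spread_filled)
```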
Major Issues in Data Mining
Irrelevant data:
Some attributes in the database might not be of
interest to the data mining task being
developed. The difficulty is identifying them.
Noisy data:
Some attribute values might be invalid or
incorrect.
These values need to be corrected before
running data mining applications.
Major Issues in Data Mining
Changing data:
Databases cannot be assumed to be static.
However, most data mining algorithms do assume a
static database.
This requires that the algorithm be completely rerun
anytime the database changes.
Integration:
The KDD process is not currently integrated into normal
data processing activities.
KDD requests may be treated as special, unusual, or
one-time needs.
This makes them inefficient, ineffective, and not general
enough to be used on an ongoing basis.
Integration of data mining functions into traditional
DBMS systems is certainly a desirable goal.
Major Issues in Data Mining
Application:
Determining the intended use for the
information obtained from the data mining
function is a challenge.
Indeed, how business executives can effectively
use the output is sometimes considered the more
difficult part, not the running of the algorithms
themselves.
Because the data are of a type that has not
previously been known, business practices may
have to be modified to determine how to
effectively use the information uncovered.
Data mining metrics
Measuring the effectiveness or usefulness of a
data mining approach is not straightforward.
In fact, different metrics could be used for
different techniques and also based on the
interest level.
From an overall business or usefulness
perspective, a measure such as return on
investment (ROI) could be used.
ROI examines the difference between what the data
mining technique costs and what the savings or
benefits from its use are.
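A minimal sketch of this comparison with made-up figures. The net benefit is the difference described above. The ratio form, net benefit divided by cost, is a common way of expressing ROI and is an assumption here rather than something the slides define.

```python
def net_benefit(cost, benefit):
    """Savings or benefits minus what the data mining technique costs."""
    return benefit - cost

def roi(cost, benefit):
    """Net benefit expressed as a fraction of the cost."""
    return net_benefit(cost, benefit) / cost

cost = 20000.0     # hypothetical cost of the data mining project
benefit = 50000.0  # hypothetical increase in sales attributed to it
print(net_benefit(cost, benefit), roi(cost, benefit))
```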
Data mining metrics
But the return is hard to quantify.
It could be measured as increased sales, reduced
advertising expenditure, or both.
In a specific advertising campaign implemented
via targeted catalogue mailings, the percentage
of catalogue recipients and the amount of
purchase per recipient would provide one means
to measure the effectiveness of the mailing.
In a classroom setting, classification accuracy
is the most commonly used metric.
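Accuracy is the fraction of objects whose predicted class matches the actual class. The labels below are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual class labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

actual = ["short", "tall", "medium", "tall"]
predicted = ["short", "tall", "tall", "tall"]
score = accuracy(predicted, actual)
print(score)
```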