DM 1 PDF

The document provides an overview of data mining fundamentals. It discusses that data mining helps extract useful information and patterns from large amounts of raw data. Some common uses of data mining mentioned are in healthcare, finance, and retail to learn consumer preferences. The history of data mining is then outlined, from the beginnings of data collection and access in the 1960s-1980s, to data warehousing and decision reports in the 1990s, to the current focus on data mining. Some influential people and events in the development of data mining are also noted. Finally, the document discusses key data mining terminology, components, functionalities, and techniques.

Uploaded by

Rahul Pawar
Data Mining Fundamentals
Why Data Mining?

Almost everything we do leaves electronic data behind
We need to extract useful information from data in order to interpret it
Data mining helps speed up the process of finding relationships and patterns in raw data
Uses of Data Mining

Healthcare
Finance
Retail & E-Commerce
Learn about consumer preferences
Countless others!
History of Data Mining

The term “data mining” is relatively new, but the concepts have been around for many years
Classical statistics, artificial intelligence and machine learning converged over the years and evolved into data mining
History of Data Mining (Cont.)

Data Collection (1960s): the process of storing information on computers
Technology: computers, tapes and disks
Data Access (1980s): the introduction of structured query languages and relational databases helped us learn more about data
Data available at record level dynamically
History of Data Mining (Cont.)

Data Warehousing and Decision Report (1990s): the process of centralized data management and retrieval
Maintaining a central location for all organizational data
Helps you analyze data and concentrate on very specific characteristics
Dynamic data delivery at multiple levels
Data Mining (present): generalizing patterns, predictive
Influential People/Events

In 1975, John Henry Holland wrote Adaptation in Natural and Artificial Systems, a book on genetic algorithms and an early starting point for data mining
In the 1990s, the term “data mining” appeared in the database community for the first time
In 2001, William S. Cleveland introduced data science as an independent discipline
In February 2015, DJ Patil became the first Chief Data Scientist in the White House
Terminology

Data: facts, numbers or text that can be processed by a computer
Information: the patterns, associations and relationships among data
Knowledge: understanding of a subject; information is synthesized to gain knowledge about historical patterns and future trends
Data vs. Information vs. Knowledge
Data Mining
Definitions
It is the process of discovering meaningful new
correlations, patterns and trends by sifting
through large amounts of data stored in
repositories, using pattern recognition techniques
as well as statistical and mathematical
techniques.
It is the non-trivial extraction of implicit,
previously unknown and potentially useful
information from data.
Data Mining
Definitions
It is the search for relationships and global patterns
that exist in large databases but are hidden
among vast amounts of data, such as the relationship
between patient data and their medical
diagnosis.
It is the exploration and analysis, by automatic or
semi-automatic means, of large quantities of data in
order to discover meaningful patterns
Data Mining
What is not Data Mining?
Look up a phone number in a phone directory
Query a Web search engine for information about “Amazon”
What is Data Mining?
Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Knowledge Discovery from Data
Knowledge Discovery from Data

Knowledge discovery consists of an iterative sequence of the following steps:
data cleaning: to remove noise or irrelevant data
data integration: where multiple data sources may be combined
data selection: where data relevant to the analysis task are retrieved from the database
data transformation: where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations
data mining: an essential process where intelligent methods are applied in order to extract data patterns
Knowledge Discovery from Data

Knowledge discovery consists of an iterative sequence of the following steps:
pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures
knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user
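The steps listed above can be sketched as a small chain of functions. This is only an illustrative toy, not a real mining system: the records, the function names, and the "above-average spender" pattern are all assumptions made up for the example.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "amount": 120.0}, {"customer": "c2", "amount": None}]
store_b = [{"customer": "c1", "amount": 80.0}, {"customer": "c3", "amount": 40.0}]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, customers):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r["customer"] in customers]

def transform(records):
    """Data transformation: aggregate total spend per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

def mine(totals):
    """Data mining (toy): find customers who spend above the average."""
    average = mean(totals.values())
    return {c: t for c, t in totals.items() if t > average}

def evaluate(patterns, min_spend):
    """Pattern evaluation: keep patterns that pass an interestingness threshold."""
    return {c: t for c, t in patterns.items() if t >= min_spend}

records = integrate(clean(store_a), clean(store_b))
totals = transform(select(records, {"c1", "c3"}))
patterns = evaluate(mine(totals), min_spend=100.0)
print(patterns)  # knowledge presentation: here, simply printing the result
```

In practice the sequence is iterative, so the output of a later step may send the analyst back to an earlier one.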
Knowledge Discovery from Data
[Figure: data cleaning, data integration, data selection and data transformation lead into data mining]
Data Mining
A typical data mining system may have the
following major components:
A database, data warehouse, or other
information repository, which consists of the set
of databases, data warehouses, spreadsheets etc.
A database or data warehouse server which
fetches the relevant data based on users’ data
mining requests.
Data Mining
A typical data mining system may have the
following major components:
A knowledge base that contains the domain
knowledge used to guide the search or to
evaluate the interestingness of resulting
patterns. For example, the knowledge base may
contain metadata which describes data from
multiple heterogeneous sources.
Data Mining
A typical data mining system may have the
following major components:
A data mining engine, which consists of a set
of functional modules for tasks such as
characterization, association, classification, cluster
analysis, and evolution and deviation analysis.
A pattern evaluation module that works in
tandem with the data mining modules by
employing interestingness measures to help
focus the search towards interesting
patterns.
Data Mining
A typical data mining system may have the
following major components:
A graphical user interface that allows the
user an interactive approach to the data mining
system.
Data Mining
Data to be mined
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
• Object‐oriented and object‐relational databases
• Spatial databases
• Time‐series data and temporal data
• Text databases and multimedia databases
• Heterogeneous and legacy databases
• WWW
Data Mining functionalities
Data mining functionalities are used to specify
the kind of patterns to be found in data
mining tasks.

Descriptive data mining: characterizes the general properties of the data
Predictive data mining: performs inference on the data in order to make predictions
A predictive model makes predictions
about data values using the results
found from available data. It therefore makes
use of historical data to make predictions
A descriptive model identifies patterns or
relationships in data. It finds the
properties of existing data and does not
predict new properties
Data Mining functionalities

Predictive data mining: classification, regression, time series analysis, prediction
Data Mining functionalities

Descriptive data mining: association rules, summarization, clustering
Data Mining functionalities
Concept description: Characterization and
discrimination
Data can be associated with classes or concepts
• Ex. In the AllElectronics store, classes of items for sale
include computers and printers.
The description of a class or concept, called a
class/concept description, can be derived in 2 ways.
• data characterization, by summarizing the data of the
class under study (often called the target class)
• data discrimination, by comparison of the target class
with one or a set of comparative classes (often called
the contrasting classes)
Data Mining functionalities
Data characterization is a summarization of
the general characteristics or features of a
target class of data.
Example : summarizing the characteristics of
customers who spend more than $1,000 a year. The
result could be a general profile of the customers,
such as they are 40–50 years old, employed, and
have excellent credit ratings.
The output of data characterization can be
presented in various forms.
Examples include pie charts, bar charts, curves,
multidimensional data cubes, and multidimensional
tables, including crosstabs.
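As a rough illustration of data characterization, the sketch below summarizes a target class of customers who spend more than $1,000 a year. The customer records and field names are invented for the example; a real system would typically run such summaries as database queries.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; all values are invented for illustration.
customers = [
    {"age": 45, "employed": True, "credit": "excellent", "spend": 1500},
    {"age": 42, "employed": True, "credit": "excellent", "spend": 2200},
    {"age": 23, "employed": False, "credit": "fair", "spend": 300},
    {"age": 48, "employed": True, "credit": "good", "spend": 1200},
]

# Target class: customers who spend more than $1,000 a year.
target = [c for c in customers if c["spend"] > 1000]

# General profile of the target class.
profile = {
    "count": len(target),
    "mean_age": mean(c["age"] for c in target),
    "employed_share": sum(c["employed"] for c in target) / len(target),
    "most_common_credit": Counter(c["credit"] for c in target).most_common(1)[0][0],
}
print(profile)
```

The resulting profile could then be shown as a table or chart, in line with the output forms listed above.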
Data Mining functionalities
Data discrimination is a comparison of the
general features of target class data objects
with the general features of objects from one
or a set of contrasting classes.
The target and contrasting classes can be
specified by the user, and the corresponding
data objects retrieved through database
queries.
Example: two groups of customers, such as
those who shop for computer products regularly
versus those who rarely shop for such products
Data Mining functionalities
Association Rules – Try to find
relationships between data. Also called
link analysis or affinity analysis
The best-known application of this task is association rules,
a model that identifies specific types of data
associations.
• Example: buys(X, “computer”) ⇒ buys(X, “software”)
• Multidimensional association rule example:
• age(X, “20…29”) ∧ income(X, “20K…29K”) ⇒ buys(X, “CD player”)
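A rule such as buys(X, “computer”) ⇒ buys(X, “software”) is usually judged by its support and confidence. Those two measures are standard in association analysis but are not named in the slides, so treat their use here as an assumption. The transactions below are made up.

```python
# Hypothetical market-basket transactions (sets of purchased items).
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(rule_support, rule_confidence)
```

Here half of all transactions contain both items, and two of the three computer buyers also bought software.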
Data Mining functionalities
Classification – Classification is the process of
finding a model (or function) that describes and
distinguishes data classes or concepts,
for the purpose of being able to use the model to predict
the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is
known).
It maps data onto predefined groups or classes.
This is called supervised learning, as the classes are
decided before examining the data.
Classes are decided based on the characteristics of data
already belonging to the class
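One very simple way to build such a model from labeled training data is a 1-nearest-neighbour classifier. The slides do not prescribe a particular classifier, so this choice, the attribute pair, and the class names are assumptions for illustration only.

```python
# Hypothetical training data: (age, income in $1000s) -> known class label.
training = [
    ((25, 30), "rarely_buys"),
    ((30, 35), "rarely_buys"),
    ((45, 90), "buys_regularly"),
    ((50, 85), "buys_regularly"),
]

def classify(point, training):
    """Predict the label of the closest training object (1-nearest-neighbour)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training, key=lambda pair: squared_distance(pair[0], point))
    return label

# An object whose class label is unknown.
prediction = classify((48, 80), training)
print(prediction)
```

The classes are fixed in advance by the training labels, which is what makes this supervised learning.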
Data Mining functionalities
Pattern recognition is a type of
classification, where a given pattern is
classified into one of several classes based on
its similarity with predefined patterns.
Data Mining functionalities
Regression – maps a data item to a
real-valued prediction variable. It
assumes that the target data fit some
known type of function and tries to find the
function of that type that best models the given data.
Error analysis is used to determine which
function is the best.
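The sketch below fits a straight line by ordinary least squares and then uses the sum of squared errors to compare it with a constant function. The data points and the two candidate function types are assumptions chosen for the example.

```python
from statistics import mean

# Hypothetical data that roughly follow y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

def sse(predict, xs, ys):
    """Error analysis: sum of squared errors of a candidate function."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))

slope, intercept = fit_line(xs, ys)
candidates = {
    "constant": lambda x: mean(ys),
    "linear": lambda x: slope * x + intercept,
}
errors = {name: sse(f, xs, ys) for name, f in candidates.items()}
best = min(errors, key=errors.get)
print(best, slope, intercept)
```

Comparing errors on the same data that was used for fitting tends to favour more flexible functions, so a held-out set is normally preferred for this comparison.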
Data Mining functionalities
Prediction – In many DM applications
future data is predicted based on current or
past data.
Examples are
prediction of flooding
Speech recognition
Machine learning
Pattern recognition
Data Mining functionalities
Cluster Analysis:
clustering analyzes data objects without
consulting a known class label
The objects are clustered or grouped based on
the principle of maximizing the intraclass
similarity and minimizing the interclass
similarity.
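A minimal sketch of this idea is k-means on one-dimensional values. The slides do not name a clustering algorithm, so k-means, the starting centroids, and the spend values are assumptions made for the example.

```python
from statistics import mean

def k_means_1d(values, centroids, iterations=10):
    """Tiny k-means on numbers: assign each value to its nearest centroid, then update."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters, centroids

# Hypothetical yearly spend values with no class labels attached.
spend = [100, 120, 110, 900, 950, 1000]
clusters, centroids = k_means_1d(spend, centroids=[100, 1000])
print(clusters, centroids)
```

Values within each resulting group are close to one another and far from the other group, which is the similarity principle stated above.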
Data Mining functionalities
Outlier Analysis:
A database may contain data objects that do not
comply with the general behaviour or model of
the data. These data objects are outliers.
The analysis of outlier data is referred to as
outlier mining.
Example : Outlier analysis may uncover
fraudulent usage of credit cards by detecting
purchases of extremely large amounts for a
given account number in comparison to regular
charges incurred by the same account.
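The credit card example can be sketched with a robust rule based on the median absolute deviation (MAD). The slides do not specify a detection method, so the MAD rule, the threshold of 5, and the charge amounts are assumptions for illustration.

```python
from statistics import median

def find_outliers(charges, threshold=5.0):
    """Flag charges whose distance from the median exceeds threshold * MAD."""
    centre = median(charges)
    mad = median([abs(c - centre) for c in charges])
    if mad == 0:
        return []  # all charges are (almost) identical; nothing to flag
    return [c for c in charges if abs(c - centre) / mad > threshold]

# Hypothetical charges on one account; the last one is unusually large.
charges = [25, 40, 30, 35, 28, 5000]
suspicious = find_outliers(charges)
print(suspicious)
```

The median is used instead of the mean because a single extreme charge would otherwise pull the baseline towards itself.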
Interestingness in pattern
A data mining system has the potential to
generate thousands or even millions of
patterns. So, are all of the patterns
interesting? No.
A pattern is interesting if it is
easily understood by humans
valid on new data with some degree of certainty
potentially useful
novel
or if it validates a hypothesis that the user sought to
confirm
Interestingness in pattern
Can a data mining system generate all of the
interesting patterns?
It is often unrealistic and inefficient for data
mining systems to generate all of the possible
patterns.
User-provided constraints and interestingness
measures should be used to focus the search.
Interestingness in pattern
Can a data mining system generate only
interesting patterns?
This is an optimization problem in data mining.
It is highly desirable for data mining systems to
generate only interesting patterns.
Related technologies
Related technologies
Machine learning
It is the field of study that gives computers the
capability to learn without being explicitly
programmed.
It is an application of artificial intelligence (AI)
that provides systems
• the ability to automatically learn and
• improve from experience
The primary aim is to allow computers to learn
automatically without human intervention or
assistance and adjust actions accordingly.
Related technologies
Statistics:
is a branch of mathematics working with data
collection, organization, analysis, interpretation
and presentation.
Statistics is a term used to summarize a process
that an analyst uses to characterize a data set.
Related technologies
Visualisation
is any technique for creating images, diagrams,
or animations to communicate a message.
It is the art of making data beautiful
information science
the study of processes for storing and retrieving
information.
is a field primarily concerned with the analysis,
collection, classification, manipulation, storage,
retrieval, movement, dissemination, and
protection of information
Related technologies
Database technology
It is a computer-based record-keeping system
which is used to record, maintain and retrieve
data.
It is an organized collection of interrelated
(persistent) data.
It facilitates the storage, retrieval, modification,
and deletion of data in conjunction with various
data-processing operations
Related technologies
Other technologies: depending upon the
requirement, technologies from other
domains can be incorporated.
neural networks
fuzzy logic
rough set theory,
inductive logic programming,
high-performance computing.
Classification of Data Mining
Systems
Classification according to the kinds of
databases mined
Database systems can be classified according to
different criteria (such as data models, or the types
of data or applications involved), each of which may
require its own data mining technique.
• For instance, if classifying according to data models, we
may have a relational, transactional, object-relational, or
data warehouse mining system.
If classifying according to the special types of data
handled, we may have a spatial, time-series, text,
stream data, multimedia data mining system, or a
WorldWideWeb mining system.
Classification of Data Mining
Systems
Classification according to the kinds of
knowledge mined:
Data mining systems can be categorized
according to the kinds of knowledge they mine,
that is, based on data mining functionalities.
• Ex: characterization, discrimination, association and
correlation analysis, classification, prediction,
clustering, outlier analysis, and evolution analysis.
Classification of Data Mining
Systems
Classification according to the
granularity or levels of abstraction:
data mining systems can be distinguished based
on the granularity or levels of abstraction of the
knowledge mined.
• generalized knowledge - at a high level of abstraction
• primitive-level knowledge - at a raw data level
• Knowledge at multiple levels (considering several
levels of abstraction).
• An advanced data mining system should facilitate the
discovery of knowledge at multiple levels of
abstraction.
Classification of Data Mining
Systems
Data mining systems can also be categorized
as those that
mine data regularities (commonly occurring
patterns) versus
those that mine data irregularities (such as
exceptions, or outliers).
In general, concept description, association and
correlation analysis, classification, prediction,
and clustering mine data regularities, rejecting
outliers as noise.
Classification of Data Mining
Systems
Classification according to the kinds of
techniques utilized:
Data mining systems can be categorized
according to the underlying data mining
techniques employed.
• database-oriented or data warehouse-oriented
techniques,
• or machine learning, statistics, visualization, pattern
recognition, neural networks, and so on
A sophisticated data mining system will often
adopt multiple data mining techniques
Classification of Data Mining
Systems
Classification according to the applications
adapted:
For example, data mining systems may be
tailored specifically for finance,
telecommunications, DNA, stock markets,
e-mail, and so on.
Different applications often require the
integration of application-specific methods.
Data Mining Task Primitives
A user wants to perform some form of data
analysis.
A data mining task can be specified in the
form of a data mining query.
A data mining query is defined in terms of
data mining task primitives.
These primitives allow the user to
interactively communicate with the data
mining system
Data Mining Task Primitives
The set of task-relevant data to be mined:
This specifies the portions of the database or the
set of data in which the user is interested.
• For example: the database attributes or data
warehouse dimensions of interest. It is also referred
to as the relevant attributes or dimensions.
The kind of knowledge to be mined:
This specifies the data mining functions to be
performed,
• Example characterization, discrimination, association
or correlation analysis, classification, prediction,
clustering, outlier analysis, or evolution analysis.
Data Mining Task Primitives
The background knowledge to be used in the
discovery process:
The domain knowledge is useful for guiding the
knowledge discovery process and for evaluating
the patterns found.
Concept hierarchies are a popular form of
background knowledge, which allow data to be
mined at multiple levels of abstraction.
Data Mining Task Primitives
The interestingness measures and
thresholds for pattern evaluation:
They may be used to guide the mining process
or,
• after discovery, to evaluate the discovered patterns.
The expected representation for visualizing
the discovered patterns:
This refers to the form in which discovered
patterns are to be displayed,
• Such as rules, tables, charts, graphs, decision trees,
and cubes.
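The five primitives can be pictured as the fields of a single query object. The class below is a hypothetical container invented for this illustration; it is not a real data mining query language, and every field name and value is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class DataMiningQuery:
    """Hypothetical container for the five task primitives."""
    task_relevant_data: list[str]   # attributes or dimensions of interest
    knowledge_kind: str             # e.g. "association", "classification"
    background_knowledge: dict[str, list[str]] = field(default_factory=dict)  # concept hierarchies
    interestingness: dict[str, float] = field(default_factory=dict)           # measures and thresholds
    presentation: str = "rules"     # how discovered patterns are displayed

query = DataMiningQuery(
    task_relevant_data=["age", "income", "purchases"],
    knowledge_kind="association",
    background_knowledge={"location": ["city", "state", "country"]},
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    presentation="table",
)
print(query)
```

Only the first two primitives are required in this sketch; the others fall back to empty or default values.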
Data Mining Task Primitives
Major Issues in Data Mining
Human Interaction:
As data mining problems are not precisely
stated, interfaces may be needed with both
domain and technical experts.
• Technical experts are needed to formulate the queries
and assist in interpreting the results.
• Users are needed to identify training data and desired
results.
Major Issues in Data Mining
Overfitting:
When a model is generated that is associated with a
given database state, it is desirable that the model also
fit future database states.
Overfitting occurs when the model does not fit future
states.
There are 2 reasons for overfitting:
• It may be caused by assumptions that are made about the data, or
• Because the training database is too small.
• For example, a classification model for an employee database
may be developed to classify employees as short, medium, or tall.
If the training database is quite small, the model might
erroneously indicate that a short person is anyone under five feet
eight inches because there is only one entry in the training
database under five feet eight. In this case, many future
employees would be erroneously classified as short.
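The height example can be reproduced in a few lines. The "true" cut-off of 64 inches, the naive learner, and all heights are assumptions made up for this sketch; only the five-feet-eight (68 inch) boundary comes from the example above.

```python
# Assumed "true" rule, used only to label data in this sketch:
# an employee is short when under 64 inches.
def true_label(height_in):
    return "short" if height_in < 64 else "not short"

def learn_short_limit(training):
    """Naive learner: everything below the shortest non-short example is 'short'."""
    return min(h for h, label in training if label != "short")

# Small training database with a single entry under five feet eight (68 inches).
training = [(62, "short"), (68, "not short"), (70, "not short"), (76, "not short")]
limit = learn_short_limit(training)

def predict(height_in):
    return "short" if height_in < limit else "not short"

training_errors = sum(predict(h) != label for h, label in training)

# Future employees: several fall between 64 and 68 inches.
future = [61, 65, 66, 67, 72]
future_errors = sum(predict(h) != true_label(h) for h in future)
print(limit, training_errors, future_errors)
```

The model matches the training database perfectly yet misclassifies three of the five future employees, which is the behaviour described as overfitting.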
Major Issues in Data Mining
Outliers:
There are often many data entries that do not fit
nicely into the derived model.
If a model is developed that includes these
outliers, then the model may not behave well for
data that are not outliers.
Interpretation of results:
The data mining output may require experts to
correctly interpret the results.
Major Issues in Data Mining
Visualization of results:
To easily view and understand the output of
data mining algorithms, visualization of the
results is helpful
Large datasets:
Most datasets are massive, while many
algorithms are designed for small datasets.
The running time of many modeling applications grows exponentially
with the dataset size, which makes them too inefficient
for larger datasets.
This is the scalability problem.
Major Issues in Data Mining
High dimensionality:
Not all attributes may be needed to solve a given
data mining problem.
• In fact, the use of some attributes may interfere with the
correct completion of a data mining task.
• Some of other attributes may simply increase the overall
complexity and decrease the efficiency of an algorithm.
This problem is referred to as the dimensionality
curse, meaning that there are many attributes
involved and it is difficult to determine which ones
should be used.
One solution to this high dimensionality problem is
to reduce the number of attributes, which is known
as dimensionality reduction. However, determining
which attributes are not needed is not always easy to
do.
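One simple filter for reducing the number of attributes is to drop those that barely vary across records. This variance filter is an assumption chosen for illustration, not a method named in the slides, and the records are invented.

```python
from statistics import pvariance

# Hypothetical records; "country_code" is constant and carries no information.
rows = [
    {"age": 25, "income": 30, "country_code": 1},
    {"age": 40, "income": 60, "country_code": 1},
    {"age": 55, "income": 90, "country_code": 1},
]

def select_attributes(rows, min_variance=0.0):
    """Keep attributes whose population variance exceeds min_variance."""
    return [
        name
        for name in rows[0]
        if pvariance([row[name] for row in rows]) > min_variance
    ]

kept = select_attributes(rows)
reduced = [{name: row[name] for name in kept} for row in rows]
print(kept)
```

A filter like this only removes uninformative attributes; deciding which of the remaining ones actually matter for the task is the hard part noted above.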
Major Issues in Data Mining
Multimedia data:
Most previous data mining algorithms are targeted
to traditional data types (numeric, character, text,
etc.).
The use of multimedia data or GIS databases
complicates or invalidates many proposed
algorithms.
Missing data:
During the pre-processing phase of KDD, missing
data may be replaced with estimates.
This and other approaches to handling missing data
can lead to invalid results in the data mining step.
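The sketch below replaces missing values with the mean of the known ones and shows one way this can distort later results: the spread of the data shrinks. Mean imputation is only one possible estimate, chosen here as an assumption, and the values are invented.

```python
from statistics import mean, pvariance

# Hypothetical attribute values; None marks missing data.
observed = [10.0, 20.0, None, 30.0, None]

known = [v for v in observed if v is not None]
fill = mean(known)  # estimate used for every missing entry
imputed = [fill if v is None else v for v in observed]

variance_known = pvariance(known)
variance_imputed = pvariance(imputed)
print(imputed, variance_known, variance_imputed)
```

Because every filled-in value sits exactly at the mean, the imputed data look less variable than the real data probably are.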
Major Issues in Data Mining
Irrelevant data:
Some attributes in the database might not be of
interest to the data mining task being
developed. How to identify?
Noisy data:
Some attribute values might be invalid or
incorrect.
These values must be corrected before
running data mining applications.
Major Issues in Data Mining
Changing data:
Databases cannot be assumed to be static.
However, most data mining algorithms do assume a
static database.
This requires that the algorithm be completely rerun
anytime the database changes.
Integration:
The KDD process is not currently integrated into normal
data processing activities.
KDD requests may be treated as special, unusual, or
one-time needs.
This makes them inefficient, ineffective, and not general
enough to be used on an ongoing basis.
Integration of data mining functions into traditional
DBMS systems is certainly a desirable goal.
Major Issues in Data Mining
Application:
Determining the intended use for the
information obtained from the data mining
function is a challenge.
Indeed, how business executives can effectively
use the output is sometimes considered the more
difficult part, not the running of the algorithms
themselves.
Because the data are of a type that has not
previously been known, business practices may
have to be modified to determine how to
effectively use the information uncovered.
Data mining metrics
Measuring the effectiveness or usefulness of a
data mining approach is not straightforward.
In fact, different metrics could be used for
different techniques and also based on the
interest level.
From an overall business or usefulness
perspective, a measure such as return on
investment (ROI) could be used.
ROI examines the difference between what the data
mining technique costs and what the savings or
benefits from its use are.
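A common way to express this difference is as a fraction of the cost. The formula below is the usual textbook definition of ROI rather than something spelled out in the slides, and the amounts are hypothetical.

```python
def roi(benefit, cost):
    """Return on investment as a fraction of cost: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost

# Hypothetical figures: the project cost 100,000 and produced 150,000 in savings.
campaign_roi = roi(benefit=150_000, cost=100_000)
print(campaign_roi)
```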
Data mining metrics
But the return is hard to quantify.
It could be measured as increased sales, reduced
advertising expenditure, or both.
In a specific advertising campaign implemented
via targeted catalogue mailings, the percentage
of catalogue recipients and the amount of
purchase per recipient would provide one means
to measure the effectiveness of the mailing.
In a classroom setting, classification accuracy
is the most commonly used metric.
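Classification accuracy is the share of objects whose predicted class matches the actual class. The labels below are invented for the example.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual class labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predictions compared with known labels.
predicted = ["short", "medium", "tall", "medium"]
actual = ["short", "medium", "medium", "medium"]
score = accuracy(predicted, actual)
print(score)
```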
