Chapter 1 Data Mining Lecture Note
Chapter One
Introduction
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data: Business, science and society
The computing power is available and is affordable
DM commercial products and machine learning algorithms are available
The competitive pressure is very strong
• How to gain competitive advantage?
• How to control the volatile market?
• How to satisfy customers' (prosumers') needs?
“Necessity is the mother of invention”: data mining is the automated analysis of
massive data sets. We are drowning in data, but starving for knowledge!
We are data rich, but information poor.
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
What is not data mining?
Simple search and query processing
Expert systems or small statistical programs
Knowledge Discovery (KDD) Process
[Figure: the KDD process. Labels recoverable from the diagram: Databases, Data Cleaning, Data Integration.]
KDD Process: Several Key Steps
Learning the application domain
Relevant prior knowledge and goals of application
Identifying a target data set
Data processing
Data cleaning (remove noise and inconsistent data)
Data integration (multiple data sources may be combined)
Data selection (data relevant to the analysis task are retrieved from database)
Data transformation (consolidated into forms appropriate for mining)
Data mining (an essential process where intelligent methods are applied to extract
data patterns)
Pattern evaluation (identify the truly interesting patterns)
Knowledge presentation (mined knowledge is presented to the user with
representation techniques)
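The steps above can be sketched as a small pipeline. This is a minimal illustration only: the two data sources, the record fields, and the "pattern" (a simple average) are hypothetical stand-ins, not part of the lecture material.

```python
# Hypothetical toy data: two sources holding purchase records.
store_a = [
    {"customer": "c1", "item": "Bread", "amount": 25.0},
    {"customer": "c2", "item": "Milk", "amount": None},    # missing value (noise)
    {"customer": "c3", "item": "Bread", "amount": 30.0},
]
store_b = [
    {"customer": "c4", "item": "bread ", "amount": 20.0},  # inconsistent spelling
    {"customer": "c5", "item": "Butter", "amount": 45.0},
]

def clean(records):
    """Data cleaning: drop records with missing amounts, normalize item names."""
    return [
        {**r, "item": r["item"].strip().title()}
        for r in records
        if r["amount"] is not None
    ]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, item):
    """Data selection: keep only the records relevant to the analysis task."""
    return [r for r in records if r["item"] == item]

def transform(records):
    """Data transformation: reduce records to the numeric form the mining step needs."""
    return [r["amount"] for r in records]

def mine(amounts):
    """Data mining (stand-in): summarize the amounts as a very simple 'pattern'."""
    return {"count": len(amounts), "mean": sum(amounts) / len(amounts)}

def evaluate(pattern, min_count=2):
    """Pattern evaluation: keep the pattern only if enough records support it."""
    return pattern if pattern["count"] >= min_count else None

integrated = integrate(clean(store_a), clean(store_b))
pattern = evaluate(mine(transform(select(integrated, "Bread"))))
print(pattern)  # knowledge presentation, here just printing the result
```

In a real system each step is far more involved; the point is only that the output of one step feeds the next.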
Database Processing vs. Data Mining Processing
Query Examples
Database
Find all credit applicants with first name ‘Alex’.
Identify customers who have purchased more than Birr
10,000 worth of goods in the last month.
Find all customers who have purchased Bread.
Data Mining
Find all credit applicants who have no credit risks.
(classification)
Identify customers with similar buying habits.
(Clustering)
Find all items which are frequently purchased with
Bread. (association rules)
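The contrast between the two kinds of question can be sketched as follows. The table layout, customer names, and baskets are hypothetical; the database query uses Python's built-in sqlite3 module, and the mining side is reduced to simple co-occurrence counting rather than a full association-rule algorithm.

```python
import sqlite3
from collections import Counter

# Hypothetical purchase records: (basket_id, customer, item).
rows = [
    (1, "Abebe", "Bread"), (1, "Abebe", "Butter"),
    (2, "Sara", "Bread"), (2, "Sara", "Butter"), (2, "Sara", "Milk"),
    (3, "Alex", "Milk"),
    (4, "Hana", "Bread"), (4, "Hana", "Jam"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (basket_id INTEGER, customer TEXT, item TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)

# Database query: the question fully specifies the answer.
bread_buyers = [
    r[0]
    for r in conn.execute(
        "SELECT DISTINCT customer FROM purchases "
        "WHERE item = 'Bread' ORDER BY customer"
    )
]
conn.close()

# Data mining flavour: which items tend to appear in the same basket as Bread?
baskets = {}
for basket_id, _customer, item in rows:
    baskets.setdefault(basket_id, set()).add(item)

co_occurrence = Counter(
    item
    for items in baskets.values() if "Bread" in items
    for item in items if item != "Bread"
)

print(bread_buyers)
print(co_occurrence.most_common(1))
```

The first result is a lookup; the second is a pattern that was not stored anywhere in the table and has to be derived from it.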
Data Mining Functionalities
What kind of patterns can be mined?
Data mining functionalities are used to specify the kind
of patterns to be found in data mining tasks.
In general, data mining tasks can be classified into two
categories: descriptive and predictive.
Descriptive mining tasks characterize the general
properties of the data in the database.
Predictive mining tasks perform inference on the
current data in order to make predictions.
Users may have no idea regarding what kinds of
patterns in their data may be interesting, and hence may
like to search for several different kinds of patterns in
parallel.
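A small sketch of the two categories, assuming hypothetical monthly sales figures. The descriptive part summarizes the data that is there; the predictive part fits a least-squares line (one possible predictive method among many) and uses it to estimate an unseen month.

```python
from statistics import mean

# Hypothetical data: sales for months 1..5.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

# Descriptive task: characterize general properties of the existing data.
summary = {"min": min(sales), "max": max(sales), "mean": mean(sales)}

# Predictive task: infer a model from current data, then predict a new value.
mx, my = mean(months), mean(sales)
slope = (
    sum((x - mx) * (y - my) for x, y in zip(months, sales))
    / sum((x - mx) ** 2 for x in months)
)
intercept = my - slope * mx
forecast = intercept + slope * 6  # estimate for month 6, which is not in the data

print(summary)
print(forecast)
```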
Data Mining Functionalities
Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in your
Walmart?
Association, correlation vs. causality
A typical association rule
Diaper → Beer [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large
datasets?
How to use such patterns for classification, clustering, and
other applications?
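Support and confidence for a rule such as Diaper → Beer can be computed directly from a list of transactions. The five transactions below are hypothetical and much smaller than a real dataset. Lift is added here as one common correlation measure; it is an assumption of this sketch and is not defined in these notes.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"Diaper", "Beer", "Milk"},
    {"Diaper", "Beer"},
    {"Diaper", "Bread"},
    {"Beer", "Chips"},
    {"Diaper", "Beer", "Bread"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's support; values below 1 suggest negative correlation."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule_support = support({"Diaper", "Beer"}, transactions)
rule_confidence = confidence({"Diaper"}, {"Beer"}, transactions)
rule_lift = lift({"Diaper"}, {"Beer"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

In this toy data the rule has 75% confidence, yet its lift is slightly below 1 because Beer is already in 80% of all transactions. This is one way a strongly associated pair can fail to be positively correlated.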
Data Mining Functionalities
Classification and Prediction
Classification
The process of finding a model that describes and distinguishes the
data classes or concepts, for the purpose of being able to use the model
to predict the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of training data
(data objects whose class label is known).
Prediction
Predict missing or unavailable numerical data values
Typical methods
Decision trees, naïve Bayesian classification, support vector machines, neural
networks, rule-based classification, pattern-based classification, logistic
regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars, diseases,
web pages, …
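The idea of learning from labelled training data and then predicting the class of an unlabelled object can be sketched with a 1-nearest-neighbour classifier. This method is chosen here only because it is short; it is not one of the typical methods listed above. The features (income and debt) and all values are hypothetical.

```python
import math

# Hypothetical training data: (monthly income, existing debt), both in
# thousands of Birr, paired with a known class label.
training = [
    ((30.0, 2.0), "low risk"),
    ((28.0, 1.0), "low risk"),
    ((25.0, 3.0), "low risk"),
    ((8.0, 9.0), "high risk"),
    ((10.0, 12.0), "high risk"),
    ((6.0, 7.0), "high risk"),
]

def predict(features, training):
    """Return the label of the closest training object (1-nearest-neighbour)."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], features))
    return label

# Objects whose class label is unknown.
applicant_a = (27.0, 2.0)
applicant_b = (9.0, 10.0)
print(predict(applicant_a, training))
print(predict(applicant_b, training))
```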
Data Mining Functionalities
Cluster Analysis and Outlier Analysis
Cluster Analysis
Unsupervised learning (i.e., the class label is unknown)
Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
Principle: Maximizing intra-class similarity and minimizing
inter-class similarity
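A minimal k-means sketch illustrates grouping unlabelled objects. k-means is one possible clustering method, chosen here as an assumption; the house coordinates and the fixed starting centroids are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical house locations (x, y); no class labels are given.
houses = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, clusters = kmeans(houses, centroids=[houses[0], houses[3]])
print(centroids)
print(clusters)
```

Points that end up in the same cluster are close to each other and far from the other cluster, which is the similarity principle stated above.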
Outlier analysis
Outlier: A data object that does not comply with the general behavior
of the data
Noise or exception? ― One person’s garbage could be another
person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
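As a rough sketch of outlier detection, the example below flags values that lie far from the mean in units of standard deviation (a z-score test). The method, the threshold of 2.0, and the transaction amounts are all assumptions made for illustration; the notes do not prescribe a specific technique.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far from the general behavior.
amounts = [120.0, 95.0, 110.0, 105.0, 130.0, 100.0, 115.0, 5000.0]
outliers = z_score_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, possible fraud) depends on the application.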
Are All of the Patterns Interesting?
Data mining may generate thousands of patterns: Not all of them
are interesting
A pattern is interesting if it is
Are All of the Patterns Interesting?
Objective measures
Based on statistics and structures of patterns, e.g., support,
confidence, etc. (Rules that do not satisfy a threshold are
considered uninteresting.)
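Threshold-based filtering with objective measures can be sketched as below. The Diaper → Beer figures are the ones quoted earlier in these notes; the other two rules and both threshold values are hypothetical.

```python
# Candidate rules with their objective measures.
rules = [
    {"rule": "Diaper -> Beer", "support": 0.005, "confidence": 0.75},
    {"rule": "Bread -> Butter", "support": 0.020, "confidence": 0.40},  # hypothetical
    {"rule": "Milk -> Jam", "support": 0.001, "confidence": 0.90},      # hypothetical
]

MIN_SUPPORT = 0.004     # assumed threshold
MIN_CONFIDENCE = 0.60   # assumed threshold

# A rule is kept only if it satisfies both thresholds.
interesting = [
    r["rule"]
    for r in rules
    if r["support"] >= MIN_SUPPORT and r["confidence"] >= MIN_CONFIDENCE
]
print(interesting)
```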
Subjective measures
Data Mining: Confluence of Multiple Disciplines
Statistics studies the collection, analysis, interpretation or
explanation, and presentation of data.
Machine learning investigates how computers can learn (or
improve their performance) based on data. A main research area is
for computer programs to automatically learn to recognize complex
patterns and make intelligent decisions based on data.
Database systems and data warehouses: database
systems research focuses on the creation, maintenance, and use
of databases for organizations and end-users. A data warehouse
integrates data originating from multiple sources and various
timeframes.
Information retrieval (IR) is the science of searching for
documents or information in documents. Documents can be text
or multimedia, and may reside on the Web.
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
Major Issues in Data Mining
The major issues in data mining can be classified into five groups:
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
Data Mining System Classification
A data mining system can be classified according to
the following criteria:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Apart from these, a data mining system can also be
classified based on the kind of (a) databases mined,
(b) knowledge mined, (c) techniques utilized, and (d)
applications adapted.
Data Mining System Classification
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of
databases mined. Database systems can be classified according
to different criteria such as data models, types of data, etc.
And the data mining system can be classified accordingly.
For example, if we classify a database according to the data
model, then we may have a relational, transactional, object-
relational, or data warehouse mining system.
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of
techniques used. We can describe these techniques according
to the degree of user interaction involved or the methods of
analysis employed.
Data Mining System Classification
Classification Based on the Kind of Knowledge Mined
We can classify a data mining system according to the kind of
knowledge mined. It means the data mining system is
classified on the basis of functionalities such as:
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Data Mining System Classification
Classification Based on the Applications Adapted
We can classify a data mining system according to
the applications adapted. These applications are as
follows:
Finance
Telecommunications
DNA
Stock Markets
E-mail
Architecture of Data Mining
A typical data mining system may have the following major components
Architecture of Data Mining
Knowledge Base:
This is the domain knowledge that is used to guide the search
or evaluate the interestingness of resulting patterns.
Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different
levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a
pattern’s interestingness based on its unexpectedness, may
also be included.
Other examples of domain knowledge are additional
interestingness constraints or thresholds, and metadata (e.g.,
describing data from multiple heterogeneous sources).
Architecture of Data Mining
Data Mining Engine:
This is essential to the data mining system and ideally consists
of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.
Pattern Evaluation Module:
This component typically employs interestingness measures
and interacts with the data mining modules so as to focus the
search toward interesting patterns.
It may use interestingness thresholds to filter out discovered
patterns. Alternatively, the pattern evaluation module may be
integrated with the mining module, depending on the
implementation of the data mining method used.
Architecture of Data Mining
User interface:
This module communicates between users and the data
mining system, allowing the user to interact with the system
by specifying a data mining query or task, providing
information to help focus the search, and performing
exploratory data mining based on the intermediate data
mining results.
In addition, this component allows the user to browse
database and data warehouse schemas or data structures,
evaluate mined patterns, and visualize the patterns in different
forms.