Unit - I
Introduction to Data Mining: Data Mining is the extraction of
interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts
of data [Fayyad].
Data Mining is the process of discovering interesting knowledge
from large amounts of data stored in databases, data warehouses,
or other information repositories [Han and Kamber].
Dr. Rakesh Agrawal is often called the Father of Data Mining.
Data Mining is the process of extracting previously unknown,
valid and actionable information from large data to make
crucial business and strategic decisions. [NET-2013]
TEXT BOOK: Data Mining : Concepts and Techniques - Jiawei Han
& Micheline Kamber, Morgan Kaufmann Publishers, Elsevier,
Second Edition, 2008.
Father of Data Mining: Dr. Rakesh Agrawal - Indian
Institute of Technology, Kanpur; Technical Fellow,
Microsoft Research
This unit covers the following topics:
1. What is Data Mining
2. Data Mining Functionalities
3. Classification of Data Mining System
4. Data Mining Task Primitives
5. Major Issues in Data Mining
6. Data Mining Applications
1. What is Data Mining: Data Mining is the extraction of
interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge
amounts of data.
This topic consists of
a) Definition of Data Mining
b) Motivation of Data Mining
c) Why is it Important
d) Evolution of Database Technology
e) Why Data Mining
f) KDD Process
g) Architecture of a typical Data Mining System
h) Data Mining on different kinds of Data
i) Data Mining Techniques
j) Top-10 Algorithms
k) Data Mining Tools
l) Application areas for data mining
a) Definition of Data Mining: Data Mining is the extraction of
interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
[Fayyad].
Data Mining is also known as Knowledge Discovery in
Databases (KDD), Decision Support System (DSS), Data
Archeology, Data Dredging, Information Harvesting, Data Analysis,
Pattern Analysis, Business Intelligence, etc.
KDD is the process of automatically searching large volumes of
data for patterns.
Data Mining applies many older computational techniques from
statistics, machine learning and pattern recognition.
b) Motivation of Data Mining:
Necessity is the mother of invention.
Data is rich, but information is poor.
Drowning in data, but starving for knowledge.
c) Why is it Important?
Data explosion
Lots of data is being collected and warehoused:
Web data and e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Remote sensors on a satellite
Telescopes scanning the skies
Microarrays generating gene expression data
Scientific simulations generating terabytes of data
d) Evolution of Database Technology
1960s: Data collection, database creation
1970s: Relational data model
1980s: RDBMS
1990s: Data Mining, Data Warehousing, Multimedia Databases
2000s: Stream data management and Web Databases
2010s: Integration of DM and DW with Web Mining and Web Databases
e) Why Data Mining?
Too much data and too little information.
There is a need to extract useful information from the data and to interpret the data:
To analyze the statistics of population data.
To analyze the statistics of stock market data (predicting future stock prices), e.g.,
recommendations on when to purchase stocks. Decision tree techniques (ID3 -
Iterative Dichotomiser 3, C4.5) can be used for this.
To analyze the statistics of bank transaction data (risk management, fraud detection,
credit card management, loan suggestions and predictions, etc.).
To analyze the statistics of weather data (based on geo location, height of areas,
numerical predictions, etc.).
To analyze the statistics of defense data.
To analyze the statistics of health care data (predictions on cancer, diabetes and cardiac data sets).
f) KDD (Knowledge Discovery in Databases) Process
KDD stands for Knowledge Discovery in Databases.
KDD is the process of automatically searching large volumes of
data for patterns.
KDD is the process of extracting meaningful knowledge from large
databases.
KDD is often used as another name for Data Mining or Decision Support
System, although Data Mining is strictly one step of the KDD process.
The Knowledge Discovery process includes data cleaning, data
integration, data selection, data transformation, data mining,
pattern evaluation, and knowledge presentation.
Fig: Knowledge Discovery (KDD) Process. Data Mining as a step in the
process of knowledge discovery: Databases -> Data Cleaning and Data
Integration -> Data Warehouse -> Selection -> Task-relevant Data ->
Data Mining -> Pattern Evaluation.
Steps of the Knowledge Discovery in Databases (KDD) Process
1. Data cleaning: To remove noise and inconsistent data.
2. Data integration: Where multiple data sources may be combined.
3. Data selection: Where data relevant to the analysis task are
retrieved from the database.
4. Data transformation: Where data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations, for instance.
5. Data mining: An essential process where intelligent methods are
applied in order to extract data patterns.
6. Pattern evaluation: To identify the truly interesting patterns
representing knowledge based on some interestingness
measures.
7. Knowledge presentation: Where visualization and knowledge
representation techniques are used to present the mined
knowledge to the user.
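The seven steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the two toy sources (store_a, store_b), the function names and the min_support threshold are assumptions made for the example and are not taken from the text book.

```python
from collections import Counter

# Hypothetical toy sources: (customer, item, amount) records.
store_a = [("c1", "milk", 2.0), ("c2", "bread", 1.5), ("c1", "bread", None)]
store_b = [("c3", "milk", 2.1), ("c2", "milk", 2.0), ("c3", "bread", 1.4)]

def clean(records):
    """Step 1: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r)]

def integrate(*sources):
    """Step 2: combine several sources into one collection."""
    return [r for src in sources for r in src]

def select(records):
    """Step 3: keep only the task-relevant fields (customer, item)."""
    return [(cust, item) for cust, item, _ in records]

def transform(pairs):
    """Step 4: consolidate the pairs into one item set per customer."""
    baskets = {}
    for cust, item in pairs:
        baskets.setdefault(cust, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5: count in how many baskets each item occurs."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support=0.6):
    """Step 6: keep items whose support reaches the (assumed) threshold."""
    return {i: c / n_baskets for i, c in counts.items() if c / n_baskets >= min_support}

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{item}: support={s:.2f}" for item, s in sorted(patterns.items())]

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
report = present(evaluate(mine(baskets), len(baskets)))
print(report)
```

Each function corresponds to one numbered step, so a step can be replaced (for example, a richer cleaning rule) without touching the others.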
g) Architecture of a typical Data Mining System
Data Mining is the process of discovering interesting
knowledge from large amounts of data stored in
databases, data warehouses, or other information
repositories.
Major components
1. Database, DW, WWW or other information repository.
2. Database or data warehouse server
3. Knowledge base
4. Data mining engine
5. Pattern evaluation module
6. User interface
Fig: Architecture of a typical Data Mining System (layers from top to
bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine,
with Knowledge Base; Database or Data Warehouse Server; data cleaning,
integration, and selection; Database, Data Warehouse, World-Wide Web,
Other Info Repositories)
The components of a typical Data Mining System are as follows:
1. Database, DWH, WWW or other information repository: The data
sources. Data cleaning and integration techniques may be performed on them.
2. Database or data warehouse server: Fetches the relevant data for the user's request.
3. Knowledge Base: Holds the domain knowledge (CHG, constraints,
metadata).
4. Data Mining Engine: Provides the mining functions (Association,
Classification).
5. Pattern Evaluation Module: Focuses the search on interesting patterns
only.
6. User Interface: Lets the user specify a Data Mining query/task and
visualize the patterns.
Fig: Data Mining: Confluence of Multiple Disciplines (Database
Technology, Statistics, Machine Learning, Visualization, Pattern
Recognition, Algorithm, Other Disciplines)
h) Data Mining on different kinds of Data
1. Relational databases: A collection of tables with unique names;
each table consists of a set of attributes and usually stores a large set of
tuples.
2. Data warehouses: A repository of information collected from
multiple sources and stored under a unified schema.
A data warehouse is constructed via a process of data cleaning, data
transformation, data integration, data loading, and periodic
refreshing. It is usually modeled by a multidimensional database
structure using data cubes.
Ex: How many sales transactions were there in December?
Which of my agents has the highest amount of sales?
3. Transactional databases: A file in which each record
represents a transaction that includes a transaction_id and a list of
items.
Transaction_id    List of items
T_100             I1, I3, I9, I16, 20
4. Object-Relational Databases: Each entity is considered
as an object. Variables represent the attributes, and the object
also involves operations such as get emp_id.
Ex: Consider a class Employee; then sales_person is a
subclass of the main class Employee (inheritance).
5. Temporal, Sequence and Time-Series Databases
Ex: weather patterns, stock exchange, sea level pressures,
banking, etc.
6. Spatial Databases and Spatiotemporal Databases
7. Text Databases and Multimedia Databases: Ex: web
pages, audio, video files, content-based retrieval, etc.
8. Heterogeneous Databases and Legacy Databases: These
combine data from relational databases,
network databases, hierarchical databases, multimedia
databases, etc.
9. Data streams: Ex: network traffic, video surveillance,
weather monitoring, etc.
10. WWW: Ex: web content mining, web page selections,
link mining, etc.
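The two example questions listed under data warehouses (item 2) are simple aggregations over sales facts. The sketch below is illustrative only: the sales records, the agent names and the (agent, month, amount) layout are hypothetical and are not part of the original material.

```python
from collections import defaultdict

# Hypothetical sales facts: (agent, month, amount).
sales = [
    ("asha", 12, 250.0),
    ("ravi", 11, 400.0),
    ("asha", 12, 100.0),
    ("ravi", 12, 300.0),
    ("meena", 10, 500.0),
]

# "How many sales transactions were there in December?"
december_count = sum(1 for _, month, _ in sales if month == 12)

# "Which of my agents has the highest amount of sales?"
totals = defaultdict(float)
for agent, _, amount in sales:
    totals[agent] += amount
top_agent = max(totals, key=totals.get)

print(december_count, top_agent, totals[top_agent])
```

A real data warehouse would answer such questions from a pre-aggregated data cube rather than by scanning every record.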
i) Data Mining Techniques
Association
Classification
Prediction
Clustering
Link Mining
Bagging and Boosting
Web Mining
Rough Sets
Graph Mining
Outlier Analysis
j) Top-10 Algorithms
1. C4.5: Programs for Machine Learning (61 votes)
2. K-Means (60 votes)
3. SVM (58 votes)
4. Apriori (52 votes)
5. EM: Finite Mixture Models & Association Analysis (48 votes)
6. PageRank (46 votes)
7. AdaBoost (45 votes)
8. kNN: K Nearest Neighbours (45 votes)
9. Naive Bayes (45 votes)
10. CART: Classification and Regression Trees (34 votes)
Data Mining Algorithms - Categories
Classification:
C4.5
CART (Classification and Regression Trees)
kNN (k-Nearest Neighbours)
Naive Bayes
Statistical Learning:
SVM - Support Vector Machine
EM - Expectation Maximization
Association Analysis:
Apriori
FP-Growth (Frequent Pattern Growth)
Link Analysis:
PageRank
HITS (Hyperlink-Induced Topic Search)
Clustering:
K-Means
BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies
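As an illustration of one algorithm from the clustering category, the following is a minimal sketch of K-Means on one-dimensional data. The function name kmeans_1d, the sample points and the fixed starting centroids are assumptions chosen so that the run is deterministic; practical implementations usually pick the starting centroids at random and stop when the assignments no longer change.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 3.0])
print(centroids, clusters)
```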
k) Data Mining Tools
Data mining is the process that involves exploration and analysis of large volumes of data
using a combination of machine learning, artificial intelligence, etc.
It is used for knowledge discovery from large volumes of data.
Data mining software (open source and commercial):
JHepWork
Weka 3.7.5
RapidMiner
R
GGobi
Orange
Clementine
Tanagra
Sipina
AlphaMiner
YALE
l) APPLICATION AREAS FOR DATA MINING
Financial Data Analysis
Data mining in the retail industry
Telecommunication Industry
Biological Data Analysis
Scientific Applications
Intrusion Detection
Pattern Recognition
Sales/Marketing
Banking / Finance
Health Care and Insurance
E-commerce
2. Data Mining Functionalities
Data Mining functionalities are used to specify
the kind of patterns to be found in DM tasks.
Data Mining functionalities include the discovery
of concept/class descriptions, associations and
correlations, classification, prediction, clustering,
trend analysis, outlier and deviation analysis, and
similarity analysis.
Characterization and Discrimination are forms of
data summarization.
DM tasks can be classified into two categories:
1. Descriptive Data Mining : Tasks characterize the general
properties of the data in the database. Find Human-
Interpretable patterns that describe the data.
2. Predictive Data Mining: Tasks perform inference on the
current data in order to make predictions. Use some variables
to predict unknown or future values of other variables.
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Data Mining functionalities, and the kinds of
patterns they can discover.
1. Concept/Class Description: Characterization and
Discrimination
2. Mining Frequent Patterns, Associations, and
Correlations
3. Classification and Prediction
4. Cluster Analysis
5. Outlier Analysis
6. Evolution Analysis
1. Concept/Class Description: Characterization and
Discrimination: Data characterization summarizes
the data of the class under study, often called the target
class. Data discrimination compares the target
class with one or a set of comparative classes, also called
contrasting classes.
2. Mining Frequent Patterns, Associations, and
Correlations: Frequent patterns are patterns that occur
frequently in data. A frequent itemset typically refers to a
set of items that frequently appear together in a
transactional data set. Association rules that contain a
single predicate are referred to as single-dimensional
association rules. Association rules that contain more
than one attribute or predicate are referred to as
multidimensional association rules.
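To make frequent itemsets and association rules concrete, the sketch below computes support and confidence on a small transactional data set. It enumerates candidate itemsets by brute force, which is not the Apriori algorithm; the transactions, function names and the 0.5 threshold are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical transactional data set: one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support, size):
    """Brute-force search for itemsets of a given size meeting min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for combo in combinations(items, size):
        s = support(combo, transactions)
        if s >= min_support:
            result[combo] = s
    return result

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

pairs = frequent_itemsets(transactions, min_support=0.5, size=2)
conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)
print(pairs, round(conf_bread_milk, 2))
```

The rule bread => milk is single-dimensional because it involves only one predicate (the items bought).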
Fig: Classification and Prediction
3. Classification and Prediction: Classification is the
process of finding a model or function that describes
and distinguishes data classes or concepts. Prediction is used to
estimate unknown or missing numerical values.
4. Cluster analysis: Groups objects having similar
properties, maximizing intra-class similarity and
minimizing inter-class similarity.
5. Outlier analysis: An outlier is a data object that does not comply with
the general behavior of the data.
6. Evolution analysis: It describes and models regularities
or trends for objects whose behavior
changes over time.
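As a small illustration of outlier analysis (item 5), the sketch below flags values that lie far from the mean in units of standard deviation. The z-score rule, the threshold of 2.0 and the sample amounts are assumptions chosen for the example; they are only one of many possible ways to define "not complying with the general behavior of the data".

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one unusual value.
amounts = [20, 22, 19, 21, 23, 20, 250]
outliers = zscore_outliers(amounts)
print(outliers)
```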
3. Classification of Data Mining System
A Data Mining System is a systematic approach to
collecting, organizing, processing, analyzing and
visualizing datasets.
Data Mining is an interdisciplinary field, the confluence of
a set of disciplines, including database systems, statistics,
machine learning, visualization, and information science.
Data Mining System can be classified according to the
kinds of databases mined, the kinds of knowledge mined,
the techniques used, and the applications adapted.
DM tasks can be classified into two categories:
1. Descriptive data mining : Tasks characterize the general
properties of the data in the database. Find human-
interpretable patterns that describe the data.
Ex: Clustering, Association rule discovery and Sequence
pattern discovery.
2. Predictive data mining: Tasks perform inference on the
current data in order to make predictions. Use some
variables to predict unknown or future values of other
variables.
Ex: Classification, Regression and Deviation detection.
Fig: Data Mining: Confluence of Multiple Disciplines (Database
Technology, Statistics, Machine Learning, Visualization, Pattern
Recognition, Algorithm, Other Disciplines)
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Fig: Classification of Data Mining System (according to the kinds of
databases mined; the kinds of knowledge mined; the kinds of techniques
utilized; the applications adapted)
The classification criteria are:
1. Classification according to the kinds of
databases mined: A data mining system can be classified
according to the kinds of databases mined.
E.g.: Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW.
2. Classification according to the kinds of
knowledge mined: Data mining systems can be categorized
according to the kinds of knowledge they mine, that is, based on
data mining functionalities.
E.g.: Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis.
3. Classification according to the kinds of techniques
utilized: Data mining systems can be categorized
according to the underlying data mining techniques
employed.
E.g.: Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization.
4. Classification according to the applications adapted:
Data mining systems can also be categorized according to
the applications they are adapted to.
E.g.: Retail, telecommunication, banking, fraud analysis,
bio-data mining, stock market analysis, text mining, Web
mining.
4. Data Mining Task Primitives
Data Mining task can be specified in the form of a data mining
query, which is input to the data mining system.
Data Mining query is defined in terms of data mining task primitives.
Data Mining Query Language can be designed to incorporate these
primitives, allowing users to flexibly interact with data mining
systems.
These primitives specify a data mining task in the form of a
data mining query.
The primitives are of 5 types:
1. Task-relevant data,
2. Knowledge type to be mined,
3. Background knowledge,
4. Pattern interestingness measures,
5. Visualization of discovered patterns.
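One way to picture the five primitives is as the fields of a single query object. The sketch below is purely illustrative: the MiningQuery class, its field names and the sample values (sales_db, the table names, the thresholds) are hypothetical and do not correspond to any particular Data Mining Query Language.

```python
from dataclasses import dataclass, field

@dataclass
class MiningQuery:
    """A data mining query described by the five task primitives."""
    task_relevant_data: dict                                   # primitive 1
    knowledge_type: str                                        # primitive 2
    background_knowledge: dict = field(default_factory=dict)   # primitive 3
    interestingness: dict = field(default_factory=dict)        # primitive 4
    presentation: list = field(default_factory=list)           # primitive 5

    def missing_primitives(self):
        """Names of the primitives the user has left empty."""
        return [name for name, value in vars(self).items() if not value]

# Hypothetical query: mine association rules from December purchases.
query = MiningQuery(
    task_relevant_data={
        "database": "sales_db",
        "tables": ["customer", "purchases"],
        "condition": "month = 12",
    },
    knowledge_type="association",
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    presentation=["rules", "table"],
)
print(query.missing_primitives())
```

Here the user supplied no background knowledge, so a system built along these lines could fall back to defaults or ask for a concept hierarchy.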
Fig: Types of Data Mining Task Primitives
Fig: Data Mining Task Primitives (task-relevant data; knowledge type to be
mined; background knowledge; pattern interestingness measures;
visualization of discovered patterns)
Fig: Primitives for specifying a Data Mining Task
1. The set of task-relevant data to be mined: This specifies the
portions of the database or the set of data in which the user is
interested.
Task-relevant data:
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
2. The kind of knowledge to be mined: This specifies the data
mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification,
prediction, clustering, outlier analysis, or evolution analysis.
Characterization
Discrimination
Association
Classification/prediction
Clustering
Outlier analysis and other data mining tasks
3. The background knowledge to be used in the discovery process:
This knowledge about the domain to be mined is useful for guiding
the knowledge discovery process and for evaluating the patterns
found.
Concept hierarchies:
Schema hierarchy
E.g: street < city < province_or_state < country
Set-grouping hierarchy
E.g: {20-39} = young, {40-59} = middle_aged
Operation-derived hierarchy
E.g: email address: login-name < department < university < country
Rule-based hierarchy
E.g: low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
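The four kinds of concept hierarchy can each be expressed as a small mapping or function. The following is a sketch under stated assumptions: the function names and the sample email address are hypothetical, and the email example simply assumes the address has the form login@department.university.country.

```python
# Schema hierarchy: an ordering of attributes from low to high level.
SCHEMA_HIERARCHY = ["street", "city", "province_or_state", "country"]

def roll_up(level):
    """Return the next more general level, or None at the top."""
    i = SCHEMA_HIERARCHY.index(level)
    return SCHEMA_HIERARCHY[i + 1] if i + 1 < len(SCHEMA_HIERARCHY) else None

def age_group(age):
    """Set-grouping hierarchy: map an age to a named group."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return None  # ages outside the two groups given in the text

def email_levels(address):
    """Operation-derived hierarchy: split an address into
    login-name < department < university < country."""
    login, domain = address.split("@")
    return [login] + domain.split(".")

def low_profit_margin(price, cost):
    """Rule-based hierarchy: low margin when (price - cost) < 50."""
    return (price - cost) < 50

print(roll_up("city"), age_group(45))
print(email_levels("student@cs.univ.in"), low_profit_margin(120, 90))
```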
4. The interestingness measures and thresholds for pattern
evaluation: They may be used to guide the mining process or,
after discovery, to evaluate the discovered patterns.
Simplicity: E.g: (association) rule length, (decision) tree size
Certainty: E.g: confidence, P(A|B) = #(A and B) / #(B),
classification reliability or accuracy, certainty factor, rule strength,
rule quality, discriminating weight, etc.
Utility: Potential usefulness, e.g., support (association), noise
threshold (description)
Novelty: Not previously known, surprising (used to remove
redundant rules, e.g., Illinois vs. Champaign rule implication
support ratio).
5. The expected representation for visualizing the discovered
patterns: This refers to the form in which discovered patterns are to
be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.
Different backgrounds/usages may require different forms of
representation
E.g: rules, tables, crosstabs, pie/bar charts, etc.
Concept hierarchy is also important
Discovered knowledge might be more understandable when
represented at a high level of abstraction
Interactive drill up/down, pivoting, slicing and dicing provide
different perspectives on the data
Different kinds of knowledge require different representations:
association, classification, clustering, etc.
5. Major issues in Data Mining
The major issues in Data Mining are the challenges or problems the field faces.
They concern mining
methodology, user interaction, performance, diverse
data types and social impacts.
Other issues include the exploration of data mining
applications and their social impacts.
Data preparation or preprocessing is a big issue for both
data mining and data warehousing.
These issues are considered major requirements and
challenges for the further evolution of data mining
technology.
The major issues consist of
1. Mining methodology
2. User interaction issues
3. Performance issues
4. Issues relating to the diversity of database types
5. Applications and social impacts
Fig: Major issues in Data Mining (Mining Methodology; User Interaction;
Performance; Diversity of DB types; Applications and Social impacts)
1. Mining Methodology: This concerns the kind of
methodology (Association, Classification and Clustering) used
for extracting knowledge.
These issues reflect the kinds of knowledge mined, the ability to mine
knowledge at multiple granularities, the use of domain knowledge,
ad-hoc mining, and knowledge visualization.
- Mining different kinds of knowledge from diverse data types,
e.g., bio, stream, Web
- Performance: efficiency, effectiveness, and scalability
- Pattern evaluation: the interestingness problem
- Incorporation of background knowledge
- Handling noise and incomplete data
- Parallel, distributed and incremental mining methods
- Integration of the discovered knowledge with existing knowledge:
knowledge fusion
2. User interaction issues: This concerns how easily the user can
interact with the Data Mining system.
- Data mining query languages and ad-hoc mining
- Expression and visualization of data mining results
- Interactive mining of knowledge at multiple levels of
abstraction
3. Performance issues: These include efficiency, scalability, and
parallelization of data mining algorithms. The performance levels
are good, average and bad.
- Efficiency and scalability of data mining algorithms
- Parallel, distributed, and incremental mining algorithms
4. Issues relating to the diversity of database types:
- Handling of relational and complex types of data, such as
space and radar data
- Mining information from heterogeneous databases and
global information systems
5. Applications and social impacts: This concerns the security and
privacy levels in a Data Mining system.
- Domain-specific data mining & invisible data mining
- Protection of data security, integrity, and privacy
6. Data Mining Applications
Data Mining is an interdisciplinary field with wide and
diverse applications.
Many customized data mining tools have been developed
for domain-specific applications, including finance, the
retail industry, telecommunications, bioinformatics,
intrusion detection, and other science, engineering, and
government data analysis.
Some application domains
1. Financial data analysis
2. Retail industry
3. Telecommunication industry
4. Biological data analysis
5. Data Mining in other Scientific Applications
6. Data Mining for Intrusion Detection
1. Data Mining for Financial Data Analysis: Financial data
collected in banks and financial institutions are often relatively
complete, reliable, and of high quality. Applications include loan
payment prediction and customer credit policy analysis, classification
and clustering of customers for targeted marketing, and detection of
money laundering and other financial crimes.
Examples: Cyber Crime, Internet Crimes
2. Data Mining for Retail Industry: The retail industry collects huge
amounts of data on sales, customer shopping history, etc.
Applications of retail data mining are to identify customer buying
behaviors, discover customer shopping patterns and trends, and
improve the quality of customer service. It also supports
multidimensional analysis of sales, customers, products, time, and
region.
Examples: Supermarket, Sales and Shopping mall.
3. Data Mining for Telecommunication Industry: This is a rapidly
expanding and highly competitive industry with a great demand for
data mining. Telecommunication data is
intrinsically multidimensional, i.e., calling time, duration, location of
caller, location of callee, type of call, etc.
Examples: Fraudulent pattern analysis and the identification of
unusual patterns.
4. Data Mining for Biomedical Data Analysis: DNA sequences are built from 4
basic building blocks (nucleotides): adenine (A), cytosine (C), guanine
(G), and thymine (T). A gene is a sequence of hundreds of individual
nucleotides arranged in a particular order. Humans have around
30,000 genes.
Examples: Semantic integration of heterogeneous, distributed
genomic and proteomic databases. Discovery of structural patterns
and analysis of genetic networks and protein pathways.
5. Data Mining in Other Scientific Applications
Satellite Imagery, GIS, Rocket Launching.
Graph-based mining.
Visualization tools and domain-specific knowledge.
6. Data Mining for Intrusion Detection
Development of data mining algorithms for intrusion detection.
Association and correlation analysis, and aggregation to help
select and build discriminating attributes.
Analysis of stream data.
Distributed data mining
Visualization and querying tools.