DWDM - Unit - II


Unit - I

 Introduction to Data Mining: Data mining is the extraction of
interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts
of data [Fayyad].

 It is also defined as the process of discovering interesting
knowledge from large amounts of data stored in databases, data
warehouses, or other information repositories [Han and Kamber].

 Dr. Rakesh Agrawal is regarded as the father of data mining.

 Data mining is the process of extracting previously unknown,
valid and actionable information from large data sets to make
crucial business and strategic decisions [NET-2013].
TEXT BOOK: Data Mining : Concepts and Techniques - Jiawei Han
& Micheline Kamber, Morgan Kaufmann Publishers, Elsevier,
Second Edition, 2008.
Father of Data Mining: Dr. Rakesh Agrawal (Indian Institute of
Technology, Kanpur; Technical Fellow, Microsoft Research)
This unit on Data Mining covers
1. What is Data Mining
2. Data Mining Functionalities
3. Classification of Data Mining System
4. Data Mining Task Primitives
5. Major Issues in Data Mining
6. Data Mining Applications
1. What is Data Mining: Data mining is the extraction of
interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge
amounts of data.

 This topic consists of
a) Definition of Data Mining
b) Motivation of Data Mining
c) Why is it Important
d) Evolution of Database Technology
e) Why Data Mining
f) KDD Process
g) Architecture of a typical Data Mining System
h) Data Mining on different kinds of Data
i) Data Mining Techniques
j) Top-10 Algorithms
k) Data Mining Tools
l) Application areas for data mining
 a) Definition of Data Mining: Data mining is the extraction of
interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
[Fayyad].

 Data mining is also known as Knowledge Discovery in Databases
(KDD), Decision Support System (DSS), Data Archaeology, Data
Dredging, Information Harvesting, Data Analysis, Pattern Analysis,
Business Intelligence, etc.

 KDD is the process of automatically searching large volumes of
data for patterns.

 Data mining applies many older computational techniques from
statistics, machine learning and pattern recognition.
 b) Motivation of Data Mining:
Necessity is the mother of invention.
Data is rich, but information is poor.
We are drowning in data, but starving for knowledge.

 c) Why is it Important?
Data explosion
Lots of data is being collected and warehoused
Web data and e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Remote sensors on a satellite
Telescopes scanning the skies
Microarrays generating gene expression data
Scientific simulations generating terabytes of data
d) Evolution of Database Technology

 1960s: Data collection, database creation

 1970s: Relational data model

 1980s: RDBMS

 1990s: Data mining, data warehousing, multimedia databases

 2000s: Stream data management and Web databases

 2010s: Integration of data mining and data warehousing with Web mining/databases
e) Why Data Mining?

 Too much data and too little information.

 There is a need to extract useful information from the data and to interpret the data.

 To analyze the statistics of population data.

 To analyze the statistics of stock market data (predicting future stock prices), e.g.,
recommendations on when to purchase stocks. Decision tree techniques such as ID3
(Iterative Dichotomiser 3) and C4.5 are used.

 To analyze the statistics of bank transaction data (risk management, fraud detection,
credit card management, loan suggestions and predictions, etc.).

 To analyze the statistics of weather data (based on geolocation, elevation of areas,
numerical predictions, etc.).

 To analyze the statistics of defense data.

 To analyze the statistics of health care data (predictions on cancer, diabetes and cardiac data sets).
f) KDD (Knowledge Discovery in Databases) Process
 KDD stands for Knowledge Discovery in Databases.

 KDD is the process of automatically searching large volumes of
data for patterns.

 KDD is the process of extracting meaningful knowledge from large
databases.

 KDD is also loosely called Data Mining or Decision Support System,
although data mining is strictly one step of the KDD process.

 The knowledge discovery process includes data cleaning, data
integration, data selection, data transformation, data mining,
pattern evaluation, and knowledge presentation.
 Data mining is one step in the process of knowledge discovery.

Fig: Knowledge Discovery (KDD) Process (stages shown, bottom to top:
Databases; Data Cleaning and Data Integration; Data Warehouse;
Selection; Task-relevant Data; Data Mining; Pattern Evaluation)


Steps of the Knowledge Discovery in Databases (KDD) Process
1. Data cleaning: To remove noise and inconsistent data.
2. Data integration: Where multiple data sources may be combined.
3. Data selection: Where data relevant to the analysis task are
retrieved from the database.
4. Data transformation: Where data are transformed or
consolidated into forms appropriate for mining, for instance by
performing summary or aggregation operations.
5. Data mining: An essential process where intelligent methods are
applied in order to extract data patterns.
6. Pattern evaluation: To identify the truly interesting patterns
representing knowledge, based on some interestingness
measures.
7. Knowledge presentation: Where visualization and knowledge
representation techniques are used to present the mined
knowledge to the user.
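
The seven steps above can be sketched as a small pipeline. This is only an illustrative sketch, not a standard implementation: the two toy sources (store_a, store_b), the function names, and the 0.5 support threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy sources; real ones would be databases or files.
store_a = [
    {"tid": "T1", "items": ["milk", "bread", None]},
    {"tid": "T2", "items": ["milk", "butter"]},
]
store_b = [
    {"tid": "T3", "items": ["bread", "milk", "milk"]},
    {"tid": "T4", "items": []},
]

def clean(records):
    """Step 1 (data cleaning): drop missing values (None) from each record."""
    return [{**r, "items": [i for i in r["items"] if i is not None]} for r in records]

def integrate(*sources):
    """Step 2 (data integration): combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records):
    """Step 3 (data selection): keep only task-relevant records (non-empty baskets)."""
    return [r for r in records if r["items"]]

def transform(records):
    """Step 4 (data transformation): reduce each basket to a sorted tuple of distinct items."""
    return [tuple(sorted(set(r["items"]))) for r in records]

def mine(baskets):
    """Step 5 (data mining): count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Step 6 (pattern evaluation): keep pairs whose support reaches the threshold."""
    return {p: c / n_baskets for p, c in pair_counts.items() if c / n_baskets >= min_support}

def present(patterns):
    """Step 7 (knowledge presentation): render the patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

def run_kdd():
    baskets = transform(select(integrate(clean(store_a), clean(store_b))))
    return present(evaluate(mine(baskets), len(baskets)))

print(run_kdd())
```

Each function stands in for one KDD step, so a step can be replaced (for example, a different mining method in mine) without touching the others.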
g) Architecture of a typical Data Mining System
 Data Mining is the process of discovering interesting
knowledge from large amounts of data stored in
databases, data warehouses, or other information
repositories.

 Major components
1. Database, DW, WWW or other information repository.
2. Database or data warehouse server
3. Knowledge base
4. Data mining engine
5. Pattern evaluation module
6. User interface
Fig: Architecture of a typical Data Mining System (layers, top to
bottom: Graphical User Interface; Pattern Evaluation; Data Mining
Engine, with the Knowledge Base alongside; Database or Data Warehouse
Server; data cleaning, integration, and selection; Database, Data
Warehouse, World-Wide Web, Other Info Repositories)
 The components of a typical data mining system architecture are as follows:

1. Database, data warehouse, WWW or other information repository: The
data sources, on which data cleaning and integration techniques may
be performed.

2. Database or data warehouse server: Fetches the data requested by users.

3. Knowledge base: Holds domain knowledge (concept hierarchies,
constraints, metadata).

4. Data mining engine: Provides the mining functions (e.g., association,
classification).

5. Pattern evaluation module: Restricts the search to interesting
patterns only.

6. User interface: Lets the user specify a data mining query or task
and visualize the resulting patterns.
Fig: Data Mining: Confluence of Multiple Disciplines (Database
Technology, Statistics, Machine Learning, Visualization, Pattern
Recognition, Algorithm, Other Disciplines)

h) Data Mining on different kinds of Data

1. Relational databases: A collection of tables with unique names,
each table consisting of a set of attributes and usually storing a
large set of tuples.

2. Data warehouses: A repository of information collected from
multiple sources and stored under a unified schema.
 A data warehouse is constructed via a process of data cleaning, data
transformation, data integration, data loading, and periodic
refreshing. It is usually modeled by a multidimensional database
structure using data cubes.

 Ex: How many sales transactions were there in December?
Which of my agents has the highest amount of sales?
3. Transactional databases: A file in which each record is
stored as a transaction that includes a transaction_id and a
list of items.

Transaction_id    List of items
T_100             I1, I3, I9, I16, 20

4. Object-Relational Databases: Each entity is considered
as an object. Variables represent the attributes, and the
object also involves operations such as get emp_id.
 Ex: Consider a class Employee; sales_person is then a
subclass of the main class Employee (inheritance).
5. Temporal, Sequence and Time-Series Databases
Ex: weather patterns, stock exchange, sea level pressures,
banking, etc.
6. Spatial Databases and Spatiotemporal Databases

7. Text Databases and Multimedia Databases: Ex: web
pages, audio, video files, content-based retrieval, etc.
8. Heterogeneous Databases and Legacy Databases: These
involve combining data from relational databases,
network databases, hierarchical databases, multimedia
databases, etc.
9. Data streams: Ex: network traffic, video surveillance,
weather monitoring, etc.
10. WWW: Ex: web content mining, web page selection,
link mining, etc.
i) Data Mining Techniques
 Association
 Classification
 Prediction
 Clustering
 Link Mining
 Bagging and Boosting
 Web Mining
 Rough Sets
 Graph Mining
 Outlier Analysis

j) Top-10 Algorithms
1. C4.5: Programs for Machine Learning (61 votes)
2. K-Means (60 votes)
3. SVM: Support Vector Machine (58 votes)
4. Apriori (52 votes)
5. EM: Expectation Maximization, Finite Mixture Models (48 votes)
6. PageRank (46 votes)
7. AdaBoost (45 votes)
8. kNN: k-Nearest Neighbours (45 votes)
9. Naive Bayes (45 votes)
10. CART: Classification and Regression Trees (34 votes)
Data Mining Algorithms - Categories
 Classification:
C4.5
CART (Classification and Regression Trees)
kNN (k-Nearest Neighbours)
Naive Bayes
 Statistical Learning:
SVM (Support Vector Machine)
EM (Expectation Maximization)
 Association Analysis:
Apriori
FP-Growth (Frequent Pattern Growth)
 Link Analysis:
PageRank
HITS (Hyperlink-Induced Topic Search)
 Clustering:
K-Means
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
k) Data Mining Tools

 Data mining is a process that involves exploration and analysis of large volumes of data
using a combination of machine learning, artificial intelligence, etc.
 It is used for knowledge discovery from large volumes of data.
 Data mining software is available in both open-source and commercial forms. Examples:
JHepWork
Weka 3.7.5
Rapid Miner
R
Ggobi
Orange
Clementine
Tanagra
Sipina
Alpha Miner
Yale
l) APPLICATION AREAS FOR DATA MINING
 Financial Data Analysis
 Data mining in the retail industry
 Telecommunication Industry
 Biological Data Analysis
 Scientific Applications
 Intrusion Detection
 Pattern Recognition
 Sales/Marketing
 Banking / Finance
 Health Care and Insurance
 e - Commerce
2. Data Mining Functionalities
Data mining functionalities are used to specify
the kind of patterns to be found in data mining tasks.

Data mining functionalities include the discovery
of concept/class descriptions, associations and
correlations, classification, prediction, clustering,
trend analysis, outlier and deviation analysis, and
similarity analysis.

Characterization and discrimination are forms of
data summarization.
 Data mining tasks can be classified into two categories
1. Descriptive data mining: Tasks that characterize the general
properties of the data in the database. They find human-
interpretable patterns that describe the data.
2. Predictive data mining: Tasks that perform inference on the
current data in order to make predictions. They use some variables
to predict unknown or future values of other variables.

 Different views lead to different classifications
 Data view: Kinds of data to be mined
 Knowledge view: Kinds of knowledge to be discovered
 Method view: Kinds of techniques utilized
 Application view: Kinds of applications the system is adapted to
Data mining functionalities, and the kinds of
patterns they can discover:

1. Concept/Class Description: Characterization and
Discrimination
2. Mining Frequent Patterns, Associations, and
Correlations
3. Classification and Prediction
4. Cluster Analysis
5. Outlier Analysis
6. Evolution Analysis
1. Concept/Class Description: Characterization and
Discrimination: Data characterization summarizes
the data of the class under study, often called the target
class. Data discrimination compares the target
class with one or a set of comparative classes, often
called contrasting classes.
2. Mining Frequent Patterns, Associations, and
Correlations: Frequent patterns are patterns that occur
frequently in data. A frequent itemset typically refers to a
set of items that frequently appear together in a
transactional data set. Association rules that contain a
single predicate are referred to as single-dimensional
association rules. Association rules that contain more
than one attribute or predicate are referred to as
multidimensional association rules.
Fig: Classification and Prediction
3. Classification and Prediction: Classification is the
process of finding a model or function that describes
and distinguishes data classes or concepts. Prediction
estimates unknown or missing numerical values.
4. Cluster analysis: Groups objects having similar
properties, maximizing intra-class similarity and
minimizing inter-class similarity.
5. Outlier analysis: An outlier is a data object that does not
comply with the general behavior of the data.
6. Evolution analysis: Describes and models regularities
or trends for objects whose behavior changes over
time.
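
To make the frequent-itemset and association-rule idea from item 2 concrete, the following minimal sketch computes the support of an itemset and the confidence of a rule over a toy transactional data set. The transactions and the function names support and confidence are assumptions chosen for illustration; this shows only the two measures, not a full mining algorithm such as Apriori.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

# Hypothetical transactional data set (one set of items per transaction).
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

# Rule: computer => software
print(support(transactions, {"computer", "software"}))
print(confidence(transactions, {"computer"}, {"software"}))
```

Here {computer, software} appears in 2 of the 4 transactions, and 2 of the 3 transactions containing computer also contain software.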

3. Classification of Data Mining System
 A data mining system is a systematic approach to
collecting, organizing, processing, analyzing and
visualizing data sets.

 Data mining is an interdisciplinary field, the confluence of
a set of disciplines, including database systems, statistics,
machine learning, visualization, and information science.

 A data mining system can be classified according to the
kinds of databases mined, the kinds of knowledge mined,
the techniques used, and the applications it is adapted to.
Data mining tasks can be classified into two categories
1. Descriptive data mining: Tasks that characterize the general
properties of the data in the database. They find human-
interpretable patterns that describe the data.
Ex: Clustering, association rule discovery and sequence
pattern discovery.

2. Predictive data mining: Tasks that perform inference on the
current data in order to make predictions. They use some
variables to predict unknown or future values of other
variables.
Ex: Classification, regression and deviation detection.
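
The descriptive/predictive distinction can be illustrated with one example of each. The sketch below is a simplified illustration under stated assumptions: one-dimensional data, hand-picked starting centroids, and hypothetical customer ages and labels. kmeans_1d is a descriptive task (it groups unlabelled values), while knn_predict is a predictive task (it infers a label for a new value from labelled examples).

```python
def kmeans_1d(values, centroids, iterations=10):
    """Descriptive: group values around k centroids (Lloyd's iteration in 1-D)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as its cluster mean; keep the old one if the cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

def knn_predict(train, x, k=3):
    """Predictive: majority label among the k labelled examples nearest to x."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Hypothetical customer ages (unlabelled) for clustering.
ages = [22, 25, 27, 58, 61, 65]
centroids, clusters = kmeans_1d(ages, centroids=[20, 70])
print(clusters)

# Hypothetical labelled examples (age, label) for classification.
train = [(22, "buys"), (25, "buys"), (27, "buys"),
         (58, "does_not_buy"), (61, "does_not_buy"), (65, "does_not_buy")]
print(knn_predict(train, 30))
```

The clustering call needs no labels and only describes structure already in the data; the classifier needs labelled training data and answers a question about an unseen value.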
 Different views lead to different classifications
 Data view: Kinds of data to be mined
 Knowledge view: Kinds of knowledge to be discovered
 Method view: Kinds of techniques utilized
 Application view: Kinds of applications adapted

Fig: Classification of Data Mining System (1: according to the kinds
of databases mined; 2: according to the kinds of knowledge mined;
3: according to the kinds of techniques utilized; 4: according to
the applications adapted)


It consists of
1. Classification according to the kinds of
databases mined: A data mining system can be classified
according to the kinds of databases mined.
E.g.: Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multimedia,
heterogeneous, legacy, WWW.

2. Classification according to the kinds of
knowledge mined: Data mining systems can be categorized
according to the kinds of knowledge they mine, that is, based on
data mining functionalities.
E.g.: Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis.
3. Classification according to the kinds of techniques
utilized: Data mining systems can be categorized
according to the underlying data mining techniques
employed.
 Ex: Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization.
4. Classification according to the applications adapted:
Data mining systems can also be categorized according to
the applications they are adapted to.
 Ex: Retail, telecommunication, banking, fraud analysis,
bio-data mining, stock market analysis, text mining, Web
mining.
4. Data Mining Task Primitives
 A data mining task can be specified in the form of a data mining
query, which is input to the data mining system.
 A data mining query is defined in terms of data mining task primitives.
 A data mining query language can be designed to incorporate these
primitives, allowing users to interact flexibly with data mining
systems.
 These primitives are used to specify a data mining task in the form
of a data mining query.
 Primitives are classified into 5 types
1. Task-relevant data,
2. Knowledge type to be mined,
3. Background knowledge,
4. Pattern interestingness measures,
5. Visualization of discovered patterns.
Fig: Types of Data Mining Task Primitives

Fig: Data Mining Task Primitives (task-relevant data; knowledge type
to be mined; background knowledge; pattern interestingness measures;
visualization of discovered patterns)

Fig: Primitives for specifying a Data Mining Task
1. The set of task-relevant data to be mined: This specifies the
portions of the database or the set of data in which the user is
interested. Task-relevant data include
 Database or data warehouse name
 Database tables or data warehouse cubes
 Condition for data selection
 Relevant attributes or dimensions
 Data grouping criteria
2. The kind of knowledge to be mined: This specifies the data
mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification,
prediction, clustering, outlier analysis, or evolution analysis.
 Characterization
 Discrimination
 Association
 Classification/prediction
 Clustering
 Outlier analysis and other data mining tasks
3. The background knowledge to be used in the discovery process:
This knowledge about the domain to be mined is useful for guiding
the knowledge discovery process and for evaluating the patterns
found. It is commonly expressed as concept hierarchies:
 Schema hierarchy
E.g.: street < city < province_or_state < country
 Set-grouping hierarchy
E.g.: {20-39} = young, {40-59} = middle_aged
 Operation-derived hierarchy
E.g., for an email address: login-name < department < university < country
 Rule-based hierarchy
E.g.: low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
4. The interestingness measures and thresholds for pattern
evaluation: They may be used to guide the mining process or,
after discovery, to evaluate the discovered patterns.
 Simplicity: E.g., (association) rule length, (decision) tree size
 Certainty: E.g., confidence, P(A|B) = #(A and B) / #(B),
classification reliability or accuracy, certainty factor, rule strength,
rule quality, discriminating weight, etc.
 Utility: Potential usefulness, e.g., support (association), noise
threshold (description)
 Novelty: Not previously known, surprising (used to remove
redundant rules, e.g., Illinois vs. Champaign rule implication
support ratio).
5. The expected representation for visualizing the discovered
patterns: This refers to the form in which discovered patterns are to
be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.
 Different backgrounds/usages may require different forms of
representation
E.g., rules, tables, crosstabs, pie/bar charts, etc.
 Concept hierarchy is also important
 Discovered knowledge might be more understandable when
represented at a high level of abstraction
 Interactive drill up/down, pivoting, slicing and dicing provide
different perspectives on data
 Different kinds of knowledge require different representations:
association, classification, clustering, etc.
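
The five primitives can be pictured as the fields of a single query object. The sketch below is a hypothetical illustration only: MiningQuery, its field names, the example database and table names, and the helper functions are assumptions, not part of any real data mining query language. The set-grouping hierarchy and the rule-based hierarchy use the examples given under primitive 3.

```python
from dataclasses import dataclass, field

@dataclass
class MiningQuery:
    # 1. Task-relevant data
    database: str
    tables: list
    condition: str
    attributes: list
    # 2. Kind of knowledge to be mined
    knowledge_type: str
    # 3. Background knowledge (concept hierarchies)
    concept_hierarchy: dict = field(default_factory=dict)
    # 4. Interestingness measures and thresholds
    min_support: float = 0.0
    min_confidence: float = 0.0
    # 5. Expected representation of discovered patterns
    display_as: str = "rules"

# Set-grouping hierarchy from the text: {20-39} = young, {40-59} = middle_aged
AGE_HIERARCHY = {"young": range(20, 40), "middle_aged": range(40, 60)}

def generalize_age(age, hierarchy=AGE_HIERARCHY):
    """Map a raw age to its higher-level concept, or 'unknown' if no group matches."""
    for concept, ages in hierarchy.items():
        if age in ages:
            return concept
    return "unknown"

def low_profit_margin(price, cost, limit=50):
    """Rule-based hierarchy from the text: (P1 - P2) < $50."""
    return (price - cost) < limit

def is_interesting(query, support, confidence):
    """Apply the query's thresholds (primitive 4) to one discovered pattern."""
    return support >= query.min_support and confidence >= query.min_confidence

# Hypothetical query: all names and values here are made up for illustration.
query = MiningQuery(
    database="sales_db",
    tables=["customer", "purchases"],
    condition="purchase_year = 2008",
    attributes=["age", "item"],
    knowledge_type="association",
    concept_hierarchy=AGE_HIERARCHY,
    min_support=0.05,
    min_confidence=0.7,
    display_as="table",
)
print(generalize_age(25), low_profit_margin(120, 90), is_interesting(query, 0.1, 0.8))
```

Keeping all five primitives in one object mirrors the idea that a data mining query is defined entirely in terms of these primitives.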
5. Major issues in Data Mining
 The major issues in data mining are its open challenges or problems.
 They concern mining methodology, user interaction,
performance, diverse data types and social impacts.
 Other issues include the exploration of data mining
applications and their social impacts.

 Data preparation or preprocessing is a big issue for both
data mining and data warehousing.
 These issues are considered major requirements and
challenges for the further evolution of data mining
technology.
The major issues consist of

1. Mining methodology
2. User interaction issues
3. Performance issues
4. Issues relating to the diversity of database types
5. Applications and social impacts
Fig: Major issues in Data Mining (Mining Methodology; User
Interaction; Performance; Diversity of DB types; Applications and
Social impacts)
1. Mining Methodology: This concerns what kind of
methodology (association, classification, clustering) is used
for extracting knowledge.
 These issues reflect the kinds of knowledge mined, the ability to mine
knowledge at multiple granularities, the use of domain knowledge,
ad hoc mining, and knowledge visualization.
– Mining different kinds of knowledge from diverse data types,
e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge:
knowledge fusion
2. User interaction issues: These concern how easily the user can
interact with the data mining system.
– Data mining query languages and ad hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of
abstraction

3. Performance issues: These include efficiency, scalability, and
parallelization of data mining algorithms. Performance may be
rated as good, average or bad.
– Efficiency and scalability of data mining algorithms
– Parallel, distributed, and incremental mining algorithms
4. Issues relating to the diversity of database types: These
concern the handling of relational and complex types of data,
such as space and radar data, and mining information from
heterogeneous databases and global information
systems.

5. Applications and social impacts: These concern the security and
privacy levels in a data mining system.
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
6. Data Mining Applications
 Data mining is an interdisciplinary field with wide and
diverse applications.
 Many customized data mining tools have been developed
for domain-specific applications, including finance, the
retail industry, telecommunications, bioinformatics,
intrusion detection, and other science, engineering, and
government data analysis.

 Some application domains
1. Financial data analysis
2. Retail industry
3. Telecommunication industry
4. Biological data analysis
5. Data mining in other scientific applications
6. Data mining for intrusion detection
1. Data Mining for Financial Data Analysis: Financial data
collected in banks and financial institutions are often relatively
complete, reliable, and of high quality. Typical tasks are loan payment
prediction and customer credit policy analysis, classification and
clustering of customers for targeted marketing, and detection of money
laundering and other financial crimes.
 Examples: Cyber crime, Internet crimes

2. Data Mining for Retail Industry: The retail industry collects huge
amounts of data on sales, customer shopping history, etc.
Retail data mining is applied to identify customer buying
behaviors, discover customer shopping patterns and trends, and
improve the quality of customer service.
 Examples: Supermarkets, sales and shopping malls.
Multidimensional analysis of sales, customers, products, time, and
region.
3. Data Mining for Telecommunication Industry: A rapidly
expanding and highly competitive industry with a great demand for
data mining. Telecommunication data are intrinsically
multidimensional (calling time, duration, location of caller,
location of callee, type of call, etc.), which calls for
multidimensional analysis.
 Examples: Fraudulent pattern analysis and the identification of
unusual patterns.

4. Data Mining for Biomedical Data Analysis: DNA sequences are built
from 4 basic building blocks (nucleotides): adenine (A), cytosine (C),
guanine (G), and thymine (T). A gene is a sequence of hundreds of
individual nucleotides arranged in a particular order. Humans have
around 30,000 genes.
 Examples: Semantic integration of heterogeneous, distributed
genomic and proteomic databases. Discovery of structural patterns
and analysis of genetic networks and protein pathways.
5. Data Mining in Other Scientific Applications
 Satellite Imagery, GIS, Rocket Launching.
 Graph-based mining.
 Visualization tools and domain-specific knowledge.

6. Data Mining for Intrusion Detection


 Development of data mining algorithms for intrusion detection.
 Association and correlation analysis, and aggregation to help
select and build discriminating attributes.
 Analysis of stream data.
 Distributed data mining
 Visualization and querying tools.
