Module 2 Data Mining
(CSA4003)
Module-2
Introduction To Data Mining
1/29/2023
Motivation: Why data mining?
Evolution of Database Technology
1970s - early 1980s:
Database management systems
Hierarchical and network database systems
Relational database systems
Query languages: SQL
Transactions, concurrency control, and recovery (concurrency control in a DBMS is the procedure for managing simultaneous transactions while preserving their atomicity, consistency, isolation, and durability)
On-line transaction processing (OLTP)
1/30/2023
Late 1980s-present
Advanced Data Analysis
Data warehouse and OLAP
Data mining and knowledge discovery
Advanced data mining applications
Data mining and society
1990s-present:
XML-based database systems
Integration with information retrieval
Data and information integration
Present – future:
New generation of integrated data and information systems
What Is Data Mining?
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD process, from bottom to top: Databases, Data Cleaning and Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation]
Steps of a KDD Process
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
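The seven steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the record layout, the two sources (store_a, store_b), the function names, and the toy "pattern" (average spend per age band) are all assumptions made for the example, not part of the course material.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "age": 25, "spend": 120.0},
           {"customer": "c2", "age": None, "spend": 80.0}]
store_b = [{"customer": "c3", "age": 41, "spend": 300.0},
           {"customer": "c4", "age": 38, "spend": 20.0}]

def clean(records):
    """Step 1, data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one data set."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3, data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4, data transformation: discretize age into decade bands."""
    return [{**r, "age_band": f"{r['age'] // 10 * 10}s"} for r in records]

def mine(records):
    """Step 5, data mining: a toy pattern, average spend per age band."""
    bands = {}
    for r in records:
        bands.setdefault(r["age_band"], []).append(r["spend"])
    return {band: mean(vals) for band, vals in bands.items()}

def evaluate(patterns, min_spend):
    """Step 6, pattern evaluation: keep only patterns judged interesting."""
    return {b: s for b, s in patterns.items() if s >= min_spend}

def present(patterns):
    """Step 7, knowledge presentation: render the patterns for the user."""
    return [f"{band}: average spend {spend:.2f}"
            for band, spend in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
data = transform(select(data, ["age", "spend"]))
report = present(evaluate(mine(data), min_spend=100.0))
```

With the sample records, report holds one line for the 20s band and one for the 40s band; the 30s band is filtered out at the evaluation step.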
Learning the application domain: relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Architecture of a Typical Data Mining System
[Figure: layered system architecture; surviving labels: Graphical user interface, Pattern evaluation, Databases, Data Warehouse]
Data Mining and Business Intelligence
[Figure: business intelligence layers, with increasing potential to support business decisions toward the top; surviving labels, top to bottom: Making Decisions (End User), Data Exploration (Statistical Analysis, Querying and Reporting)]
Data Mining: On What Kind of Data?
Answer: On any kind of data.
Relational databases
Data warehouses
Transactional databases
Data Mining Functionalities: What kind of patterns can be mined?
Data mining tasks are generally classified into two categories: descriptive and predictive.
Concept description: Characterization and discrimination
Data can be associated with classes or concepts.
Example: in the All Electronics store, classes of items for sale include computers and printers.
A description of a class or concept is called a class/concept description.
Data characterization: summarization of the general features of a target class of data.
Data discrimination: comparison of the target class with one or more contrasting classes.
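A minimal sketch of the two operations, assuming a hypothetical list of sales records with an item_class and a price field (the records, field names, and function names are invented for illustration): characterization summarizes one class, and discrimination sets that summary next to the summary of a contrasting class.

```python
from statistics import mean

# Hypothetical sales records; the item classes follow the store example above.
sales = [
    {"item_class": "computer", "price": 900.0},
    {"item_class": "computer", "price": 1100.0},
    {"item_class": "printer", "price": 150.0},
    {"item_class": "printer", "price": 250.0},
]

def characterize(records, target):
    """Characterization: summarize general features of the target class."""
    prices = [r["price"] for r in records if r["item_class"] == target]
    return {"count": len(prices), "mean_price": mean(prices),
            "min_price": min(prices), "max_price": max(prices)}

def discriminate(records, target, contrast):
    """Discrimination: pair each target feature with the contrasting class's value."""
    t = characterize(records, target)
    c = characterize(records, contrast)
    return {feature: (t[feature], c[feature]) for feature in t}

computer_summary = characterize(sales, "computer")
comparison = discriminate(sales, "computer", "printer")
```

Each entry of comparison is a (target, contrast) pair, so comparison["mean_price"] contrasts the average computer price with the average printer price.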
Association Analysis
Multi-dimensional vs. single-dimensional association
Multi-dimensional (several attributes):
age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
Single-dimensional (one predicate, contains):
contains(T, “computer”) => contains(T, “software”) [support = 1%, confidence = 75%]
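Support and confidence for a rule such as contains(T, "computer") => contains(T, "software") can be computed directly from a transaction list. The sketch below uses a small made-up transaction set, so its numbers differ from the percentages quoted above; the function names are illustrative.

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software"},
    {"printer", "paper"},
    {"paper"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent. Assumes the antecedent occurs at least once."""
    return support(antecedent | consequent, db) / support(antecedent, db)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
```

Here 3 of the 8 transactions contain both items, and 3 of the 4 transactions containing a computer also contain software, so the rule has support 37.5% and confidence 75%.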
Classification and Prediction
Finding models (functions) that describe and distinguish data classes or concepts, in order to predict the class of objects whose class label is unknown
E.g., classify countries based on climate, or classify cars based on gas mileage
Models: decision trees, classification rules (if-then), neural networks
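As a small illustration of the gas-mileage example, the sketch below learns a single if-then rule (a one-level decision tree, often called a decision stump) from labeled cars and then applies it to a car whose label is unknown. The training values, the labels "low" and "high", and the function names are assumptions made for the example.

```python
# Hypothetical labeled training data: (miles per gallon, class label).
training = [(12.0, "low"), (15.0, "low"), (18.0, "low"),
            (28.0, "high"), (32.0, "high"), (40.0, "high")]

def learn_threshold(examples):
    """Pick the mpg cut point that misclassifies the fewest training examples."""
    values = sorted(v for v, _ in examples)
    # Candidate cut points: midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cut):
        # An example is wrong when "mpg > cut" disagrees with its "high" label.
        return sum((v > cut) != (label == "high") for v, label in examples)

    return min(candidates, key=errors)

def classify(mpg, cut):
    """If-then rule: IF mpg > cut THEN 'high' ELSE 'low'."""
    return "high" if mpg > cut else "low"

cut = learn_threshold(training)
predicted = classify(35.0, cut)
```

The learned model is the threshold alone; a real decision tree would repeat this kind of split recursively over several attributes.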
Cluster analysis
Unlike classification, which analyzes class-labeled data objects, clustering analyzes data objects without consulting a known class label.
Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
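One common way to act on this principle is k-means, used here only as an illustrative choice of algorithm: points are grouped around the nearest center, which keeps members of a cluster close to each other and far from other clusters. The sketch works on one-dimensional points with hand-picked starting centers; the data and function name are assumptions.

```python
from statistics import mean

def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D points, starting from the given centers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# No class labels are supplied; the grouping comes from the data alone.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[1.0, 12.0])
```

No labels are consulted at any point, which is what separates this from the classification setting.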
Outlier analysis
Outlier: a data object that does not comply with the general behavior or model of the data
Outliers can be treated as noise or exceptions, but they are quite useful in fraud detection and rare-event analysis
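A simple way to flag such objects, assuming the "general behavior" is modeled by the mean and standard deviation of the data, is a z-score test. The transaction amounts, the threshold of 2.0, and the function name below are illustrative assumptions; real fraud detection uses far richer models.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far from the usual range.
amounts = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 22.0, 21.0, 500.0]
outliers = find_outliers(amounts)
```

Whether the flagged value is discarded as noise or investigated as possible fraud depends on the application.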
Data Mining: Classification Schemes
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Information Science, Machine Learning, Visualization, and Other Disciplines]
Data Mining systems: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Databases to be mined
Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural networks, etc.
Applications adopted
Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
Major Issues in Data Mining
2. Performance issues
Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining methods
Data Mining Task Primitives
ARCHITECTURE OF A TYPICAL DATA MINING
SYSTEM
Integration Schemes
Q: Describe the differences between the following approaches for the integration of a data mining system with a database or data warehouse system: no coupling, loose coupling, semitight coupling, and tight coupling.
State which approach you think is the most popular, and why.