Introduction To Data Mining 1604

This document provides an introduction to data mining concepts and techniques. It discusses why data mining is needed due to the massive growth of data, and how data mining can extract useful knowledge and patterns from large datasets. The key steps of data mining including data cleaning, integration, selection, transformation and pattern evaluation are described. The document also covers what types of data can be mined, common data mining functionalities like classification, clustering, association rule mining, and the typical architecture of a data mining system.

Uploaded by

Akash Ranjan


DATA WAREHOUSING & DATA MINING

Prepared by:
Anita Parmar

DATA MINING:
CONCEPTS AND TECHNIQUES

— CHAPTER 2 —

Introduction to Data Mining

CHAPTER 2. INTRODUCTION

• Motivation: Why data mining?
• What is data mining?
• Data mining: On what kind of data?
• Data mining functionality
• Classification of data mining systems
• Data mining task primitives
• Major issues in data mining
WHY DATA MINING?

• The explosive growth of data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of rich data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
EVOLUTION OF DATABASE TECHNOLOGY
• 1960s:
  - Data collection, database creation, file creation
• 1970s:
  - Relational data model, relational DBMS implementation
• 1980s:
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  - Data mining and its applications
  - Web technology (XML, data integration)
data rich but information poor
We want to know ...
• Which types of transactions are likely to be fraudulent, given the transactional history of a particular customer?
• If I raise the price of my product by Rs. 2, what is the effect on my business?
• If I offer only 2,500 as an incentive to purchase rather than 5,000, how many lost responses will result?
• If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
• Which of my customers are likely to be the most loyal?
Data mining helps extract such information.
WHAT IS DATA MINING?

• Data mining (knowledge discovery from data)
  - Extraction of interesting patterns or knowledge from huge amounts of data.
• Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data dredging (searching), information harvesting (gathering), business intelligence, etc.
KNOWLEDGE DISCOVERY FROM DATA (KDD) PROCESS
• Data mining: the core of the knowledge discovery process

[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection and Transformation → Task-relevant Data → Data Mining → Pattern Evaluation]

KDD PROCESS: SEVERAL KEY STEPS
1. Data cleaning: removes noise and inconsistent data (may take 60% of the effort!).
2. Data integration: multiple data sources may be combined.
3. Data selection: data relevant to the analysis task are retrieved from the database.
4. Data transformation: data are transformed or consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.

KDD PROCESS: SEVERAL KEY STEPS (CONTINUED)

5. Data mining: search for patterns of interest. An essential process where intelligent methods are applied in order to extract data patterns.
6. Pattern evaluation: identifies the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user.

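The seven KDD steps can be sketched as a chain of small functions. This is only an illustrative sketch: the records, field names, and the 60% support threshold are invented for the example, and the "mining" step is a deliberately simple item count rather than any particular algorithm from the slides.

```python
from collections import Counter

# Hypothetical raw sources; all records and field names are invented.
store_a = [
    {"customer": "c1", "item": "computer", "amount": 900.0},
    {"customer": "c1", "item": "printer", "amount": 150.0},
    {"customer": "c2", "item": "computer", "amount": None},   # missing value
]
store_b = [
    {"customer": "c3", "item": "computer", "amount": 950.0},
    {"customer": "c3", "item": "printer", "amount": 140.0},
    {"customer": "c4", "item": "scanner", "amount": -5.0},    # inconsistent value
    {"customer": "c5", "item": "computer", "amount": 870.0},
    {"customer": "c5", "item": "scanner", "amount": 80.0},
]

def clean(records):
    """Step 1: drop records with missing or inconsistent amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] > 0]

def integrate(*sources):
    """Step 2: combine several data sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the attributes relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: consolidate rows into one set of items per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: count how many customers bought each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support):
    """Step 6: keep only items whose support reaches the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

def present(patterns):
    """Step 7: render the surviving patterns as readable lines."""
    return [f"{item}: support {s:.0%}" for item, s in sorted(patterns.items())]

integrated = integrate(clean(store_a), clean(store_b))
baskets = transform(select(integrated, ["customer", "item"]))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.6)
report = present(patterns)
for line in report:
    print(line)
```

With this toy data the scanner is bought by only one of three remaining customers, so it is filtered out at the evaluation step while computer and printer survive.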
DATA MINING AND BUSINESS INTELLIGENCE

[Figure: pyramid of layers, with increasing potential to support business decisions from bottom to top; the role shown beside each layer is given in parentheses.]
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
ARCHITECTURE OF A DATA MINING SYSTEM

ARCHITECTURE OF A DATA MINING SYSTEM (CONTINUED)
• Database, data warehouse, WWW, or other information repository:
  - A set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
  - Data cleaning and data integration techniques may be performed on the data.

• Database or data warehouse server:
  - Responsible for fetching the relevant data, based on the user’s data mining request.

• Knowledge base:
  - Domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. For example:
    - Concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
    - User beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included.
    - Additional interestingness constraints or thresholds, and metadata.
ARCHITECTURE OF A DATA MINING SYSTEM (CONTINUED)
• Data mining engine:
  - Essential to the data mining system.
  - Consists of a set of functional modules.

• Pattern evaluation module:
  - Employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
  - It may use interestingness thresholds to filter out discovered patterns.
  - In many systems, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used.

• User interface:
  - Communicates between users and the data mining system.
  - Allows the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
  - Allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
DATA MINING: ON WHAT KINDS OF DATA?

• Database-oriented data sets and applications
  - Relational database
  - Data warehouse
  - Transactional database
• Advanced data sets and advanced applications
  - Object-relational databases
  - Temporal data, sequence data (incl. bio-sequences), time-series data
    - Time-related data, customer shopping sequences, sequences of values repeated over time (hourly, daily, monthly)
  - Spatial data and spatiotemporal data
    - Geographic databases, VLSI data, satellite images, etc.

DATA MINING: ON WHAT KINDS OF DATA? (CONTINUED)

• Heterogeneous databases and legacy databases
  - E.g., information on students' performance at different schools
• Data streams
• Multimedia databases
• Text databases
• The World-Wide Web

DATA MINING FUNCTIONALITIES: WHAT KINDS OF PATTERNS CAN BE MINED?

• Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics
  - E.g., find characteristics of customers who spend more than 10,000 per month
  - E.g., compare customers who shop regularly versus those who shop rarely
• Frequent patterns, association, correlation
  - computer → printer [support = 0.5%, confidence = 75%]
• Classification and prediction
  - Construct models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Predict some unknown or missing numerical values
DATA MINING FUNCTIONALITIES (2)
• Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Maximize intra-class similarity and minimize inter-class similarity
• Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - Noise or exception? Useful in fraud detection and rare events analysis
• Trend and evolution analysis
  - Regularities or trends for objects whose behavior changes over time
  - E.g., stock exchange data

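To make the cluster analysis idea concrete, here is a minimal sketch of grouping unlabeled values so that members of a group are close to each other and far from other groups. It uses plain k-means on a single attribute, which is one common choice and not something the slides prescribe; the house prices and the starting centroids are invented.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on one numeric attribute, starting from given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return centroids, clusters

# Hypothetical house prices (in thousands); no class labels are given.
house_prices = [95, 100, 105, 300, 310, 320]
centroids, clusters = kmeans_1d(house_prices, centroids=[95, 320])
print(centroids)  # [100.0, 310.0]
print(clusters)   # [[95, 100, 105], [300, 310, 320]]
```

The two groups that emerge act as the "new classes" mentioned above: nothing in the input said which houses belong together.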
ARE ALL THE “DISCOVERED” PATTERNS INTERESTING?

• Data mining may generate thousands of patterns: not all of them are interesting
• Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user’s beliefs about the data

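Support and confidence, the two objective measures named above, can be computed directly from a set of transactions. The sketch below uses the usual definitions (support = fraction of transactions containing all items of the rule, confidence = support of the whole rule divided by support of its left-hand side). The transactions are invented and chosen so that the rule computer → printer reaches a confidence of 75%.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent. Returns 0.0 if the antecedent never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical market-basket data.
transactions = [
    {"computer", "printer"},
    {"computer", "printer", "scanner"},
    {"computer", "printer"},
    {"computer"},
    {"scanner"},
    {"printer"},
    {"scanner", "paper"},
    {"paper"},
]

rule_support = support(transactions, {"computer", "printer"})
rule_confidence = confidence(transactions, {"computer"}, {"printer"})
print(f"computer -> printer [{rule_support:.1%}, {rule_confidence:.0%}]")
```

Here 3 of 8 transactions contain both items (support 37.5%), and 3 of the 4 transactions with a computer also contain a printer (confidence 75%).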
FIND ALL AND ONLY INTERESTING PATTERNS?

• Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  - Association vs. classification vs. clustering
• Search for only interesting patterns: an optimization problem
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting patterns (mining query optimization)

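The two approaches can be contrasted on frequent itemsets. In this sketch "interesting" is assumed to mean "appears in at least min_count transactions"; the slides do not fix a measure or an algorithm. The first function enumerates every itemset and filters afterwards; the second grows itemsets level by level and only builds candidates from itemsets already known to be frequent (an Apriori-style pruning, named here as an assumption). The data is invented.

```python
from itertools import combinations

def support_count(transactions, itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def generate_then_filter(transactions, min_count):
    """Approach 1: enumerate all non-empty itemsets, then filter."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset(c) for k in range(1, len(items) + 1)
                  for c in combinations(items, k)]
    frequent = {c for c in candidates
                if support_count(transactions, c) >= min_count}
    return frequent, len(candidates)

def levelwise(transactions, min_count):
    """Approach 2: only extend itemsets that are themselves frequent."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]
    frequent, examined = set(), 0
    while level:
        examined += len(level)
        kept = {c for c in level
                if support_count(transactions, c) >= min_count}
        frequent |= kept
        size = len(next(iter(kept))) + 1 if kept else 0
        # Join frequent itemsets into candidates one item larger.
        joined = {a | b for a in kept for b in kept if len(a | b) == size}
        # Keep a candidate only if all of its smaller subsets are frequent.
        level = [c for c in joined
                 if all(frozenset(s) in kept for s in combinations(c, size - 1))]
    return frequent, examined

transactions = [
    {"computer", "printer"},
    {"computer", "printer", "scanner"},
    {"computer", "printer"},
    {"computer", "paper"},
    {"scanner", "paper"},
]
all_frequent, n_all = generate_then_filter(transactions, min_count=2)
pruned_frequent, n_pruned = levelwise(transactions, min_count=2)
print(n_all, n_pruned)                  # 15 10
print(all_frequent == pruned_frequent)  # True
```

Both approaches return the same five frequent itemsets, but the level-wise version examines 10 candidates instead of 15; the gap grows quickly as the number of distinct items increases.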
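CLASSIFICATION OF DATA MINING SYSTEM

[Figure: data mining as a confluence of multiple disciplines. Data Mining sits at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines.]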
CLASSIFICATION OF DATA MINING SYSTEM (CONTINUED)
• Kinds of databases to be mined
  - Relational, data warehouse, transactional, stream, object-oriented/relational, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
• Kinds of knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
• Kinds of techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
• Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
PRIMITIVES THAT DEFINE A DATA MINING TASK

• Task-relevant data
  - Database or data warehouse name
  - Database tables or data warehouse cubes
  - Condition for data selection
  - Relevant attributes or dimensions
  - Data grouping criteria
• Type of knowledge to be mined
  - Characterization, discrimination, association, classification, prediction, clustering, outlier analysis, other data mining tasks
• Background knowledge
• Pattern interestingness measurements
• Visualization/presentation of discovered patterns
PRIMITIVE 3: BACKGROUND KNOWLEDGE

• A typical kind of background knowledge: concept hierarchies
  - Schema hierarchy
    - E.g., street < city < province_or_state < country
  - Set-grouping hierarchy
    - E.g., {20-39} = young, {40-59} = middle_aged
  - Operation-derived hierarchy
    - email address: [email protected]
      login-name < department < university < country
  - Rule-based hierarchy
    - low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
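Three of the concept hierarchies above translate almost directly into code. The sketch below is illustrative only: the function names, the sample location, and the choice to return None for ages outside the two listed groups are assumptions, since the slide gives only the hierarchy definitions.

```python
def age_group(age):
    """Set-grouping hierarchy: {20-39} = young, {40-59} = middle_aged."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return None  # the slide defines no group outside these two ranges

# Schema hierarchy: street < city < province_or_state < country
SCHEMA_LEVELS = ["street", "city", "province_or_state", "country"]

def roll_up(location, level):
    """Generalize a location by dropping every attribute below the given level."""
    start = SCHEMA_LEVELS.index(level)
    return {name: location[name] for name in SCHEMA_LEVELS[start:]}

def low_profit_margin(price, cost):
    """Rule-based hierarchy:
    low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50."""
    return (price - cost) < 50

# Hypothetical record used only to exercise the functions.
location = {
    "street": "12 Main St",
    "city": "Vancouver",
    "province_or_state": "British Columbia",
    "country": "Canada",
}
print(age_group(25), age_group(45))
print(roll_up(location, "province_or_state"))
print(low_profit_margin(price=120, cost=90))
```

Rolling up replaces detailed values with more general ones, which is what lets a mining system report patterns at the level of countries or age groups rather than individual streets or ages.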
INTEGRATION OF DATA MINING AND DATA WAREHOUSING

• Coupling of data mining systems with DBMS and data warehouse systems
  - No coupling, loose coupling, semi-tight coupling, tight coupling

COUPLING DATA MINING WITH DB/DW SYSTEMS

• No coupling: flat file processing, not recommended
• Loose coupling
  - Fetching data from DB/DW
• Semi-tight coupling: enhanced DM performance
  - Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
• Tight coupling: a uniform information processing environment
  - DM is smoothly integrated into a DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
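The difference between loose and semi-tight coupling can be sketched with Python's built-in sqlite3 module standing in for the DB/DW system. The table, its columns, and the data are invented. In the loose variant the database only hands over rows and the aggregation happens in application code; in the semi-tight variant the aggregation primitive runs inside the database.

```python
import sqlite3

# An in-memory database stands in for the DB/DW system (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("c1", "computer", 900.0),
        ("c1", "printer", 150.0),
        ("c2", "computer", 950.0),
        ("c3", "printer", 140.0),
    ],
)

# Loose coupling: fetch the rows, then aggregate in application code.
totals_loose = {}
for customer, amount in conn.execute("SELECT customer, amount FROM sales"):
    totals_loose[customer] = totals_loose.get(customer, 0.0) + amount

# Semi-tight coupling: let the database run the aggregation primitive.
totals_semi_tight = dict(
    conn.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer")
)
conn.close()

print(totals_loose)
print(totals_semi_tight)
```

Both variants give the same totals; the second moves less data out of the database and can benefit from its indexing and query processing.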
MAJOR ISSUES IN DATA MINING

• Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed, and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion

• User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction

SUMMARY

• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Data mining systems and architectures
• Major issues in data mining
