Introduction To Data Mining 1604

This document provides an introduction to data mining concepts and techniques. It discusses why data mining is needed due to the massive growth of data, and how data mining can extract useful knowledge and patterns from large datasets. The key steps of data mining including data cleaning, integration, selection, transformation and pattern evaluation are described. The document also covers what types of data can be mined, common data mining functionalities like classification, clustering, association rule mining, and the typical architecture of a data mining system.

Uploaded by

Akash Ranjan

DATA WAREHOUSING

&
DATA MINING

Prepared by:
Anita Parmar

DATA MINING:
CONCEPTS AND TECHNIQUES

— CHAPTER 2 —

Introduction to Data Mining

CHAPTER 2. INTRODUCTION

 Motivation: Why data mining?


 What is data mining?
 Data Mining: On what kind of data?
 Data mining functionality
 Classification of data mining systems
 Data mining task primitives
 Major issues in data mining
WHY DATA MINING?

 The Explosive Growth of Data: from terabytes to petabytes


 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society
 Major sources of rich data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
EVOLUTION OF DATABASE TECHNOLOGY
 1960s:
 Data collection, database creation, file creation
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web databases
 2000s:
 Data mining and its applications
 Web technology (XML, data integration)
data rich but information poor
We want to know ...
 Which types of transactions are likely to be fraudulent
given the transactional history of a particular customer?
 If I raise the price of my product by Rs. 2, what is the
effect on my business?
 If I offer only 2,500 as an incentive to purchase rather than
5,000, how many lost responses will result?
 If I emphasize ease-of-use of the product as opposed to its
technical capabilities, what will be the net effect on my
revenues?
 Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
WHAT IS DATA MINING?

 Data mining (knowledge discovery from data)


 Extraction of interesting patterns or knowledge from huge amounts of data.
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data dredging (searching), information harvesting (gathering), business intelligence, etc.

KNOWLEDGE DISCOVERY FROM DATA (KDD) PROCESS
 Data mining: core of the knowledge discovery process
[Figure: KDD process, from bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection and Transformation, Task-relevant Data, Data Mining, Pattern Evaluation]
KDD PROCESS: SEVERAL KEY STEPS
1. Data cleaning: to remove noise and inconsistent data (may take 60% of effort!)
2. Data integration: where multiple data sources may be combined.
3. Data selection: where data relevant to the analysis task are retrieved from the database.
4. Data transformation: where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
KDD PROCESS: SEVERAL KEY STEPS (2)

5. Data mining: search for patterns of interest.
 An essential process where intelligent methods are applied in order to extract data patterns.
6. Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user.
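The seven KDD steps can be sketched as a small pipeline. This is only a minimal illustration in plain Python under stated assumptions: the record layout, the two source lists, the thresholds, and every function name are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical raw data from two sources (names and fields are illustrative).
store_a = [
    {"customer": "c1", "item": "computer", "amount": 900.0},
    {"customer": "c2", "item": "printer", "amount": None},   # missing value
    {"customer": "c1", "item": "printer", "amount": 150.0},
]
store_b = [
    {"customer": "c3", "item": "computer", "amount": 1100.0},
    {"customer": "c3", "item": "scanner", "amount": -5.0},   # inconsistent value
]


def clean(records):
    """Step 1: drop records with missing or negative amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]


def integrate(*sources):
    """Step 2: combine multiple data sources into one list."""
    return [r for source in sources for r in source]


def select(records, items):
    """Step 3: keep only records relevant to the analysis task."""
    return [r for r in records if r["item"] in items]


def transform(records):
    """Step 4: aggregate (sum) the amount spent per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


def mine(totals, threshold):
    """Step 5: a deliberately trivial 'pattern': customers above a threshold."""
    return {c: t for c, t in totals.items() if t > threshold}


def evaluate(patterns, min_total):
    """Step 6: keep only patterns that pass an interestingness measure."""
    return {c: t for c, t in patterns.items() if t >= min_total}


def present(patterns):
    """Step 7: render the mined knowledge for the user."""
    return [f"{c}: total spend {t:.2f}" for c, t in sorted(patterns.items())]


def run_pipeline():
    data = integrate(clean(store_a), clean(store_b))
    data = select(data, {"computer", "printer"})
    totals = transform(data)
    patterns = evaluate(mine(totals, threshold=500.0), min_total=1000.0)
    return present(patterns)


if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

In a real system each step would be far more involved; the point here is only the order of the stages and the way each one consumes the output of the previous one.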

DATA MINING AND BUSINESS INTELLIGENCE

[Figure: pyramid of layers, listed here from top to bottom. The potential to support business decisions increases toward the top. The typical user named beside a layer in the figure is given in parentheses.]
 Decision Making (End User)
 Data Presentation: Visualization Techniques (Business Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses (DBA)
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
ARCHITECTURE OF A DATA MINING SYSTEM

ARCHITECTURE OF A DATA MINING SYSTEM (2)
 Database, data warehouse, WWW, or other information repository:
 A set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
 Data cleaning and data integration techniques may be performed on the data.

 Database or data warehouse server:
 Responsible for fetching the relevant data, based on the user’s data mining request.

 Knowledge base:
 Domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. For example:
 Concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
 User beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included.
 Additional interestingness constraints or thresholds, and metadata.
ARCHITECTURE OF A DATA MINING SYSTEM (3)
 Data mining engine:
 Essential to the data mining system.
 Consists of a set of functional modules.

 Pattern evaluation module:
 Employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
 It may use interestingness thresholds to filter out discovered patterns.
 In many systems, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used.

 User interface:
 Communicates between users and the data mining system.
 Allows the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
 Allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
DATA MINING: ON WHAT KINDS OF DATA?

 Database-oriented data sets and applications


 Relational database,
 data warehouse,
 transactional database
 Advanced data sets and advanced applications
 Object-relational databases
 Temporal data, sequence data (incl. bio-sequences), time-series data
 Time-related data, customer shopping sequences, sequences of values repeated over time (hourly, daily, monthly)
 Spatial data and spatiotemporal data
 Geographic database, VLSI data, satellite images etc.

DATA MINING: ON WHAT KINDS OF DATA?

 Heterogeneous databases and legacy databases


 E.g., information on student performance at different schools
 Data streams
 Multimedia database
 Text databases
 The World-Wide Web

DATA MINING FUNCTIONALITIES : WHAT KINDS OF PATTERNS CAN BE MINED

 Concept description: Characterization and discrimination


 Generalize, summarize, and contrast data characteristics,
 E.g., find characteristics of customers who spend more than 10,000 per month
 E.g., compare customers who shop regularly versus those who shop rarely
 Frequent patterns, association, correlation
 computer → printer [support = 0.5%, confidence = 75%]
 Classification and prediction
 Construct models (functions) that describe and distinguish classes or
concepts for future prediction
 E.g., classify countries based on climate, or classify cars based on gas mileage
 Predict some unknown or missing numerical values
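The two bracketed numbers attached to the computer/printer association rule are its support and confidence. As a minimal sketch of how they are computed, assuming a tiny hypothetical list of transactions (the data and function names below are illustrative only): support is the fraction of all transactions containing both sides of the rule, and confidence is the fraction of transactions containing the left side that also contain the right side.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with `antecedent` that also contain `consequent`."""
    n_antecedent = count(transactions, antecedent)
    n_both = count(transactions, set(antecedent) | set(consequent))
    return n_both / n_antecedent if n_antecedent else 0.0


# Hypothetical market-basket data.
transactions = [
    {"computer", "printer"},
    {"computer", "printer", "scanner"},
    {"computer"},
    {"printer"},
    {"computer", "printer"},
]

if __name__ == "__main__":
    rule_support = support(transactions, {"computer", "printer"})
    rule_confidence = confidence(transactions, {"computer"}, {"printer"})
    print(f"computer -> printer [{rule_support:.0%}, {rule_confidence:.0%}]")
```

On this toy data, three of the five transactions contain both items and four contain a computer, so the rule has 60% support and 75% confidence.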
DATA MINING FUNCTIONALITIES (2)
 Cluster analysis
 Class label is unknown: Group data to form new classes, e.g., cluster houses
to find distribution patterns
 Maximizing intra-class similarity & minimizing interclass similarity
 Outlier analysis
 Outlier: Data object that does not comply with the general behavior of the
data
 Noise or exception? Useful in fraud detection, rare events analysis
 Trend and evolution analysis
 Regularities or trends for objects whose behavior changes over time.
 E.g., stock exchange
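As one concrete illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). This is a simple statistical technique chosen here for brevity, not a method prescribed by the slides; the data, the function name, and the threshold of 2.0 are all assumptions.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical daily card spending; one value departs from the general behavior.
daily_spend = [10, 12, 11, 13, 12, 11, 10, 12, 95]

if __name__ == "__main__":
    print(z_score_outliers(daily_spend))
```

Whether a flagged value is noise to discard or an exception worth investigating (for example, possible fraud) is a judgment the analyst still has to make.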

ARE ALL THE “DISCOVERED” PATTERNS INTERESTING?

 Data mining may generate thousands of patterns: Not all of them are
interesting
 Interestingness measures
 A pattern is interesting if it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful, novel, or validates some
hypothesis that a user seeks to confirm
 Objective vs. subjective interestingness measures
 Objective: based on statistics and structures of patterns, e.g., support, confidence,
etc.
 Subjective: based on user’s belief in the data

FIND ALL AND ONLY INTERESTING PATTERNS?

 Find all the interesting patterns: Completeness


 Can a data mining system find all the interesting patterns? Do we need to
find all of the interesting patterns?
 Association vs. classification vs. clustering
 Search for only interesting patterns: An optimization problem
 Can a data mining system find only the interesting patterns?
 Approaches
 First generate all the patterns and then filter out the uninteresting ones
 Generate only the interesting patterns—mining query optimization
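The two approaches can be contrasted on frequent itemset mining. The sketch below is an illustration under stated assumptions, not a definitive implementation: "interesting" is taken to mean "appears in at least `min_count` transactions", the data and function names are hypothetical, and the second function uses Apriori-style level-wise pruning as one example of pushing the constraint into the search. It reduces, but does not entirely eliminate, the counting of uninteresting candidates.

```python
from itertools import combinations


def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def generate_then_filter(transactions, min_count):
    """Approach 1: enumerate every possible itemset, then filter."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            c = count(transactions, candidate)
            if c >= min_count:
                frequent[candidate] = c
    return frequent


def level_wise(transactions, min_count):
    """Approach 2: only extend itemsets that are already frequent."""
    items = sorted(set().union(*transactions))
    frequent = {}
    current = [frozenset([i]) for i in items]
    while current:
        k = len(current[0]) + 1  # size of the next level's candidates
        survivors = []
        for candidate in current:
            c = count(transactions, candidate)
            if c >= min_count:
                frequent[candidate] = c
                survivors.append(candidate)
        # Join frequent itemsets pairwise to form size-k candidates.
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            cand for cand in joined
            if all(frozenset(sub) in frequent for sub in combinations(cand, k - 1))
        ]
    return frequent


# Hypothetical market-basket data.
transactions = [
    frozenset({"computer", "printer"}),
    frozenset({"computer", "printer", "scanner"}),
    frozenset({"computer"}),
    frozenset({"printer"}),
    frozenset({"computer", "printer"}),
]

if __name__ == "__main__":
    for itemset, c in level_wise(transactions, min_count=3).items():
        print(sorted(itemset), c)
```

Both functions return the same itemsets; the difference is how many candidates each has to count. With three distinct items the first approach counts all seven non-empty itemsets, while the second never counts any itemset containing the infrequent "scanner" beyond the first level.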

CLASSIFICATION OF DATA MINING SYSTEMS

[Figure: data mining shown at the center of contributing disciplines: database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines]
CLASSIFICATION OF DATA MINING SYSTEMS (2)
 Kinds of Databases to be mined
 Relational, data warehouse, transactional, stream, object-oriented/relational,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
 Kinds of Knowledge to be mined
 Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
 Multiple/integrated functions and mining at multiple levels
 Kinds of Techniques utilized
 Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
PRIMITIVES THAT DEFINE A DATA MINING TASK

 Task-relevant data
 Database or data warehouse name
 Database tables or data warehouse cubes
 Condition for data selection
 Relevant attributes or dimensions
 Data grouping criteria
 Type of knowledge to be mined
 Characterization, discrimination, association, classification, prediction,
clustering, outlier analysis, other data mining tasks
 Background knowledge
 Pattern interestingness measurements
 Visualization/presentation of discovered patterns
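One way to picture the five primitives is as the fields of a task specification that a user hands to the mining system. The sketch below is a hypothetical container written for illustration; the class name, field names, and example values are assumptions and do not correspond to any real data mining query language.

```python
from dataclasses import dataclass, field

# Knowledge types named on the slide; a real system may support other tasks.
KNOWLEDGE_TYPES = {
    "characterization", "discrimination", "association",
    "classification", "prediction", "clustering", "outlier analysis",
}


@dataclass
class MiningTask:
    """Hypothetical container for the primitives that define a mining task."""
    # 1. Task-relevant data
    database: str
    tables: list
    condition: str
    attributes: list
    group_by: list = field(default_factory=list)
    # 2. Type of knowledge to be mined
    knowledge_type: str = "association"
    # 3. Background knowledge (e.g. concept hierarchies)
    background: dict = field(default_factory=dict)
    # 4. Pattern interestingness measurements
    min_support: float = 0.0
    min_confidence: float = 0.0
    # 5. Visualization/presentation of discovered patterns
    presentation: str = "table"

    def validate(self):
        """Check the specification and return it unchanged if it is usable."""
        if self.knowledge_type not in KNOWLEDGE_TYPES:
            raise ValueError(f"unknown knowledge type: {self.knowledge_type}")
        if not 0.0 <= self.min_support <= 1.0:
            raise ValueError("min_support must be between 0 and 1")
        if not 0.0 <= self.min_confidence <= 1.0:
            raise ValueError("min_confidence must be between 0 and 1")
        return self


# Example specification; every value here is made up.
task = MiningTask(
    database="sales_dw",
    tables=["transactions"],
    condition="year = 2024",
    attributes=["customer", "item"],
    knowledge_type="association",
    min_support=0.005,
    min_confidence=0.75,
).validate()
```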
PRIMITIVE 3: BACKGROUND KNOWLEDGE

 A typical kind of background knowledge: Concept hierarchies


 Schema hierarchy
 E.g., street < city < province_or_state < country
 Set-grouping hierarchy
 E.g., {20-39} = young, {40-59} = middle_aged
 Operation-derived hierarchy
 email address: [email protected]
login-name < department < university < country
 Rule-based hierarchy
 low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) < $50
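The four kinds of concept hierarchy can each be sketched as a small function. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, the e-mail address used is an invented placeholder (not the address from the slide), and the parser assumes a domain with exactly three dot-separated parts.

```python
# Schema hierarchy: street < city < province_or_state < country
SCHEMA_HIERARCHY = ["street", "city", "province_or_state", "country"]


def roll_up(level):
    """Return the next more general level, or None at the top."""
    i = SCHEMA_HIERARCHY.index(level)
    return SCHEMA_HIERARCHY[i + 1] if i + 1 < len(SCHEMA_HIERARCHY) else None


def age_group(age):
    """Set-grouping hierarchy: {20-39} = young, {40-59} = middle_aged."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return None  # the slide defines no group for other ages


def email_levels(address):
    """Operation-derived hierarchy:
    login-name < department < university < country.
    Assumes the form login@department.university.country."""
    login, domain = address.split("@")
    department, university, country = domain.split(".")
    return {
        "login-name": login,
        "department": department,
        "university": university,
        "country": country,
    }


def low_profit_margin(price, cost):
    """Rule-based hierarchy: low margin when (price - cost) < 50."""
    return (price - cost) < 50


if __name__ == "__main__":
    print(roll_up("city"))
    print(age_group(25), age_group(45))
    print(email_levels("alice@cs.example.ca"))  # hypothetical address
    print(low_profit_margin(price=120, cost=100))
```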
INTEGRATION OF DATA MINING AND DATA WAREHOUSING

 Data mining systems, DBMS, Data warehouse systems coupling


 No coupling, loose-coupling, semi-tight-coupling, tight-coupling

COUPLING DATA MINING WITH DB/DW SYSTEMS

 No coupling: flat file processing, not recommended
 Loose coupling
 Fetching data from DB/DW
 Semi-tight coupling: enhanced DM performance
 Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
 Tight coupling: a uniform information processing environment
 DM is smoothly integrated into a DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
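The difference between loose and semi-tight coupling can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions: the `sales` table, its rows, and the function names are hypothetical, and an in-memory SQLite database stands in for a real DB/DW system. In the loosely coupled version the database only hands over rows and the application aggregates them; in the other version the aggregation primitive runs inside the database.

```python
import sqlite3
from collections import defaultdict


def make_db():
    """Create a hypothetical in-memory sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("c1", 900.0), ("c1", 150.0), ("c2", 40.0), ("c3", 1100.0)],
    )
    return conn


def totals_loose(conn):
    """Loose coupling: fetch raw rows, aggregate in the application."""
    totals = defaultdict(float)
    for customer, amount in conn.execute("SELECT customer, amount FROM sales"):
        totals[customer] += amount
    return dict(totals)


def totals_in_db(conn):
    """Semi-tight coupling (sketch): push the aggregation into the database."""
    cursor = conn.execute(
        "SELECT customer, SUM(amount) FROM sales GROUP BY customer"
    )
    return dict(cursor.fetchall())


if __name__ == "__main__":
    conn = make_db()
    print(totals_loose(conn))
    print(totals_in_db(conn))
    conn.close()
```

Both functions return the same totals. The second moves less data out of the database and lets it use its own indexing and aggregation machinery, which is the performance argument for tighter coupling.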
MAJOR ISSUES IN DATA MINING

 Mining methodology
 Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
 Performance: efficiency, effectiveness, and scalability
 Pattern evaluation: the interestingness problem
 Incorporation of background knowledge
 Handling noise and incomplete data
 Parallel, distributed and incremental mining methods
 Integration of the discovered knowledge with existing knowledge: knowledge fusion

 User interaction
 Data mining query languages and ad-hoc mining
 Expression and visualization of data mining results
 Interactive mining of knowledge at multiple levels of abstraction

SUMMARY

 Data mining: Discovering interesting patterns from large amounts of data


 A natural evolution of database technology, in great demand, with wide
applications
 A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation
 Mining can be performed in a variety of information repositories
 Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
 Data mining systems and architectures
 Major issues in data mining