DWDM-Unit 2 CH-1

Uploaded by

karunakar

Data Mining:

Concepts and Techniques

— Unit I: Chapter 1 —

Unit I: Chapter 1. Introduction
• Data Mining Fundamentals:
  • Motivation: Why data mining?
  • What is data mining?
  • Data mining: On what kind of data?
  • Data mining functionality
  • Classification of data mining systems
  • Top-10 most popular data mining algorithms
  • Major issues in data mining
  • Overview of the course
November 12, 2024 Data Mining: Concepts and Techniques 2
Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
  • Data collection and data availability
    • Automated data collection tools, database systems, the Web, a computerized society
  • Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: remote sensing, bioinformatics, scientific simulation, …
    • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets


Evolution of Database Technology
• 1960s:
  • Data collection, database creation, IMS (hierarchical) and network DBMS
• 1970s:
  • Relational data model, relational DBMS implementation
• 1980s:
  • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
  • Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global information systems


Need for Data Mining
• In information technology, a huge amount of data is available that needs to be turned into useful information.
• This information can then be used for applications such as market analysis, fraud detection, customer retention, production control, and science exploration.
Data Mining Applications
• Market Analysis and Management
• Corporate Analysis & Risk Management
• Fraud Detection
• Customer Retention
• Financial Banking
• Future Healthcare
• Criminal Investigation
• Market Basket Analysis
• Other Applications
Market Analysis and Management
Data mining is used in the following areas of marketing:
• Customer Profiling - Data mining helps determine what kind of people buy what kind of products.
• Identifying Customer Requirements - Data mining helps identify the best products for different customers. It uses prediction to find the factors that may attract new customers.
• Cross-Market Analysis - Data mining finds associations and correlations between product sales.
• Target Marketing - Data mining helps find clusters of model customers who share the same characteristics, such as interests, spending habits, and income.
• Determining Customer Purchasing Patterns - Data mining helps determine customers' purchasing patterns.
• Providing Summary Information - Data mining provides various multidimensional summary reports.
Corporate Analysis & Risk Management
Data mining is used in the following areas of the corporate sector:
• Finance Planning and Asset Evaluation - Involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.
• Resource Planning - Involves summarizing and comparing resources and spending.
• Competition - Involves monitoring competitors and market directions.

Fraud Detection
• Data mining is also used in credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it examines the destination of the call, its duration, and the time of day or week. It also analyzes patterns that deviate from expected norms.

Other Applications
• Data mining is also used in other fields such as sports, astronomy, and Internet Web Surf-Aid.
What Is Data Mining?
• Data mining (knowledge discovery from data)
  • Extracting or “mining” knowledge from large amounts of data
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  • Data mining: a misnomer?
• Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”? The following are not:
  • Simple search and query processing
  • (Deductive) expert systems
Knowledge Discovery (KDD) Process
• Data mining is the core of the knowledge discovery process
[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
7 steps in KDD process
• Data cleaning: removes noise and inconsistent data
• Data integration: combines multiple data sources
• Data selection: retrieves the data relevant to the analysis task from the database
• Data transformation: transforms and consolidates data into forms appropriate for mining, for example by performing summary or aggregation operations
• Data mining: the essential step, in which intelligent methods are applied to extract data patterns
• Pattern evaluation: identifies the truly interesting patterns representing knowledge, based on interestingness measures
• Knowledge presentation: uses visualization and knowledge representation techniques to present the mined knowledge to users
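
The seven steps can be sketched as a chain of small functions. This is only an illustration: the records, the function names, and the "rank items by total spending" stand-in for the mining step are all assumptions made for the example, not part of the slides.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "item": "milk", "amount": 2.0},
           {"customer": "c2", "item": "bread", "amount": None}]
store_b = [{"customer": "c1", "item": "bread", "amount": 1.5},
           {"customer": "c3", "item": "milk", "amount": 2.5},
           {"customer": "c3", "item": "pen", "amount": 0.5}]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Data selection: keep only the task-relevant items."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Data transformation: aggregate spending per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Data mining (toy stand-in): rank items by total spending."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(patterns, min_total):
    """Pattern evaluation: keep patterns that reach a threshold."""
    return [(item, total) for item, total in patterns if total >= min_total]

def present(patterns):
    """Knowledge presentation: render the results as readable lines."""
    return [f"{item}: {total:.2f}" for item, total in patterns]

records = select(integrate(clean(store_a), clean(store_b)), {"milk", "bread"})
report = present(evaluate(mine(transform(records)), min_total=2.0))
print(report)
```

A real system would replace `mine` with an actual mining algorithm; the point here is only the order in which the steps feed each other.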
Architecture: Typical Data Mining System
• Database, data warehouse, World Wide Web, or other information repository:
  • This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
  • Data cleaning and data integration techniques may be performed on the data.
• Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
• Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
  • Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
• Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
• Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
• User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
Why Not Traditional Data Analysis?
• Tremendous amount of data
  • Algorithms must be highly scalable to handle data volumes such as terabytes
• High dimensionality of data
  • A microarray may have tens of thousands of dimensions
• High complexity of data
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data
  • Structure data, graphs, social networks, and multi-linked data
  • Heterogeneous databases and legacy databases
  • Spatial, spatiotemporal, multimedia, text, and Web data
  • Software programs, scientific simulations
• New and sophisticated applications


Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  • Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Structure data, graphs, social networks, and multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data and spatiotemporal data
  • Multimedia databases
  • Text databases
  • The World Wide Web
• Relational databases:
  • A relational database is a collection of tables.
  • It can be accessed through database queries.
• Data warehouse:
  • A repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
  • It is constructed via a process of data cleaning, data transformation, data integration, data loading, and periodic data refreshing.
  • It is usually modeled by a multidimensional data cube.



• Transactional databases:
  • A transactional database consists of a file in which each record represents a transaction.



• Object-relational databases: built on the object-relational data model.
• Spatial databases: contain spatially related information, for example geographic (map) databases and medical and satellite image databases. Spatial data may be represented in raster format; maps may be represented in vector format.
• Temporal databases and time-series databases: both store time-related data. A temporal database stores relational data that includes time-related attributes (possibly several timestamps). A time-series database stores sequences of values that change over time.



• Text databases and multimedia databases:
  • Text databases contain word descriptions of objects, such as bug reports and summary reports.
  • Multimedia databases store image, audio, and video data.
  • For multimedia database mining, storage and search techniques need to be integrated with standard data mining methods (such as the construction of data cubes).
• Heterogeneous databases and legacy databases:
  • A legacy database is a group of heterogeneous databases that combines different kinds of data systems.
• World Wide Web:
  • Mining the Web includes understanding user access patterns (mining path traversal patterns).



Data Mining Functionalities
The following are the data mining functionalities and the kinds of patterns they discover:
• Class/concept description: characterization and discrimination
• Mining frequent patterns: association and correlation analysis
• Classification and prediction
• Cluster analysis
• Outlier analysis
• Evolution analysis



Class/Concept Description: Characterization and Discrimination
• Information integration and data warehouse construction
  • Data cleaning, transformation, integration, and multidimensional data model
• Data cube technology
  • Scalable methods for computing (i.e., materializing) multidimensional aggregates
  • OLAP (online analytical processing)
• Multidimensional concept description: characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
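
To make "materializing multidimensional aggregates" concrete, the sketch below sums one measure for every subset of the dimensions, which is the brute-force way to fill a small data cube. The fact table, the dimension names, and the `compute_cube` helper are hypothetical; real systems use far more scalable methods.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical fact table: (region, season, rainfall in mm).
facts = [("north", "summer", 120), ("north", "winter", 30),
         ("south", "summer", 10), ("south", "winter", 5)]
dimensions = ("region", "season")

def compute_cube(facts, dimensions):
    """Sum the measure for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), k):
            totals = defaultdict(int)
            for row in facts:
                # Group by the chosen dimensions; the measure is the last field.
                key = tuple(row[d] for d in dims)
                totals[key] += row[-1]
            cube[tuple(dimensions[d] for d in dims)] = dict(totals)
    return cube

cube = compute_cube(facts, dimensions)
print(cube[("region",)])   # totals per region: contrasts wet vs. dry regions
print(cube[()])            # grand total over all facts
```

Looking up `cube[("region",)]` is a roll-up over season, and it gives the kind of summary that characterization and discrimination compare.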



Association and Correlation Analysis
• Frequent patterns (or frequent itemsets)
  • What items are frequently purchased together in your Wal-Mart?
• Association, correlation vs. causality
  • A typical association rule:
    age(X, “20..29”) ∧ income(X, “20k..30k”) ⇒ buys(X, “CD player”) [support = 2%, confidence = 60%]
  • Are strongly associated items also strongly correlated?
• How to mine such patterns and rules efficiently in large datasets?
• How to use such patterns for classification, clustering, and other applications?
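
Support and confidence for a rule A ⇒ B can be computed directly from a list of transactions. The sketch below uses made-up baskets, and it also computes lift, a common correlation measure that the slide does not name but that speaks to the question of whether associated items are also correlated.

```python
# Hypothetical market-basket transactions (each one is a set of items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's support.

    Values near 1 suggest the two sides are independent, above 1 a
    positive correlation, below 1 a negative one.
    """
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

a, c = {"bread"}, {"milk"}
print(support(a | c, transactions))
print(confidence(a, c, transactions))
print(lift(a, c, transactions))
```

In this toy data the rule bread ⇒ milk has fairly high support and confidence, yet its lift is slightly below 1, so a strongly associated pair need not be positively correlated.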
Classification and Prediction
• Classification and prediction
  • Construct models (functions) based on some training examples
  • Describe and distinguish classes or concepts for future prediction
    • E.g., classify countries based on climate, or classify cars based on gas mileage
  • Predict some unknown or missing numerical values
• Typical methods
  • Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
• Typical applications
  • Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
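
As a minimal illustration of building a model from training examples, the sketch below learns a one-level decision tree (a decision stump) that classifies cars by gas mileage. The training data, the class labels, and the function names are assumptions made for the example; it is not a full decision tree learner.

```python
# Hypothetical training examples: (miles per gallon, class label).
training = [(15, "low"), (18, "low"), (22, "low"),
            (30, "high"), (34, "high"), (41, "high")]

def classify(mpg, threshold):
    """Apply the learned model to a single car."""
    return "high" if mpg > threshold else "low"

def train_stump(examples):
    """Pick the threshold that misclassifies the fewest training examples."""
    values = sorted(v for v, _ in examples)
    # Candidate thresholds: midpoints between consecutive attribute values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(1 for v, label in examples
                   if classify(v, threshold) != label)

    return min(candidates, key=errors)

threshold = train_stump(training)
print(threshold, classify(25, threshold), classify(28, threshold))
```

The learned threshold is the whole model here. Methods such as full decision trees or support vector machines follow the same train-then-predict pattern with richer models.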



Cluster and Outlier Analysis
• Cluster analysis
  • Unsupervised learning (i.e., the class label is unknown)
  • Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
  • Principle: maximize intra-class similarity and minimize inter-class similarity
  • Many methods and applications
• Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • Noise or exception? One person’s garbage could be another person’s treasure
  • Methods: by-product of clustering or regression analysis, …
  • Useful in fraud detection and rare-event analysis
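
The clustering principle, and outlier detection as a by-product of clustering, can be sketched with a plain one-dimensional k-means. The house prices, the initial centroids, and the distance cutoff are hypothetical values chosen for the example; k-means is one of many clustering methods, not the one the slides prescribe.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on numbers: assign to nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def find_outliers(points, centroids, max_distance):
    """Outliers as a by-product: points far from every cluster centroid."""
    return [p for p in points
            if min(abs(p - c) for c in centroids) > max_distance]

# Hypothetical house prices (in thousands) and starting centroids.
prices = [100, 105, 110, 300, 310, 320, 600]
centroids, clusters = kmeans_1d(prices, centroids=[100, 300])
unusual = find_outliers(prices, centroids, max_distance=150)
print(centroids, clusters, unusual)
```

Whether the flagged price is noise or a genuinely interesting exception is a judgment the analyst still has to make.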
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, surrounded by database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines]



Data mining overlaps with many disciplines
• Statistics
• Machine learning
• Information retrieval (Web mining)
• Distributed computing
• Database systems
Classification of Data Mining Systems
• Different views lead to different classifications
  • Data view: kinds of data to be mined
  • Knowledge view: kinds of knowledge to be discovered
  • Method view: kinds of techniques utilized
  • Application view: kinds of applications adapted



Classification of Data Mining Systems
• Classification according to the kinds of databases mined
  • By data model: relational, data warehouse, transactional, stream, object-oriented/relational
  • By type of data handled: spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
• Classification according to the kinds of knowledge mined
  • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  • Multiple/integrated functions and mining at multiple levels
• Classification according to the kinds of techniques utilized
  • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
• Classification according to the applications adapted
  • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.



Data Mining Task Primitives

A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during discovery to
direct the mining process or examine the findings from different angles or
depths. The data mining primitives specify the following:
1. Set of task-relevant data to be mined.
2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.
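
One way to picture a data mining query is as a record with one field per primitive. The `MiningTask` class, its field names, and the example values below are hypothetical and only illustrate the shape of such a query; they are not the syntax of any real data mining query language.

```python
from dataclasses import dataclass, field

@dataclass
class MiningTask:
    """A data mining query described by the five task primitives."""
    relevant_data: list                               # 1. attributes or dimensions of interest
    knowledge_kind: str                               # 2. e.g. "association", "classification"
    background: dict = field(default_factory=dict)    # 3. e.g. concept hierarchies
    thresholds: dict = field(default_factory=dict)    # 4. interestingness thresholds
    presentation: str = "table"                       # 5. how to display the patterns

task = MiningTask(
    relevant_data=["age", "income", "item"],
    knowledge_kind="association",
    background={"city": "country"},
    thresholds={"min_support": 0.02, "min_confidence": 0.6},
    presentation="rules",
)
print(task)
```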



Data Mining Task Primitives
• 1. The set of task-relevant data to be mined
  This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (the relevant attributes or dimensions).
• 2. The kind of knowledge to be mined
  This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
• 3. The background knowledge to be used in the discovery process
  This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and evaluating the patterns found. Concept hierarchies are a popular form of background knowledge; they allow data to be mined at multiple levels of abstraction.
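
A concept hierarchy can be as simple as a mapping from each value to its parent at the next level of abstraction. In the sketch below, the city-to-country mapping, the sales figures, and the `roll_up` helper are assumptions made for the example.

```python
# Hypothetical concept hierarchy for a location attribute: city -> country.
city_to_country = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada"}

# Hypothetical sales recorded at the city level.
sales = [("Chicago", 100), ("New York", 250), ("Toronto", 80)]

def roll_up(rows, hierarchy):
    """Aggregate a measure at the next higher level of the hierarchy."""
    totals = {}
    for value, amount in rows:
        parent = hierarchy[value]
        totals[parent] = totals.get(parent, 0) + amount
    return totals

print(roll_up(sales, city_to_country))
```

Mining the rolled-up data looks for patterns at the country level, while the original rows support patterns at the city level.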



Data Mining Task Primitives
• 4. The interestingness measures and thresholds for pattern evaluation
  Different kinds of knowledge may have different interestingness measures. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
• 5. The expected representation for visualizing the discovered patterns
  This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.
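
Threshold-based pattern evaluation amounts to a filter over the mined rules. The rules and their numbers below are invented, and the sketch assumes the common reading that a rule must meet both the support and the confidence threshold to be kept.

```python
# Hypothetical mined rules: (rule text, support, confidence).
rules = [
    ("age 20..29 and income 20k..30k => buys CD player", 0.02, 0.60),
    ("buys bread => buys milk", 0.30, 0.75),
    ("buys pen => buys ink", 0.01, 0.90),
    ("buys milk => buys eggs", 0.25, 0.40),
]

def interesting(rules, min_support, min_confidence):
    """Keep only rules that meet both user-specified thresholds."""
    return [r for r in rules
            if r[1] >= min_support and r[2] >= min_confidence]

kept = interesting(rules, min_support=0.02, min_confidence=0.5)
for text, sup, conf in kept:
    print(f"{text} [support={sup:.0%}, confidence={conf:.0%}]")
```

Raising either threshold shrinks the result, which is how the user steers the search toward fewer, stronger patterns.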



Major Issues in Data Mining
• Mining methodology
  • Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  • Handling noise and incomplete data: data cleaning and data analysis methods that can handle noise are required, as are outlier mining methods for the discovery and analysis of exceptional cases
  • Incorporation of background knowledge: domain knowledge is required to guide the discovery process and to express patterns in concise terms and at different levels of abstraction
  • Pattern evaluation: the interestingness problem
  • Performance: efficiency, effectiveness, and scalability; the running time of a data mining algorithm must be predictable and acceptable
  • Parallel, distributed, and incremental mining methods
  • Integration of the discovered knowledge with existing knowledge



Major Issues in Data Mining
• User interaction
  • Data mining query languages and ad hoc mining
    • Data mining query languages need to be developed to allow users to describe ad hoc data mining tasks
  • Expression and visualization of data mining results
  • Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
  • Domain-specific data mining and invisible data mining
    • E.g., companies like Amazon keep track of customer profiles
  • Protection of data security, integrity, and privacy
    • Need to observe data sensitivity and preserve people's privacy while performing successful data mining
