Data Mining:
Concepts and Techniques
— Unit I: Chapter 1 —
Unit I: Chapter 1. Introduction
Data Mining Fundamentals:
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Classification of data mining systems
Top-10 most popular data mining algorithms
Major issues in data mining
Overview of the course
November 12, 2024 Data Mining: Concepts and Techniques 2
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized
society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
Evolution of Database Technology
1960s:
Data collection, database creation, IMS (hierarchical) and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
Need of Data Mining
• Information technology has made huge amounts of data available, and
these data need to be turned into useful information.
• This information can then be used in applications such as market
analysis, fraud detection, customer retention, production control,
and science exploration.
Data Mining Applications
Market Analysis and Management
Corporate Analysis & Risk Management
Fraud Detection
Customer Retention
Financial Banking
Future Healthcare
Criminal Investigation
Market-Basket analysis
Other Applications
Market Analysis and Management
Data mining is used in the following areas of marketing:
• Customer profiling - data mining helps to determine what kinds of people buy
what kinds of products.
• Identifying customer requirements - data mining helps to identify the
best products for different customers. It uses prediction to find the factors that
may attract new customers.
• Cross-market analysis - data mining finds associations/correlations
between product sales.
• Target marketing - data mining helps to find clusters of model customers
who share the same characteristics, such as interests, spending habits, and income.
• Determining customer purchasing patterns - data mining helps to
determine customer purchasing patterns.
• Providing summary information - data mining provides various
multidimensional summary reports.
Corporate Analysis & Risk Management
Data mining is used in the following areas of the corporate sector:
• Finance planning and asset evaluation - involves cash flow analysis
and prediction, and contingent claim analysis to evaluate assets.
• Resource planning - involves summarizing and comparing resources
and spending.
• Competition - involves monitoring competitors and market directions.
Fraud Detection
• Data mining is also used in credit card services and
telecommunications to detect fraud. For fraudulent telephone calls, it helps to
examine the destination of the call, the duration of the call, and the time of
day or week. It also analyzes patterns that deviate from expected norms.
Other Applications
• Data mining is also used in other fields such as sports, astronomy, and Internet
Web Surf-Aid.
What Is Data Mining?
Data mining (knowledge discovery from data)
Extracting or “mining” knowledge from large amounts of data
Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge Discovery (mining) in Databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”? No; the following are not data mining:
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process.
[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Seven steps in the KDD process
Data cleaning: noise and inconsistent data are removed
Data integration: multiple data sources may be combined
Data selection: data relevant to the analysis task are retrieved
from the database
Data transformation: data are transformed and consolidated
into forms appropriate for mining, for example by summary or
aggregation operations
Data mining: the essential step, where intelligent methods are
applied to extract data patterns
Pattern evaluation: the truly interesting patterns representing
knowledge are identified, based on interestingness measures
Knowledge presentation: visualization and knowledge
representation techniques are used to present the mined knowledge to
users
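The seven steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the function names, and the choice of "item support" as the mined pattern are all assumptions made for the example, not part of the KDD definition.

```python
from collections import Counter

# Hypothetical raw records from two sources: (customer, item, amount).
# None marks a missing value; the data is invented for illustration.
store_a = [("ann", "milk", 2.0), ("bob", "bread", None), ("ann", "bread", 1.5)]
store_b = [("cat", "milk", 2.1), ("bob", "milk", 2.0), ("cat", "bread", 1.4)]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if None not in r]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, items):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r[1] in items]


def transform(records):
    """Data transformation: aggregate to one item set per customer."""
    baskets = {}
    for customer, item, _amount in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets


def mine(baskets):
    """Data mining: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)


def evaluate(counts, n_baskets, min_support):
    """Pattern evaluation: keep items whose support meets the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}


def present(patterns):
    """Knowledge presentation: format patterns as readable lines."""
    return [f"{item}: support={s:.2f}" for item, s in sorted(patterns.items())]


records = select(integrate(clean(store_a), clean(store_b)), {"milk", "bread"})
baskets = transform(records)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.6)
report = present(patterns)
print("\n".join(report))
```

In practice the steps are iterative rather than strictly linear; the single pass here is a simplification.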
Architecture: Typical Data Mining System
Database, data warehouse, World Wide Web, or other
information repository:
This is one or a set of databases, data warehouses,
spreadsheets, or other kinds of information repositories.
Data cleaning and data integration techniques may be
performed on the data.
Database or data warehouse server: The database or data
warehouse server is responsible for fetching the relevant data,
based on the user’s data mining request.
Knowledge base: This is the domain knowledge that is used to
guide the search or evaluate the interestingness of resulting
patterns.
Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of
abstraction.
Data mining engine: This is essential to the data mining
system and ideally consists of a set of functional modules for
tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier
analysis, and evolution analysis.
Pattern evaluation module: This component typically employs
interestingness measures and interacts with the data mining
modules so as to focus the search toward interesting patterns.
User interface: This module communicates between users and
the data mining system, allowing the user to interact with the
system by specifying a data mining query or task, providing
information to help focus the search, and performing
exploratory data mining based on the intermediate data mining
results.
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as terabytes
High-dimensionality of data
Microarray data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Relational Databases:
It is a collection of tables
These databases can be accessed by database queries
Data Warehouse:
It is a repository of information collected from multiple
sources, stored under a unified schema, and usually
residing at a single site
It is constructed via a process of data cleaning, data
transformation, data integration, data loading and
periodic data refreshing.
It is usually modeled by a multidimensional data cube.
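As a rough illustration of the multidimensional data cube, the sketch below sums a measure over every combination of dimensions. The fact rows, the dimension names, and the `build_cube` helper are hypothetical; real warehouses rely on far more efficient materialization strategies.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (city, quarter, item, sales).
facts = [
    ("Pune", "Q1", "phone", 100),
    ("Pune", "Q2", "phone", 150),
    ("Delhi", "Q1", "phone", 200),
    ("Delhi", "Q1", "laptop", 300),
]
DIMENSIONS = ("city", "quarter", "item")


def build_cube(rows):
    """Sum sales for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), k):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[-1]   # the last field is the sales measure
            cube[tuple(DIMENSIONS[d] for d in dims)] = dict(cuboid)
    return cube


cube = build_cube(facts)
print(cube[()])                 # grand total over all dimensions
print(cube[("city",)])          # sales by city
print(cube[("city", "item")])   # sales by city and item
```

With three dimensions the cube holds 2^3 = 8 cuboids, which is why the number of aggregates grows quickly as dimensions are added.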
Transactional Databases:
A transactional database consists of a file where each
record represents a transaction.
Object-relational databases: constructed based on the object-relational
data model
Spatial databases: contain spatial-related information, such as
geographic databases and medical and satellite image databases.
Spatial data may be represented in raster format; maps may be
represented in vector format.
Temporal databases and time-series databases:
They store time-related data. Temporal databases store relational
data that include time-related attributes (possibly involving several
timestamps). Time-series databases store sequences of values
that change with time.
Text Databases and Multimedia Databases:
Text databases are databases that contain word descriptions for
objects, such as bug reports and summary reports.
Multimedia databases store image, audio, and video data.
For multimedia database mining, storage and search techniques
need to be integrated with standard data mining methods (such as
the construction of data cubes).
Heterogeneous Databases and Legacy Databases:
A legacy database is a group of heterogeneous databases that
combines different kinds of data systems.
World Wide Web:
Understanding user access patterns (mining path traversal patterns)
Data Mining Functionalities
The following are the data mining functionalities and the kinds of
patterns they discover:
Class/Concept Description: Characterization and
Discrimination
Mining frequent patterns: Association and Correlation
Analysis
Classification and Prediction
Cluster Analysis
Outlier Analysis
Evolution Analysis
Class/Concept Description: Characterization and
Discrimination
Information integration and data warehouse
construction
Data cleaning, transformation, integration, and
multidimensional data model
Data cube technology
Scalable methods for computing (i.e.,
materializing) multidimensional aggregates
OLAP (online analytical processing)
Multidimensional concept description:
Characterization and discrimination
Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
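Characterization and discrimination can be illustrated with per-class summaries. The sketch below is a minimal example under stated assumptions: the regions, their rainfall and humidity values, and the two helper functions are invented for illustration.

```python
from statistics import mean

# Hypothetical regions: (name, class label, annual rainfall in mm, mean humidity in %).
regions = [
    ("R1", "dry", 180, 25),
    ("R2", "dry", 220, 30),
    ("R3", "wet", 1500, 80),
    ("R4", "wet", 1700, 75),
]


def characterize(rows, label):
    """Characterization: summarize the general features of one target class."""
    members = [r for r in rows if r[1] == label]
    return {
        "count": len(members),
        "mean_rainfall": mean(r[2] for r in members),
        "mean_humidity": mean(r[3] for r in members),
    }


def discriminate(rows, target, contrast):
    """Discrimination: contrast the target class summary with another class."""
    t, c = characterize(rows, target), characterize(rows, contrast)
    return {k: t[k] - c[k] for k in t if k != "count"}


dry = characterize(regions, "dry")
diff = discriminate(regions, "wet", "dry")
print(dry)
print(diff)
```

Characterization describes one class on its own, while discrimination reports how the target class differs from the contrasting class.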
Association and Correlation Analysis
Frequent patterns (or frequent item sets)
What items are frequently purchased together in your Wal-Mart?
Association, correlation vs. causality
A typical association rule
age(X, "20..29") ∧ income(X, "20K..30K") ⇒ buys(X, "CD player")
[support = 2%, confidence = 60%]
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large
datasets?
How to use such patterns for classification, clustering, and
other applications?
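Support and confidence for a rule A ⇒ B can be computed directly from a list of transactions. The sketch below uses invented baskets and also computes lift, a correlation measure that the slide does not name but that is one common way to approach the question of whether strongly associated items are also correlated.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)


def confidence(antecedent, consequent, db):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, db) / support(antecedent, db)


def lift(antecedent, consequent, db):
    """Confidence divided by the consequent's support; 1.0 suggests independence."""
    return confidence(antecedent, consequent, db) / support(consequent, db)


a, c = {"bread"}, {"milk"}
print(f"support    = {support(a | c, transactions):.2f}")
print(f"confidence = {confidence(a, c, transactions):.2f}")
print(f"lift       = {lift(a, c, transactions):.2f}")
```

Here bread ⇒ milk has support 0.60 and confidence 0.75, yet its lift is slightly below 1, so on this toy data a fairly confident rule does not imply a positive correlation.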
Classification and Prediction
Classification and prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
Predict some unknown or missing numerical values
Typical methods
Decision trees, naïve Bayesian classification, support vector machines,
neural networks, rule-based classification, pattern-based classification,
logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars, diseases,
web-pages, …
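To make "construct a model from training examples" concrete, the sketch below learns a one-rule classifier (a decision stump, the simplest form of decision tree) for the cars and gas mileage example. The training data and class labels are hypothetical, and the stump assumes that the "high" class lies above the threshold.

```python
# Hypothetical training examples: (gas mileage in mpg, class label).
training = [(15, "low"), (18, "low"), (22, "low"),
            (30, "high"), (35, "high"), (41, "high")]


def train_stump(examples):
    """Pick the mileage threshold that misclassifies the fewest examples."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds are midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        # An example is wrong when the rule's prediction and its label disagree.
        return sum((x > threshold) != (label == "high") for x, label in examples)

    return min(candidates, key=errors)


def classify(mileage, threshold):
    """Apply the learned one-rule model to a new car."""
    return "high" if mileage > threshold else "low"


threshold = train_stump(training)
print(threshold)
print(classify(28, threshold), classify(20, threshold))
```

The same two phases, learning from labeled examples and then predicting labels for unseen objects, apply to the more capable methods listed above.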
Cluster and Outlier Analysis
Cluster analysis
Unsupervised learning (i.e., Class label is unknown)
Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
Principle: Maximizing intra-class similarity & minimizing
interclass similarity
Many methods and applications
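One widely used clustering method is k-means, sketched below for the house example. The slide does not prescribe a method, so the algorithm choice, the house coordinates, and the naive initialization (the first k points) are assumptions for illustration only.

```python
from math import dist

# Hypothetical house locations (x, y) on a city grid.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            for p in points]


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:               # keep an empty cluster's center in place
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers


def kmeans(points, k, max_iter=100):
    """Plain k-means, initialized with the first k points (a simplification)."""
    centers = list(points[:k])
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centers)
        if new_labels == labels:      # assignments stopped changing: converged
            break
        labels = new_labels
        centers = update(points, labels, centers)
    return labels, centers


labels, centers = kmeans(houses, k=2)
print(labels)
print(centers)
```

No class labels are supplied; the two groups emerge purely from distances between points, which is what makes the task unsupervised.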
Outlier analysis
Outlier: A data object that does not comply with the
general behavior of the data
Noise or exception? ― One person’s garbage could be
another person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
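As a small illustration of flagging objects that deviate from general behavior, the sketch below applies a z-score rule. This is a simple statistical stand-in rather than one of the clustering or regression methods named above; the spending figures and the threshold of 2 standard deviations are assumptions.

```python
from statistics import mean, pstdev

# Hypothetical daily card spend for one customer; 95 is the unusual value.
amounts = [20, 22, 19, 21, 20, 23, 95]


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:                # all values identical: nothing deviates
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


outliers = zscore_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, possible fraud) depends on the application.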
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines]
Data mining overlaps with many disciplines
Statistics
Machine Learning
Information Retrieval (Web mining)
Distributed Computing
Database Systems
Classification of Data Mining Systems
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be
discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Classification of Data Mining Systems
Classification according to the kinds of databases mined
Relational, data warehouse, transactional, stream,
object-oriented/relational (according to data model); spatial, time-series, text,
multimedia, heterogeneous, legacy, WWW (according to type of data handled).
Classification according to the Kinds of Knowledge mined
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Classification according to the Kinds of Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Classification according to the Applications adapted:
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in
terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during discovery to
direct the mining process or examine the findings from different angles or
depths. The data mining primitives specify the following:
1. Set of task-relevant data to be mined.
2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.
Data Mining Task Primitives
1. The set of task-relevant data to be mined
This specifies the portions of the database or the set of data in
which the user is interested. This includes the database attributes or
data warehouse dimensions of interest (the relevant attributes or
dimensions).
2. The kind of knowledge to be mined
This specifies the data mining functions to be performed, such
as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution
analysis.
3. The background knowledge to be used in the discovery process
This knowledge about the domain to be mined is useful for
guiding the knowledge discovery process and evaluating the patterns
found. Concept hierarchies are a popular form of background
knowledge, which allows data to be mined at multiple levels of
abstraction.
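A concept hierarchy can be represented as a mapping from lower-level values to higher-level concepts. The sketch below rolls sales up from city to country; the hierarchy, the records, and the `roll_up` helper are hypothetical.

```python
from collections import Counter

# Hypothetical concept hierarchy for a location attribute: city -> country.
CITY_TO_COUNTRY = {"Pune": "India", "Delhi": "India",
                   "Chicago": "USA", "Boston": "USA"}

# Hypothetical sales records: (city, units sold).
sales = [("Pune", 10), ("Delhi", 5), ("Chicago", 7), ("Boston", 3), ("Pune", 2)]


def roll_up(records, hierarchy):
    """Aggregate units at the level of abstraction given by the mapping."""
    totals = Counter()
    for value, units in records:
        totals[hierarchy[value]] += units
    return dict(totals)


# The identity mapping keeps the data at the city level.
by_city = roll_up(sales, {city: city for city in CITY_TO_COUNTRY})
by_country = roll_up(sales, CITY_TO_COUNTRY)
print(by_city)
print(by_country)
```

Mining the same data at the city level and at the country level can surface patterns that are visible at only one of the two levels.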
Data Mining Task Primitives
4. The interestingness measures and thresholds for pattern evaluation
Different kinds of knowledge may have different interestingness
measures. They may be used to guide the mining process or, after discovery,
to evaluate the discovered patterns. For example, interestingness measures for
association rules include support and confidence. Rules whose support and
confidence values are below user-specified thresholds are considered
uninteresting.
5. The expected representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be
displayed, which may include rules, tables, cross tabs, charts, graphs,
decision trees, cubes, or other visual representations.
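The five primitives can be pictured as the fields of a query object. The sketch below is not a real data mining query language: `MiningQuery`, its field names, and the sample rules are all hypothetical, and only the threshold primitive is actually applied.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container for the five task primitives."""
    relevant_attributes: list                        # 1. task-relevant data
    knowledge_kind: str                              # 2. kind of knowledge to mine
    background: dict = field(default_factory=dict)   # 3. e.g. concept hierarchies
    min_support: float = 0.0                         # 4. interestingness thresholds
    min_confidence: float = 0.0
    presentation: str = "rules"                      # 5. output representation


def interesting(rules, query):
    """Keep rules that meet both user-specified thresholds."""
    return [r for r in rules
            if r["support"] >= query.min_support
            and r["confidence"] >= query.min_confidence]


# Hypothetical mined rules with precomputed measures.
rules = [
    {"rule": "bread => milk", "support": 0.60, "confidence": 0.75},
    {"rule": "eggs => milk", "support": 0.20, "confidence": 1.00},
    {"rule": "milk => butter", "support": 0.20, "confidence": 0.25},
]

query = MiningQuery(
    relevant_attributes=["items"],
    knowledge_kind="association",
    min_support=0.5,
    min_confidence=0.7,
)
kept = interesting(rules, query)
print([r["rule"] for r in kept])
```

Note that the rule with perfect confidence is still dropped because its support is below the threshold; a rule must satisfy both measures to be reported.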
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio,
stream, Web
Handling noise and incomplete data: data cleaning and data analysis
methods that can handle noise are required, as are outlier mining methods
for the discovery and analysis of exceptional cases.
Incorporation of background knowledge: domain knowledge is required
to guide the discovery process and express patterns in concise terms
and at different levels of abstraction.
Pattern evaluation: the interestingness problem
Performance: efficiency, effectiveness, and scalability: the running time of
a data mining algorithm must be predictable and acceptable.
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing one
Major Issues in Data Mining
User interaction
Data mining query languages and ad-hoc mining
Data mining query languages need to be developed to allow users to
describe ad hoc data mining tasks.
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
E.g., companies like Amazon keep track of customer profiles
Protection of data security, integrity, and privacy
Need to observe data sensitivity and preserve people's privacy while
performing successful data mining.