
1 DMiningKuliah 1 Introduction

Data mining involves analyzing large amounts of data to discover hidden patterns and relationships. It has the potential to help organizations understand their data better and make more informed decisions. The data mining process involves cleaning and preparing data, applying data mining algorithms to discover patterns, and evaluating and presenting the results. Common data mining techniques include classification, estimation, prediction, clustering, and association rule mining.

Uploaded by

Ricky Chandra
Copyright
© All Rights Reserved

Data Mining

Data Mining 1
Introduction
■ Why data mining?

■ What is Data Mining / Knowledge Discovery in Databases (KDD)?

■ Origins of Data Mining

■ Potential Applications

■ Data Mining: On what kind of data?

■ Data Mining Functionalities

■ OLAP Mining System

Data Mining 2
Why Data Mining:
Trends leading to Data Flood
More data is generated:
■ Bank, telecom, other business transactions
...
■ Scientific data: astronomy, biology, etc.
■ Web, text, and e-commerce

Data Mining 3
Scale Of Data

Data Mining 4
Data Growth Rate
■ Twice as much information was created in
2002 as in 1999 (~30% growth rate)
■ Other growth rate estimates even higher
■ And THE PROBLEM IS:
■ Very little data will ever be looked at by a
human
■ We are drowning in data, but starving for
knowledge
■ Knowledge Discovery is NEEDED to make
sense and use of data.
Data Mining 5
Why Mine Data?
■ There is often information “hidden” in the data that is not readily
evident
■ Human analysts may take weeks to discover useful information
■ Much of the data is never analyzed at all

Data Mining 6
Why Mine Data?

Data Mining 7
What Is Data Mining:
Many Names of Data Mining
■ Data Fishing, Data Dredging: 1960-
■ used by statisticians (as a pejorative)
■ Data Mining: 1990-
■ used by the database and business communities
■ acquired a bad image in 2003 because of TIA (Total Information Awareness)
■ Knowledge Discovery in Databases: 1989-
■ used by the AI and machine learning communities
■ also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...

Currently: Data Mining and Knowledge Discovery in Databases (KDD) are used interchangeably
Data Mining 8
Knowledge Discovery in Databases (KDD)
■ Knowledge discovery in databases is the non-trivial process of identifying
■ valid
■ novel
■ potentially useful
■ and ultimately understandable
patterns in data.
From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
Data Mining 9
What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com, etc.)
Data Mining 10
Origins of Data Mining
■ Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

Data Mining 11
Data Mining: Confluence of Multiple Disciplines

[Figure: Data mining at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines]
Data Mining 12
What is Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.
[Figure: KDD pipeline, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining 13
Steps of a KDD Process
1. Learning the application domain
■ relevant prior knowledge and goals of application
2. Creating a target data set → data selection
3. Data cleaning and preprocessing (may take 60% of effort!)
4. Data reduction and transformation
■ Find useful features, dimensionality/variable reduction,
invariant representation.
5. Choosing functions of data mining
■ summarization, classification, regression, association,
clustering.
6. Choosing the mining algorithm(s)
7. Data mining → search for patterns of interest
8. Pattern evaluation and knowledge presentation
■ visualization, transformation, removing redundant patterns,
etc.
9. Use of discovered knowledge
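A few of the steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the field names, and the "split at the mean" pattern are hypothetical and stand in for real selection, cleaning, transformation, mining, and evaluation logic.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, age, monthly_spend); None marks a missing value.
raw_records = [
    (1, 25, 120.0),
    (2, None, 80.0),
    (3, 41, 310.0),
    (4, 38, None),
    (5, 52, 290.0),
    (6, 23, 95.0),
]

def select(records):
    """Step 2: keep only the fields relevant to the task (age, spend)."""
    return [(age, spend) for _, age, spend in records]

def clean(rows):
    """Step 3: drop rows with missing values (one simple cleaning policy)."""
    return [row for row in rows if None not in row]

def transform(rows):
    """Step 4: rescale each column to the range [0, 1] (assumes each column varies)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def mine(rows):
    """Step 7: a trivial 'pattern': split customers at the mean scaled spend."""
    threshold = mean(spend for _, spend in rows)
    return ["high" if spend > threshold else "low" for _, spend in rows]

def evaluate(labels):
    """Step 8: summarize the pattern for presentation."""
    return {label: labels.count(label) for label in sorted(set(labels))}

cleaned = clean(select(raw_records))
labels = mine(transform(cleaned))
summary = evaluate(labels)
print(summary)
```

In a real project each function would be far larger, and cleaning alone can dominate the effort, as step 3 notes.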
Data Mining 14
Data Mining and Business Intelligence
Layers in order of increasing potential to support business decisions (bottom to top), with the typical role at each level:
■ Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)
■ Data Warehouses / Data Marts: OLAP, MDA (DBA)
■ Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
■ Data Mining: Information Discovery (Data Analyst)
■ Data Presentation: Visualization Techniques (Business Analyst)
■ Making Decisions (End User)
Data Mining 15
Architecture of a Typical Data
Mining System
[Figure: layers from top to bottom]
■ Graphical user interface
■ Pattern evaluation
■ Data mining engine
■ Database / data warehouse server
■ Data cleaning & data integration, filtering
■ Databases, Data Warehouse
■ Knowledge base (shown beside the layers)
Data Mining 16
What Tasks Can Data Mining
Accomplish?

The most common data mining tasks.


■ Description
■ Classification
■ Estimation
■ Prediction
■ Clustering
■ Association

Data Mining 17
Task 1: Description
■ Find ways to describe patterns and trends
lying within data.
■ For example:
■ A pollster can uncover evidence that those who
have been laid off are less likely to support the
present incumbent in the presidential election.
■ Descriptions of patterns and trends often suggest possible explanations: those who have been laid off are now less well off financially than before the incumbent was elected, and so would tend to prefer an alternative.

Data Mining 18
Task 1: Description
■ The models should be as transparent as
possible.
■ High-quality description can often be accomplished by exploratory data analysis, a graphical method of exploring data in search of patterns and trends.

Data Mining 19
Task 2: Classification
The data mining model examines a large set of records, each record
containing information on the target variable as well as a set of
input or predictor variables.
■ For example, consider the excerpt data set.

■ After it “learns” from the data, the algorithm can classify new records for which no information about income bracket is available.

Data Mining 20
Task 2: Classification
Examples of classification tasks in business and research include:
■ Determining whether a particular credit card transaction is
fraudulent
■ Placing a new student into a particular track with regard to
special needs
■ Assessing whether a mortgage application is a good or bad
credit risk
■ Diagnosing whether a particular disease is present
■ Determining whether a will was written by the actual deceased,
or fraudulently by someone else
■ Classifying type of drug a patient should be prescribed, based
on certain patient characteristics.
■ Etc.

Data Mining 21
Task 2: Classification
■ Common data mining methods
used for classification are:
■ k-nearest neighbor
■ decision tree
■ neural network
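As an illustration of the first method in the list, here is a minimal k-nearest neighbor sketch. The training records, the two predictor variables (age and years of education), and the income-bracket labels are hypothetical, chosen only to echo the income-bracket example; a real application would also rescale the predictors so that no single variable dominates the distance.

```python
from collections import Counter
from math import dist

# Hypothetical training records: (age, years_of_education) -> income bracket.
training = [
    ((25, 12), "low"),
    ((30, 12), "low"),
    ((28, 14), "low"),
    ((45, 16), "high"),
    ((50, 18), "high"),
    ((40, 17), "high"),
]

def knn_classify(record, training, k=3):
    """Return the majority class among the k training records nearest to `record`."""
    neighbors = sorted(training, key=lambda item: dist(item[0], record))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Classify two new records whose income bracket is unknown.
print(knn_classify((27, 13), training))
print(knn_classify((48, 17), training))
```

The algorithm stores the training records and defers all work to classification time, which is why it needs no separate model-building step.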

Data Mining 22
Task 3: Estimation
■ Similar to classification except that the target
variable is numerical rather than categorical.
■ Models are built using “complete” records, which provide the value of the target variable as well as the predictors.
■ Then, for new observations, estimates of the
value of the target variable are made, based
on the values of the predictors.

Data Mining 23
Task 3: Estimation
Examples of estimation tasks in business and research include:
■ Estimating the amount of money a randomly chosen family of
four will spend for back-to-school shopping this fall.
■ Estimating the percentage decrease in rotary-movement
sustained by a National Football League running back with a
knee injury.
■ Estimating the number of points per game that Patrick Ewing will
score when double-teamed in the playoffs.
■ Estimating the grade-point average (GPA) of a graduate student, based on that student’s undergraduate GPA.
■ Estimating a person’s yearly income from a description and personal data, e.g., age, job, home address, etc.
■ Etc.

Data Mining 24
Task 3: Estimation
■ Common data mining methods used for
estimation are:
■ Statistical analysis:
■ Point estimation
■ Confidence interval estimations
■ Simple linear regression
■ Multiple regression
■ Correlation
■ Neural networks
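Simple linear regression, one of the methods listed above, can be sketched with ordinary least squares. The GPA figures below are hypothetical "complete" records invented for illustration; they follow the graduate-GPA example but come from no real data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical "complete" records: undergraduate GPA and graduate GPA.
undergrad = [2.8, 3.0, 3.2, 3.5, 3.8]
graduate = [3.0, 3.2, 3.3, 3.6, 3.9]

intercept, slope = fit_line(undergrad, graduate)

# Estimate the numerical target for a new observation with undergraduate GPA 3.4.
estimate = intercept + slope * 3.4
print(round(estimate, 2))
```

Unlike classification, the output here is a number on a continuous scale rather than a class label.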

Data Mining 25
Task 4: Prediction
Similar to classification and estimation, except that for
prediction, the results lie in the future.
■ For example, predicting the price of a stock three
months in the future.

Data Mining 26
Task 4: Prediction
Examples of prediction tasks in business and research
include:
■ Predicting the price of a stock three months into the
future
■ Predicting the percentage increase in traffic deaths
next year if the speed limit is increased
■ Predicting the winner of this fall’s baseball World
Series, based on a comparison of team statistics
■ Predicting whether a particular molecule in drug
discovery will lead to a profitable new drug for a
pharmaceutical company
Data Mining 27
Task 4: Prediction
■ Any of the methods and techniques
used for classification and estimation
may also be used for prediction. These
include:
■ Statistical methods
■ Neural Networks
■ Decision tree
■ k-nearest neighbor
Data Mining 28
Task 5: Clustering
■ Grouping of records, observations, or cases into
classes of similar objects.
■ A cluster is a collection of records that are similar to
one another, and dissimilar to records in other
clusters.
■ The clustering task does not try to classify, estimate,
or predict the value of a target variable.
■ It seeks to segment the entire data set into relatively homogeneous subgroups or clusters.

Data Mining 29
Task 5: Clustering
■ For example, the PRIZM segmentation system describes every U.S. zip code area in terms of distinct lifestyle types.
■ For illustration, the clusters for zip code 90210, Beverly Hills, California, are:
■ Cluster 01: Blue Blood Estates
■ Cluster 10: Bohemian Mix
■ Cluster 02: Winner’s Circle
■ Cluster 07: Money and Brains
■ Cluster 08: Young Literati

Data Mining 30
Task 5: Clustering
Examples of clustering tasks in business and research
include:
■ Target marketing of a niche product for a
small-capitalization business that does not have a large
marketing budget
■ For accounting auditing purposes, to segment financial
behavior into benign and suspicious categories
■ As a dimension-reduction tool when the data set has
hundreds of attributes
■ For gene expression clustering, where very large
quantities of genes may exhibit similar behavior

Data Mining 31
Task 5: Clustering
Common data mining methods used for
clustering are:
■ Hierarchical clustering (AGNES, DIANA, etc.)
■ Partitional clustering (k-means, PAM, etc.)
■ DBSCAN
■ Kohonen networks
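To make the partitional approach concrete, here is a minimal k-means sketch. The points and the choice of starting centroids are hypothetical; real implementations usually pick starting centroids at random and stop when assignments no longer change, rather than running a fixed number of iterations.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Basic k-means: alternate assignment and centroid update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D records forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No target variable appears anywhere in the code: the grouping comes only from the distances between records.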

Data Mining 32
Task 6: Association
■ Finding which attributes “go together.”
■ Most prevalent in the business world.
■ It is known as affinity analysis or market basket analysis.
■ The task of association seeks to uncover rules for quantifying the relationship between two or more attributes.
Data Mining 33
Task 6: Association
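■ For example, a particular supermarket may find that of the 1000 customers shopping on a Thursday night, 200 bought diapers, and of those 200 who bought diapers, 50 bought beer.
■ Thus, the association rule would be “If buy diapers, then buy beer” with a confidence of 50/200 = 25%.
■ Diapers appear in 200/1000 = 20% of transactions, which is the support as this example states it. Under the more common definition (transactions containing both items), the rule's support is 50/1000 = 5%.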
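The arithmetic of the diapers-and-beer example can be checked directly. The counts come from the example; the variable names are illustrative. Note that the example's 20% is the share of transactions containing diapers (the rule's antecedent), whereas the more common definition of a rule's support counts transactions containing both items, so both figures are computed here.

```python
# Counts taken from the supermarket example.
total_transactions = 1000
with_diapers = 200           # transactions containing diapers
with_diapers_and_beer = 50   # transactions containing both diapers and beer

# Share of transactions containing the antecedent (the example's "support").
antecedent_support = with_diapers / total_transactions
# Share of transactions containing both items (the more common rule support).
rule_support = with_diapers_and_beer / total_transactions
# Among diaper buyers, the share who also bought beer.
confidence = with_diapers_and_beer / with_diapers

print(f"antecedent support = {antecedent_support:.0%}")
print(f"rule support       = {rule_support:.0%}")
print(f"confidence         = {confidence:.0%}")
```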

Data Mining 34
Task 6: Association
Examples of association tasks in business and research
include:
■ Examining the proportion of children whose parents read to
them who are themselves good readers
■ Predicting degradation in telecommunications networks
■ Finding out which items in a supermarket are purchased
together and which items are never purchased together
■ Determining the proportion of cases in which a new drug
will exhibit dangerous side effects
■ Cross-selling analysis of the products.
■ Optimizing the performance of online banner advertisements that present discount offers on various investment products
Data Mining 35
Task 6: Association
Common data mining methods used for
association are:
■ Apriori Algorithm
■ FP-Growth (FP-tree)
■ Generalized Rule Induction Method
■ Etc.

Data Mining 36
Potential Applications
■ Database analysis and decision support
■ Market analysis and management
■ target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
■ Risk analysis and management
■ Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
■ Fraud detection and management
■ Other Applications
■ Text mining (news, email, documents) and Web analysis.
■ Intelligent query answering

Data Mining 37
Market Analysis and Management (1)
■ The Data Sources
■ Sales transactions, credit card transactions, loyalty cards,
discount coupons, customer complaint calls, plus (public)
lifestyle studies
■ Target marketing
■ Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
■ Determine customer purchasing patterns over time
■ Conversion of single to a joint bank account: marriage, etc.
■ Cross-market analysis
■ Associations/correlations between product sales
■ Prediction based on the association information
Data Mining 38
Market Analysis and Management (2)
■ Customer profiling
■ data mining can tell you what types of customers buy what
products (clustering or classification)

■ Identifying customer requirements


■ identifying the best products for different customers
■ use prediction to find what factors will attract new customers
■ Provides summary information
■ various multidimensional summary reports
■ statistical summary information (data central tendency and
variation)
Data Mining 39
Corporate Analysis and Risk Management
■ Finance planning and asset evaluation:
■ cash flow analysis and prediction
■ claim analysis to evaluate assets
■ cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
■ Resource planning:
■ summarize and compare the resources and spending
■ Competition:
■ monitor competitors and market directions
■ group customers into classes and apply a class-based pricing procedure
■ set pricing strategy in a highly competitive market

Data Mining 40
Successful e-commerce – Case Study

Data Mining 41
Fraud Detection and Management (1)
■ Applications
■ widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
■ Approach
■ use historical data to build models of fraudulent behavior and
use data mining to help identify similar instances
■ Examples
■ auto insurance: detect a group of people who stage accidents
to collect on insurance
■ money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
■ medical insurance: detect professional patients, rings of doctors, and rings of references
Data Mining 42
Fraud Detection and Management (2)
■ Detecting inappropriate medical treatment
■ The Australian Health Insurance Commission found that in many cases blanket screening tests were requested (saving AUD 1 million per year).
■ Detecting telephone fraud
■ Telephone call model: destination of the call, duration, time of
day or week. Analyze patterns that deviate from an expected
norm.
■ British Telecom identified discrete groups of callers with
frequent intra-group calls, especially mobile phones, and broke
a multimillion dollar fraud.
■ Retail
■ Analysts estimate that 38% of retail shrink is due to dishonest
employees.
Data Mining 43
Other Applications
■ Sports
■ IBM Advanced Scout analyzed NBA game statistics (shots blocked,
assists, and fouls) to gain competitive advantage for New York Knicks
and Miami Heat
■ Astronomy
■ JPL and the Palomar Observatory discovered 22 quasars with the help
of data mining
■ Internet Web Surf-Aid
■ IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
■ Detecting the spread of diseases, epidemics, pandemics, and plagues.
Data Mining 44
Data Mining: On What Kind of Data?
■ Relational databases
■ Data warehouses
■ Transactional databases
■ Advanced DB and information repositories
■ Object-oriented and object-relational databases
■ Spatial databases
■ Time-series data and temporal data
■ Text databases and multimedia databases
■ Heterogeneous and legacy databases
■ WWW
Data Mining 45
Data Mining Functionalities (1)

■ Concept description
■ Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions

■ Association (correlation and causality)
■ Multi-dimensional vs. single-dimensional association
■ buys(X, “diapers”) → buys(X, “beer”) [0.5%, 60%]
■ age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
■ contains(T, “computer”) → contains(T, “software”) [1%, 75%]

Data Mining 46
Data Mining Functionalities (2)
■ Classification and Prediction
■ Finding models (functions) that describe and distinguish classes
or concepts for future prediction
■ E.g., classify countries based on climate, or classify cars based on gas mileage
■ Presentation: decision-tree, classification rule, neural network
■ Prediction: Predict some unknown or missing numerical values
■ Cluster analysis
■ Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
■ Clustering based on the principle: maximizing the intra-class
similarity and minimizing the interclass similarity
Data Mining 47
Data Mining Functionalities (3)
■ Outlier / Anomaly analysis
■ Outlier: a data object that does not comply with the general behavior of
the data
■ It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
■ Trend and evolution analysis
■ Trend and deviation: regression analysis
■ Sequential pattern mining, periodicity analysis
■ Similarity-based analysis
■ Other pattern-directed or statistical analyses
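Outlier analysis can be illustrated with a simple z-score rule, one of many possible approaches and not necessarily the one the slides have in mind. The call durations and the threshold of 2 standard deviations are hypothetical choices for this sketch.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical call durations in minutes; one call is far longer than the rest.
durations = [3, 4, 5, 4, 3, 5, 4, 6, 5, 95]
print(find_outliers(durations))
```

Because an extreme value inflates both the mean and the standard deviation, this rule can miss outliers in small samples; median-based measures are a common, more robust alternative.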

Data Mining 48
OLAP Mining: An Integration of Data
Mining and Data Warehousing
■ Coupling of data mining systems with DBMS and data warehouse systems
■ On-line analytical mining of data
■ integration of mining and OLAP technologies
■ Interactive mining of multi-level knowledge
■ Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
■ Integration of multiple mining functions
■ e.g., characterized classification, or first clustering and then association
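The OLAP operations named above (drilling, rolling up, slicing) can be sketched on a tiny in-memory fact table. The regions, cities, quarters, and sales figures are hypothetical, and a real OLAP engine would work against a data cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact table: (region, city, quarter, sales).
facts = [
    ("East", "Boston", "Q1", 100),
    ("East", "Boston", "Q2", 120),
    ("East", "New York", "Q1", 200),
    ("West", "Seattle", "Q1", 150),
    ("West", "Seattle", "Q2", 130),
]

def aggregate(rows, key):
    """Sum sales grouped by the dimension values that `key` extracts from a row."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Drill-down view: sales per city.
by_city = aggregate(facts, key=lambda r: r[1])
# Roll-up: climb the hierarchy from city to region.
by_region = aggregate(facts, key=lambda r: r[0])
# Slice: fix one dimension (quarter = Q1), then summarize by region.
q1_by_region = aggregate([r for r in facts if r[2] == "Q1"], key=lambda r: r[0])

print(by_city)
print(by_region)
print(q1_by_region)
```

Mining at different levels of abstraction then amounts to running the same mining function on whichever of these views the analyst has navigated to.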
Data Mining 49
An OLAM Architecture
[Figure: four-layer OLAM architecture, top to bottom]
■ Layer 4, User Interface: User GUI API (mining query in, mining result out)
■ Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, above the Data Cube API
■ Layer 2, MDDB: MDDB and Meta Data
■ Layer 1, Data Repository: Databases and Data Warehouse, reached through the Database API with filtering and integration, data cleaning, and data integration
Data Mining 50
Thanks

Data Mining 51
