Lecture 1 Introduction Updated (1)
Lecture 1
Preliminaries and Overview
Dr Mohammad Mehedi Hassan
Examples
• Europe's Very Long Baseline Interferometry (VLBI)
– has 16 telescopes, each of which produces 1 Gigabit/second
of astronomical data over a 25-day observation session
– storage and analysis are a big problem
• Walmart is reported to have a 24-terabyte database
• AT&T handles billions of calls per day
– the data cannot all be stored, so analysis is done on the fly
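To get a feel for the scale of the VLBI example, the figures can be multiplied out. This is only a rough sketch: it assumes every telescope streams continuously at the full rate for the whole session, which the slide does not state, and it uses decimal petabytes.

```python
# Back-of-the-envelope data volume for the VLBI example.
# Assumption (not stated on the slide): continuous streaming at full rate.
TELESCOPES = 16
RATE_BITS_PER_SECOND = 10**9     # 1 Gigabit/second per telescope
SESSION_DAYS = 25

seconds = SESSION_DAYS * 24 * 60 * 60
total_bits = TELESCOPES * RATE_BITS_PER_SECOND * seconds
total_petabytes = total_bits / 8 / 1e15   # bits -> bytes -> decimal petabytes

print(f"{total_petabytes:.2f} PB for one observation session")
```

Under these assumptions a single session produces a few petabytes, which is why storage and analysis are both a problem.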
Growth Trends
• Moore’s law
– Computer Speed doubles every 18
months
• Storage law
– total storage doubles every 9 months
• Consequence
– very little data will ever be looked at
by a human
• Data mining is NEEDED to make
sense of the data and put it to use.
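The two doubling rates can be compared with a short calculation. The sketch below takes the doubling periods from the slide; the three-year horizon and the helper name `growth_factor` are illustrative assumptions.

```python
def growth_factor(months, doubling_period_months):
    """Multiplicative growth after `months` for a quantity that doubles every period."""
    return 2 ** (months / doubling_period_months)

HORIZON_MONTHS = 36  # illustrative three-year horizon (an assumption)

speed = growth_factor(HORIZON_MONTHS, 18)    # computer speed, doubling every 18 months
storage = growth_factor(HORIZON_MONTHS, 9)   # total storage, doubling every 9 months

print(f"speed x{speed:.0f}, storage x{storage:.0f}, gap x{storage / speed:.0f}")
```

Because storage doubles twice as often, the amount of stored data per unit of computing speed keeps growing, which is the point behind "very little data will ever be looked at by a human".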
• What is data mining?
• Data mining on what kind of data
• Data mining: A KDD process
• What kind of patterns can be mined?
• Data mining system
• Data mining applications
What is data mining
• Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amounts of
data stored in databases, data warehouses, or other
information repositories.
• Alternative names :
– KDD (Knowledge Discovery in Databases),
– data pattern analysis,
– business intelligence
What is data mining
• Data mining can be viewed as a result of the
natural evolution of information technology.
What is (not) Data Mining?
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is (not) Data Mining?
Supermarket Analysis
What is Data Mining:
A supermarket uses data mining to analyze customer
purchase data over time. They look for patterns, like which
products are often bought together (e.g., bread and butter) or
which days certain items sell the most. This helps them
optimize stock levels, create better promotions, and
understand customer preferences.
What is Not Data Mining:
Simply generating a report that shows last month’s total sales
figures without analyzing patterns or trends is not data
mining. It’s just basic data reporting.
What is (not) Data Mining?
Email Spam Filtering
What is Data Mining:
An email service uses data mining to analyze thousands of
emails, identifying patterns that distinguish spam from
legitimate emails. For example, the system might learn that
emails with certain keywords, phrases, or sender addresses
are often flagged as spam. Over time, this helps the service
automatically filter out spam emails for users.
Data mining on what kind of data
• Relational database
1. A relational database system is a collection of tables,
modeled with entity-relationship (ER) diagrams and queried with SQL
Data mining on what kind of data
• Data warehouse
A data warehouse is a repository of data from multiple
heterogeneous sources, organized under a unified schema at a
single site in order to facilitate management
decision making.
Data warehouse
[Figure: data warehouse architecture; surviving labels: data source n, client]
Data mining on what kind of data
• Transactional database
1. A transactional database is a file where each record represents a
transaction, for example:
sales(trans_ID, list of item_IDs)
Knowledge Discovery Process
[Figure: the KDD process as a pipeline: raw data → integration → data warehouse → selection and cleaning → target data → transformation → transformed data → data mining → patterns and rules → interpretation and evaluation → knowledge and understanding]
What kind of Patterns can be mined
(Association analysis)
• Association analysis discovers association rules
showing attribute-value conditions that occur
frequently together in a set of data, e.g. market
basket analysis
• A rule has the form body → head
buys(X, “milk”) → buys(X, “sugar”)
What kind of Patterns can be mined
(Association analysis)
• Itemset X = {x1, …, xk}
• Find all the rules X → Y with min confidence
and support
– support, s, probability that a
transaction contains X ∪ Y
support(X → Y) = P(X ∪ Y)
– confidence, c, conditional probability
that a transaction having X also
contains Y
confidence(X → Y) = P(Y | X)
= support(X ∪ Y) / support(X)
Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F
[Figure: Venn diagram of "Customer buys milk" and "Customer buys sugar", overlapping in "Customer buys both"]
Let min_support = 50%, min_conf = 50%
A → C (50%, 66.7%)
C → A (50%, 100%)
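The support and confidence figures for the rules A → C and C → A can be checked directly against the four transactions on this slide. The following is a minimal sketch; the function names `support` and `confidence` are chosen here for illustration.

```python
# Transactions from the slide: transaction id -> items bought.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(body, head):
    """Estimate of P(head | body) = support(body ∪ head) / support(body)."""
    return support(body | head) / support(body)

for body, head in [({"A"}, {"C"}), ({"C"}, {"A"})]:
    s = support(body | head)
    c = confidence(body, head)
    print(f"{sorted(body)} -> {sorted(head)}: support={s:.1%}, confidence={c:.1%}")
```

Both rules pass the 50% thresholds. They share the same support because support depends only on the combined itemset {A, C}, while confidence depends on which side is the body.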
Association Rule Mining: Application 1
• Marketing and Sales Promotion:
• The rule discovered is: {Bagels, … } --> {Potato Chips}
• This rule means that when customers buy bagels (and possibly other items), they
also tend to buy potato chips. Here’s what each part of the rule implies for
marketing strategies:
• Potato Chips as consequent: This tells us that potato chips are often bought when
customers buy other products like bagels. Knowing this, the store can think of
strategies to increase the sales of potato chips. For instance, they might place
potato chips closer to bagels or advertise them together in promotions to boost
sales.
• Bagels in the antecedent: Since bagels are in the antecedent (the part before the
arrow, which triggers the rule), the store can analyze the impact of bagels on sales
of other products like potato chips. If the store considers discontinuing bagels, this
analysis will help them understand how that decision might affect the sales of
products that are often bought with bagels, like potato chips.
• Bagels in antecedent and Potato chips in consequent: This part of the rule
suggests that selling bagels alongside potato chips can be an effective strategy to
promote the sale of potato chips. The store could use this insight to create bundle
deals, discounts, or marketing campaigns that feature both products, potentially
increasing sales for both.
Association Rule Mining: Application 2
• Supermarket shelf management.
– Goal: To identify items that are bought together
by sufficiently many customers.
– Approach: Process the point-of-sale data
collected with barcode scanners to find
dependencies among items.
– A classic rule:
• If a customer buys diapers and milk, then they are very
likely to buy juice.
Association Rule Mining: Application 3
• Inventory Management:
– Goal: The appliance repair company wants to be more efficient in
fixing consumer products. They aim to understand what kind of
repairs are usually needed for different appliances. This knowledge
will help them ensure that their service vehicles always carry the right
parts. By doing this, they hope to fix the appliances in one visit, rather
than having to make multiple trips to get the correct parts.
– Approach: To achieve this, the company plans to analyze the data from
past repair jobs. This data includes what tools and parts were needed
for each repair job at different locations. By examining this
information, the company can identify patterns. For example, they
might find that washing machines in a particular area often need a
specific type of belt replaced. Recognizing these patterns (called "co-
occurrence patterns") will help them predict what parts are likely to be
needed at future jobs in similar areas or with similar appliances.
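Counting co-occurrences of this kind can be sketched in a few lines. The repair log below is entirely hypothetical (appliances, parts and counts are made up for illustration), and `part_rate` is a name introduced here, not part of any real system.

```python
from collections import Counter

# Hypothetical repair log: (appliance, set of parts used on that job).
repair_jobs = [
    ("washing machine", {"belt", "pump"}),
    ("washing machine", {"belt"}),
    ("washing machine", {"belt", "door seal"}),
    ("dryer", {"heating element"}),
    ("dryer", {"belt", "heating element"}),
]

# Count how often each (appliance, part) pair occurs together.
pair_counts = Counter(
    (appliance, part) for appliance, parts in repair_jobs for part in parts
)
jobs_per_appliance = Counter(appliance for appliance, _ in repair_jobs)

def part_rate(appliance, part):
    """Share of past jobs on `appliance` that needed `part`."""
    return pair_counts[(appliance, part)] / jobs_per_appliance[appliance]

print(part_rate("washing machine", "belt"))
print(part_rate("dryer", "belt"))
```

A high rate for a pair suggests stocking that part on vehicles sent to jobs for that appliance; a real analysis would also condition on location, as the slide describes.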
What kind of Patterns can be mined
(Classification and Prediction)
Training Dataset
age     income  student  leasing_rating  buys_computer
<=30    high    no       fair            no
<=30    high    no       excellent       no
31…40   high    no       fair            yes
>40     medium  no       fair            yes
>40     low     yes      fair            yes
>40     low     yes      excellent       no
31…40   low     yes      excellent       yes
<=30    medium  no       fair            no
<=30    low     yes      fair            yes
>40     medium  yes      fair            yes
<=30    medium  yes      excellent       yes
31…40   medium  no       excellent       yes
31…40   high    yes      fair            yes
>40     medium  no       excellent       no
Output: A Decision Tree for “buys_computer”
[Figure: decision tree for the target class buys_computer. The root tests age?, with branches <=30, 31..40 and >40; the leaves are labelled no or yes.]
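The extracted figure keeps only the root test on age and the leaf labels, so the sketch below fills in the two inner tests (student for <=30, leasing_rating for >40) as an assumption inferred from the training table. It then checks how many of the 14 training rows these rules reproduce. The function name `predict` is illustrative.

```python
# Training rows from the slide:
# (age, income, student, leasing_rating, buys_computer)
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def predict(age, student, leasing_rating):
    """Decision tree rooted at age; the inner tests are assumed, not from the slide."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # Remaining branch: age > 40
    return "yes" if leasing_rating == "fair" else "no"

correct = sum(
    predict(age, student, rating) == label
    for age, _income, student, rating, label in ROWS
)
print(f"{correct}/{len(ROWS)} training rows classified correctly")
```

Note that income is never tested: a tree only needs the attributes that help separate the classes.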
Classification: Application 1
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to
buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise.
• This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
Classification: Application 2
• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and the information on its
account-holder as attributes.
– When does a customer buy, what does he buy, how often he
pays on time, etc
• Label past transactions as fraud or fair transactions.
This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
What kind of Patterns can be mined
(Cluster Analysis)
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no
predefined classes
• Typical applications
– As a stand-alone tool to get insight into data
distribution
Clustering: Application 1
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.
Clustering: Application 2
• Document Clustering:
– Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
– Approach:
• To identify frequently occurring terms in each
document.
• Form a similarity measure based on the frequencies
of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the
clusters to relate a new document or search
term to clustered documents.
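One common way to turn term frequencies into a similarity measure is cosine similarity. The slide does not say which measure is used, so this choice is an assumption, and the three short documents are made up for illustration.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Lower-case the text and count whitespace-separated terms."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents: two financial, one sports.
docs = {
    "fin1": "stocks fall as bank shares drop",
    "fin2": "bank shares rise as stocks recover",
    "sport": "team wins final match",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

print(cosine_similarity(tf["fin1"], tf["fin2"]))   # shares four of six terms
print(cosine_similarity(tf["fin1"], tf["sport"]))  # shares no terms
```

A clustering algorithm would then group documents whose pairwise similarity is high, so the two financial documents would fall in one cluster and the sports document in another.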
Illustrating Document Clustering
• Clustering Points: 3204 Articles of Los Angeles Times.
• Similarity Measure: How many words are common in these
documents (after some word filtering).
Category    Total Articles   Correctly Placed
Financial   555              364
National    273              36
Clustering of S&P 500 Stock Data
• Observe Stock Movements every day.
• Clustering points: Stock-{UP/DOWN}
• Similarity Measure: Two points are more similar if the events
described by them frequently happen together on the same
day.
• We used association rules to quantify a similarity measure.
Discovered Clusters (Industry Group)
1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Clustering Complex (social) network
• Complex networks are large
networks where local behavior
generates non-trivial global
features.
• Network Clustering
• Clustering coefficients – how
well connected?
• What does a complex network
look like when you can really
see it?
• Community discovery-separate
into densely connected subsets
• Automatic discovery of
communities
• Split by interest or meaning
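The clustering coefficient mentioned above can be made concrete with the usual local definition: for a node, the fraction of pairs of its neighbours that are themselves connected. The small graph below is hypothetical, and the slide does not specify which variant of the coefficient it has in mind.

```python
from itertools import combinations

# Hypothetical undirected network as an adjacency map.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def clustering_coefficient(node):
    """Fraction of pairs of neighbours of `node` that are directly linked."""
    neighbours = sorted(graph[node])
    if len(neighbours) < 2:
        return 0.0   # no neighbour pairs to compare
    pairs = list(combinations(neighbours, 2))
    linked = sum(1 for u, v in pairs if v in graph[u])
    return linked / len(pairs)

for node in sorted(graph):
    print(node, clustering_coefficient(node))
```

Nodes inside a tightly knit community score close to 1, which is one signal community-discovery methods can use.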
Regression
• Predict a value of a given continuous valued variable
based on the values of other variables, assuming a
linear or nonlinear model of dependency.
• Extensively studied in statistics and neural networks.
• Examples:
– Predicting sales amounts of a new product based on
advertising expenditure.
– Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
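For the linear case, the first example can be sketched with ordinary least squares on one input variable. The spend and sales numbers are hypothetical (generated from y = 2x + 1 so the fit is easy to check), and `fit_line` is a name introduced for illustration.

```python
# Hypothetical data: advertising spend (x) and resulting sales (y).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]   # generated as y = 2x + 1

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(spend, sales)
predicted = slope * 6.0 + intercept   # predicted sales for a spend of 6.0

print(slope, intercept, predicted)
```

Real data would not lie exactly on a line, so the fitted model would carry some error; a nonlinear model follows the same idea with a different functional form.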
Data Mining System
[Figure: architecture of a data mining system; surviving labels: graphical user interface, pattern evaluation, databases, data warehouse]
Confluence of Multiple Disciplines
[Figure: data mining at the centre, drawing on the surrounding disciplines]
• Database systems
• Statistics
• Machine learning
• Visualization
• Algorithms
• Other disciplines
Major Application Areas for
Data Mining Solutions
• Advertising
• Customer Relationship Management (CRM)
• Database Marketing
• Fraud Detection
• eCommerce
• Health Care
• Investment/Securities
• Manufacturing, Process Control
• Sports and Entertainment
• Telecommunications
• Web
• Bioinformatics
Case Study: Search Engines
• Early search engines used mainly keywords
on a page and were subject to manipulation
• Google's success is due to its algorithm,
which uses mainly links to the page
• Google founders Sergey Brin and Larry
Page were students at Stanford doing
research in databases and data mining in
1998, which led to Google
Case Study: Direct Marketing and CRM
• Most major direct marketing companies are
using modeling and data mining
• Most financial companies are using customer
modeling
• Modeling is easier than changing customer
behaviour
• Some successes (Homework)
– Verizon Wireless reduced churn rate from 2% to
1.5%
Case Study:
Security and Fraud Detection
• Credit Card Fraud Detection
• Money laundering
– FAIS (US Treasury)
• Securities Fraud
– NASDAQ Sonar system
• Phone fraud
– AT&T, Bell Atlantic, British
Telecom/MCI
• Bio-terrorism detection at Salt Lake
Olympics 2002
Data Mining with Privacy
• Data mining is about finding patterns in large sets of data to gain insights
or make predictions, not about tracking individual people.
• Protecting Privacy: Here's how privacy can be maintained while using
data mining:
• Replacing Personal Data: Instead of using sensitive personal details like
names or addresses, these are replaced with anonymous identifiers. This
means the data can be used without revealing who it belongs to.
• Randomized Outputs: Sometimes, data mining systems are designed to
provide outputs (results) that are slightly randomized. This helps to ensure
that the results can't be used to figure out personal details about the people
in the data.
• Multi-party Computation: This is a method where data is distributed
across different locations or parties. No single party has access to all the
information. They can work together to perform calculations or analyses
without actually sharing the sensitive data they each hold.
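The multi-party idea can be illustrated with additive secret sharing, one simple building block for such computations. This is a toy sketch, not a secure protocol: the three parties, their private values and the modulus are all assumptions made for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus, assumed larger than the true total

def make_shares(value, n_parties):
    """Split `value` into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

# Hypothetical private inputs, e.g. counts held by three organisations.
private_values = [12, 7, 30]
n = len(private_values)

# Each party splits its value and hands one share to every party.
all_shares = [make_shares(v, n) for v in private_values]

# Party j sees only the j-th share from each party and publishes their sum.
partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]

# Combining the published partial sums reveals the total, not the inputs.
total = sum(partial_sums) % MODULUS
print(total)
```

Each individual share is a uniformly random number, so no single party learns another party's value, yet the combined result equals the sum of all private values.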
Summary
• Data mining: discovering interesting patterns from large
amounts of data
• KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation,
and knowledge presentation
• Mining can be performed in a variety of information
repositories
• Data mining functionalities: association, classification,
clustering, outlier and trend analysis, etc.