
Lecture 1 Introduction Updated (1)

The document is an introductory lecture on data mining, covering its definition, trends leading to data generation, and the necessity of data mining in understanding large datasets. It explains the knowledge discovery process, types of data suitable for mining, and various data mining techniques such as association analysis, classification, and clustering. Additionally, it provides examples of applications in marketing, fraud detection, and inventory management.


King Saud University

College of Computer & Information Sciences

IS 463 Introduction to Data Mining

Lecture 1
Preliminaries and Overview
Dr Mohammad Mehedi Hassan

The slide content is derived and adapted from many references


Trends leading to Data Flood
• More data is generated:
– Bank, telecom, and other business transactions ...
– Scientific data: astronomy, biology, etc.
– Web, text, and e-commerce
• More data is captured:
– Storage technology is faster and cheaper
– DBMSs are capable of handling bigger databases

2
Examples
• Europe's Very Long Baseline Interferometry (VLBI)
– has 16 telescopes, each of which produces 1 Gigabit/second
of astronomical data over a 25-day observation session
– storage and analysis are a big problem
• Walmart is reported to have a 24-terabyte database
• AT&T handles billions of calls per day
– data cannot be stored -- analysis is done on the fly
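
To get a feel for the VLBI figures above, the arithmetic can be sketched as follows. This assumes (the slide does not say so) that all 16 telescopes record continuously for the full 25 days, and it uses decimal units (1 PB = 10^15 bytes).

```python
# Rough data volume of one VLBI observation session.
# Assumption: every telescope records non-stop for the whole session.
TELESCOPES = 16
RATE_BITS_PER_SECOND = 10**9      # 1 Gigabit/second per telescope
SESSION_DAYS = 25

session_seconds = SESSION_DAYS * 24 * 60 * 60
total_bits = TELESCOPES * RATE_BITS_PER_SECOND * session_seconds
total_petabytes = total_bits / 8 / 1e15   # bits -> bytes -> petabytes

print(f"{total_petabytes:.2f} PB per session")
```

Under these assumptions a single session yields roughly 4.3 petabytes, which is why storage and analysis are described as a big problem.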

3
Growth Trends
• Moore’s law
– Computer speed doubles every 18 months
• Storage law
– Total storage doubles every 9 months
• Consequence
– Very little data will ever be looked at by a human
• Data mining is NEEDED to make sense and use of data.
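
The gap between the two growth laws can be illustrated with a short calculation. The three-year horizon below is an arbitrary choice for illustration; the doubling periods are the ones quoted on the slide.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """How many times a quantity multiplies if it doubles every given period."""
    return 2 ** (months / doubling_period_months)

HORIZON_MONTHS = 36  # illustrative horizon, not from the slide

speed_growth = growth_factor(HORIZON_MONTHS, 18)    # computer speed
storage_growth = growth_factor(HORIZON_MONTHS, 9)   # total storage

print(f"speed x{speed_growth:.0f}, storage x{storage_growth:.0f}")
```

Over three years storage grows 16-fold while speed grows only 4-fold, so the amount of data per unit of processing power keeps increasing.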

4
• What is data mining?
• Data mining on what kind of data
• Data mining : A KDD process
• What kind of patterns can be mined?
• Data mining system
• Data mining applications

5
What is data mining
• Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amounts of
data stored in databases, data warehouses or other
information repositories.
• Alternative names :
– KDD (Knowledge Discovery in Databases),
– data pattern analysis,
– business intelligence

6
What is data mining
• Data mining can be viewed as a result of the
natural evolution of information technology.

• The evolutionary path :


1960s: data collection, database creation, network DBMS
1970s: relational data model (Codd)
1980s: advanced data models (extended relational, OO, deductive, …)
1990s: data analysis and understanding (data mining & data warehousing)
2000s: data mining with a variety of applications, web technology

7
What is (not) Data Mining?
• What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
• What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

8
What is (not) Data Mining?
Supermarket Analysis
What is Data Mining:
A supermarket uses data mining to analyze customer
purchase data over time. They look for patterns, like which
products are often bought together (e.g., bread and butter) or
which days certain items sell the most. This helps them
optimize stock levels, create better promotions, and
understand customer preferences.
What is Not Data Mining:
Simply generating a report that shows last month’s total sales
figures without analyzing patterns or trends is not data
mining. It’s just basic data reporting.

9
What is (not) Data Mining?
Email Spam Filtering
What is Data Mining:
An email service uses data mining to analyze thousands of
emails, identifying patterns that distinguish spam from
legitimate emails. For example, the system might learn that
emails with certain keywords, phrases, or sender addresses
are often flagged as spam. Over time, this helps the service
automatically filter out spam emails for users.
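
As a minimal sketch of how such patterns might be learned, the example below scores a message by the log-odds of its words appearing in spam versus legitimate mail. The technique (a naive-Bayes-style word score with add-one smoothing and equal class priors) and the four training messages are assumptions made for illustration; the slide does not name a specific method.

```python
import math
from collections import Counter

# Hypothetical labelled training messages.
training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

def train(examples):
    """Count how often each word occurs in spam and in legitimate mail."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def spam_score(text, counts):
    """Sum of per-word log-odds; positive means the words look like spam."""
    vocabulary = set(counts["spam"]) | set(counts["ham"])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    score = 0.0
    for word in text.lower().split():
        # Add-one smoothing so unseen words do not produce log(0).
        p_spam = (counts["spam"][word] + 1) / (totals["spam"] + len(vocabulary))
        p_ham = (counts["ham"][word] + 1) / (totals["ham"] + len(vocabulary))
        score += math.log(p_spam / p_ham)
    return score

counts = train(training)
print(spam_score("free prize", counts) > 0)       # spam-like words
print(spam_score("monday meeting", counts) < 0)   # legitimate-looking words
```

The point is that the scores come from patterns across the whole labelled collection, not from a rule written by hand for any single email.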

What is Not Data Mining:
A user manually marking individual emails as spam is not
data mining. It’s just manual filtering without any analysis of
patterns across large datasets.
10
• What is data mining?
• Data mining on what kind of data
• Data mining : A KDD process
• What kind of patterns can be mined?
• Data mining system
• Data mining applications

11
Data mining on what kind of data
• Relational database
1. A relational database system is a collection of tables,
with ER models for modeling and SQL for querying

2. A data mining system may analyze customer data to
predict the credit risk of new customers based on
their income, age and previous credit information

3. A data mining system may detect deviations, such as
items whose sales are far from those expected in
comparison with the previous year.
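
The deviation-detection idea in point 3 can be sketched in a few lines. The item names, sales figures and the 50% threshold are all hypothetical, and the sketch assumes that "expected" sales simply means last year's sales.

```python
# Hypothetical yearly unit sales per item: (previous year, current year).
sales = {
    "printer": (200, 210),
    "scanner": (100, 30),
    "webcam": (50, 120),
}

def find_deviations(sales, threshold=0.5):
    """Return items whose relative change versus last year exceeds the threshold."""
    flagged = {}
    for item, (previous, current) in sales.items():
        change = (current - previous) / previous
        if abs(change) > threshold:
            flagged[item] = change
    return flagged

deviations = find_deviations(sales)
for item, change in deviations.items():
    print(f"{item}: {change:+.0%} versus previous year")
```

A real system would usually compare against a forecast rather than the raw previous-year figure, but the flagging step has the same shape.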

12
Data mining on what kind of data
• Data warehouse
A data warehouse is a repository of multiple heterogeneous
data sources organized under a unified schema at a
single site in order to facilitate management
decision making.

• Data mining can handle many prediction problems

13
Data warehouse

[Figure: data sources 1 … n are cleaned, transformed, integrated and loaded into the data warehouse; clients use the warehouse for querying and analysis.]

14
Data mining on what kind of data
• Transactional database
1. A transactional database is a file where each record
represents a transaction:
sales(trans_ID, list of item_IDs)

trans_ID | list of item_IDs
T1       | I1, I3, I8
…        | …

2. Data mining can answer questions such as “Which items
sold well together?”
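
One simple way to approach "which items sold well together" is to count how often each pair of items shares a transaction. In the sketch below only T1 comes from the slide; T2 to T4 are hypothetical rows added so that the counts are not trivial.

```python
from collections import Counter
from itertools import combinations

# sales(trans_ID, list of item_IDs); T1 is from the slide, the rest are made up.
sales = {
    "T1": ["I1", "I3", "I8"],
    "T2": ["I1", "I3"],
    "T3": ["I1", "I3", "I5"],
    "T4": ["I1", "I3", "I8"],
}

pair_counts = Counter()
for items in sales.values():
    # Sort so that each pair is always counted under the same key.
    pair_counts.update(combinations(sorted(set(items)), 2))

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Pair counting like this is the starting point for the association analysis introduced later in the lecture.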

15
Data mining on what kind of data

• Advanced database and information repository

1. Spatial and temporal data
– Characteristics of houses located near a park
– Change in trend of metropolitan poverty rates based on city
distances from major highways
2. Heterogeneous and legacy database
3. Text databases & WWW

16
• What is data mining?
• Data mining on what kind of data
• Data mining: A KDD process
• What kind of patterns can be mined?
• Data mining system
• Data mining applications

17
Knowledge Discovery Process
[Figure: the KDD process. Raw data are integrated into a data warehouse; selection and cleaning produce the target data; transformation produces the transformed data; data mining extracts patterns and rules; interpretation and evaluation turn these into knowledge and understanding.]

18
• What is data mining?
• Data mining on what kind of data
• Data mining: A KDD process
• What kind of patterns can be mined?
• Data mining system
• Data mining applications

19
What kind of Patterns can be mined
(Association analysis)
• Association analysis discovers association rules
showing attribute-value conditions that occur
frequently together in a set of data, e.g. market
basket
• A rule has the form body → head
buys(X, “milk”) → buys(X, “sugar”)

20
What kind of Patterns can be mined
(Association analysis)
• Itemset X = {x1, …, xk}
• Find all the rules X → Y with min confidence and support
– support, s: probability that a transaction contains both X and Y
support(X → Y) = P(X ∪ Y)
– confidence, c: conditional probability that a transaction having X also contains Y
confidence(X → Y) = P(Y | X) = support({X, Y}) / support({X})

Transaction-id | Items bought
10             | A, B, C
20             | A, C
30             | A, D
40             | B, E, F

[Figure: Venn diagram of customers who buy milk, customers who buy sugar, and customers who buy both.]

Let min_support = 50%, min_conf = 50%:
A → C (50%, 66.7%)
C → A (50%, 100%)
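
The support and confidence figures for the four transactions on this slide can be checked with a short script. The function names are illustrative only; the data and the definitions are the ones given above.

```python
# Transactions from the slide: transaction-id -> items bought.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(body, head):
    """support(body and head together) / support(body)."""
    return support(set(body) | set(head)) / support(body)

print(f"A -> C: support {support({'A', 'C'}):.0%}, confidence {confidence({'A'}, {'C'}):.1%}")
print(f"C -> A: support {support({'A', 'C'}):.0%}, confidence {confidence({'C'}, {'A'}):.1%}")
```

A and C occur together in 2 of 4 transactions (support 50%). A occurs in 3 transactions, so A → C has confidence 2/3; C occurs in 2, so C → A has confidence 2/2. Both rules pass the 50% thresholds.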
21
Association Rule Mining: Application 1
• Marketing and Sales Promotion:
• The rule discovered is: {Bagels, … } --> {Potato Chips}
• This rule means that when customers buy bagels (and possibly other items), they
also tend to buy potato chips. Here’s what each part of the rule implies for
marketing strategies:
• Potato Chips as consequent: This tells us that potato chips are often bought when
customers buy other products like bagels. Knowing this, the store can think of
strategies to increase the sales of potato chips. For instance, they might place
potato chips closer to bagels or advertise them together in promotions to boost
sales.
• Bagels in the antecedent: Since bagels are in the antecedent (the part before the
arrow, which triggers the rule), the store can analyze the impact of bagels on sales
of other products like potato chips. If the store considers discontinuing bagels, this
analysis will help them understand how that decision might affect the sales of
products that are often bought with bagels, like potato chips.
• Bagels in antecedent and Potato chips in consequent: This part of the rule
suggests that selling bagels alongside potato chips can be an effective strategy to
promote the sale of potato chips. The store could use this insight to create bundle
deals, discounts, or marketing campaigns that feature both products, potentially
increasing sales for both.
22
Association Rule Mining: Application 2
• Supermarket shelf management.
– Goal: To identify items that are bought together
by sufficiently many customers.
– Approach: Process the point-of-sale data
collected with barcode scanners to find
dependencies among items.
– A classic rule --
• If a customer buys diapers and milk, then they are very
likely to buy juice.

23
Association Rule Mining: Application 3
• Inventory Management:
– Goal: The appliance repair company wants to be more efficient in
fixing consumer products. They aim to understand what kind of
repairs are usually needed for different appliances. This knowledge
will help them ensure that their service vehicles always carry the right
parts. By doing this, they hope to fix the appliances in one visit, rather
than having to make multiple trips to get the correct parts.

– Approach: To achieve this, the company plans to analyze the data from
past repair jobs. This data includes what tools and parts were needed
for each repair job at different locations. By examining this
information, the company can identify patterns. For example, they
might find that washing machines in a particular area often need a
specific type of belt replaced. Recognizing these patterns (called "co-
occurrence patterns") will help them predict what parts are likely to be
needed at future jobs in similar areas or with similar appliances.
24
What kind of Patterns can be mined
(Classification and Prediction)

• Construct models (functions) that describe and
distinguish classes or concepts for future prediction
– E.g., classify countries based on climate,
– or classify cars based on gas mileage
• Presentation
– decision-tree, classification rule, neural network
• Predict some unknown or missing numerical values

25
Training Dataset
age     | income | student | leasing_rating | buys_computer
<=30    | high   | no      | fair           | no
<=30    | high   | no      | excellent      | no
31…40   | high   | no      | fair           | yes
>40     | medium | no      | fair           | yes
>40     | low    | yes     | fair           | yes
>40     | low    | yes     | excellent      | no
31…40   | low    | yes     | excellent      | yes
<=30    | medium | no      | fair           | no
<=30    | low    | yes     | fair           | yes
>40     | medium | yes     | fair           | yes
<=30    | medium | yes     | excellent      | yes
31…40   | medium | no      | excellent      | yes
31…40   | high   | yes     | fair           | yes
>40     | medium | no      | excellent      | no

26
Output: A Decision Tree for
“buys_computer” Target Class
age?
– <=30: student?
    – no: buys_computer = no
    – yes: buys_computer = yes
– 31..40: buys_computer = yes
– >40: leasing_rating?
    – excellent: buys_computer = no
    – fair: buys_computer = yes
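
The decision tree can be written as a small function and checked against the 14 training rows from the previous slide. This sketch only encodes the finished tree; it does not show how the tree is learned from the data. The age group is spelled "31..40" throughout for consistency.

```python
# Training rows: (age, income, student, leasing_rating, buys_computer).
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def predict(age, student, leasing_rating):
    """Follow the tree: split on age, then on student or leasing_rating."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # Remaining branch: age > 40.
    return "yes" if leasing_rating == "fair" else "no"

correct = sum(
    predict(age, student, rating) == label
    for age, _income, student, rating, label in ROWS
)
print(f"{correct} of {len(ROWS)} training rows classified correctly")
```

Note that income never appears in the tree: the other three attributes are enough to separate the classes in this training set.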

27
Classification: Application 1
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to
buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise.
• This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.

28
Classification: Application 2
• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and the information on its
account-holder as attributes.
– When a customer buys, what they buy, how often they
pay on time, etc.
• Label past transactions as fraud or fair transactions.
This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
29
What kind of Patterns can be mined
(Cluster Analysis)
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no
predefined classes
• Typical applications
– As a stand-alone tool to get insight into data
distribution
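
To make "grouping objects with no predefined classes" concrete, here is a minimal one-dimensional k-means sketch. k-means is one common clustering algorithm chosen for illustration; the slide does not prescribe it, and the points and starting centroids are hypothetical.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep an empty cluster's centroid
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No class labels are supplied: the two groups emerge only from the distances between the points.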

30
Clustering: Application 1
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.
31
Clustering: Application 2
• Document Clustering:
– Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
– Approach:
• To identify frequently occurring terms in each
document.
• Form a similarity measure based on the frequencies
of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the
clusters to relate a new document or search
term to clustered documents.
32
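
The document clustering approach needs a similarity measure built from term frequencies. A minimal sketch using cosine similarity is shown below; cosine similarity is an assumed choice, and the three short documents are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each term occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "stocks fell as markets closed lower",
    "markets rallied and stocks rose",
    "the team won the final match",
]
vectors = [term_frequencies(d) for d in docs]

sim_finance = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
print(f"finance vs finance: {sim_finance:.3f}")
print(f"finance vs sports:  {sim_mixed:.3f}")
```

A clustering algorithm would then group documents whose pairwise similarity is high. In practice very common words are usually filtered out first.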
Illustrating Document Clustering
• Clustering Points: 3204 Articles of Los Angeles Times.
• Similarity Measure: How many words are common in these
documents (after some word filtering).
Category      | Total Articles | Correctly Placed
Financial     | 555            | 364
Foreign       | 341            | 260
National      | 273            | 36
Metro         | 943            | 746
Sports        | 738            | 573
Entertainment | 354            | 278
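
The table is easier to interpret as percentages. The short script below computes the share of correctly placed articles per category and overall, using only the numbers from the table.

```python
# (total articles, correctly placed) per category, from the table above.
results = {
    "Financial": (555, 364),
    "Foreign": (341, 260),
    "National": (273, 36),
    "Metro": (943, 746),
    "Sports": (738, 573),
    "Entertainment": (354, 278),
}

total_articles = sum(total for total, _ in results.values())
total_correct = sum(correct for _, correct in results.values())

for category, (total, correct) in results.items():
    print(f"{category:<14}{correct / total:6.1%}")
print(f"{'Overall':<14}{total_correct / total_articles:6.1%}")
```

The totals add up to the 3204 articles mentioned on the slide, with 2257 (about 70%) correctly placed. The National category stands out with only 36 of 273 articles placed correctly.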

33
Clustering of S&P 500 Stock Data
• Observe Stock Movements every day.
• Clustering points: Stock-{UP/DOWN}
• Similarity Measure: Two points are more similar if the events
described by them frequently happen together on the same
day.
• We used association rules to quantify a similarity measure.
Cluster | Discovered Clusters | Industry Group

1 | Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN | Technology1-DOWN

2 | Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN | Technology2-DOWN

3 | Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN | Financial-DOWN

4 | Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP | Oil-UP

34
Clustering Complex (social) network
• Complex networks are large networks where local behavior generates non-trivial global features.
• Network Clustering
• Clustering coefficients – how well connected?
• What does a complex network look like when you can really see it?
• Community discovery – separate into densely connected subsets
• Automatic discovery of communities
• Split by interest or meaning
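
The clustering coefficient mentioned above can be computed per node as the fraction of pairs of its neighbours that are themselves connected. The sketch below uses this standard local definition on a tiny hypothetical network (a triangle a-b-c plus a node d attached to a).

```python
from itertools import combinations

# Hypothetical undirected network as an adjacency map.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def clustering_coefficient(graph, node):
    """Share of neighbour pairs of `node` that are linked to each other."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to check
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2 * links / (k * (k - 1))

for node in sorted(graph):
    print(node, round(clustering_coefficient(graph, node), 3))
```

Nodes inside a tightly knit community score close to 1, while a node that bridges otherwise unconnected neighbours scores close to 0.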
35
Regression
• Predict a value of a given continuous valued variable
based on the values of other variables, assuming a
linear or nonlinear model of dependency.
• Extensively studied in statistics and in the neural network field.
• Examples:
– Predicting sales amounts of a new product based on
advertising expenditure.
– Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
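
For the linear case, regression can be sketched with ordinary least squares on a single input variable. The advertising and sales figures below are hypothetical and were chosen to lie exactly on a line so that the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6 + intercept   # predicted sales for an expenditure of 6
print(slope, intercept, predicted)
```

Real data would not fall exactly on the line, and the fitted line would then be the one that minimises the squared prediction error.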

36
• What is data mining?
• Data mining on what kind of data
• Data mining : A KDD process
• What kind of patterns can be mined?
• Data mining system
• Data mining applications

37
Data Mining System
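[Figure: layered architecture of a data mining system, from top to bottom: graphical user interface; pattern evaluation; data mining engine; data warehouse server; data cleaning, data integration and filtering; databases and data warehouse. A knowledge base sits alongside these layers.]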

38
Confluence of Multiple Disciplines

[Figure: data mining at the centre, drawing on database systems, statistics, machine learning, visualization, algorithms, and other disciplines.]
39
• What is data mining?
• Data mining on what kind of data
• Data mining : A KDD process
• What kind of patterns can be mined?
• Data mining system
• Data mining applications

40
Major Application Areas for
Data Mining Solutions
• Advertising
• Customer Relationship Management (CRM)
• Database Marketing
• Fraud Detection
• eCommerce
• Health Care
• Investment/Securities
• Manufacturing, Process Control
• Sports and Entertainment
• Telecommunications
• Web
• Bioinformatics

41
Case Study: Search Engines
• Early search engines relied mainly on the keywords
on a page, which made them easy to manipulate
• Google's success is due to its algorithm,
which ranks a page mainly by the links pointing to it
• Google founders Sergey Brin and Larry
Page were students at Stanford doing
research in databases and data mining,
which led to Google in 1998

42
Case Study: Direct Marketing and CRM
• Most major direct marketing companies are
using modeling and data mining
• Most financial companies are using customer
modeling
• Modeling is easier than changing customer
behaviour
• Some successes (Homework)
– Verizon Wireless reduced churn rate from 2% to
1.5%
43
Case Study:
Security and Fraud Detection
• Credit Card Fraud Detection
• Money laundering
– FAIS (US Treasury)
• Securities Fraud
– NASDAQ Sonar system
• Phone fraud
– AT&T, Bell Atlantic, British
Telecom/MCI
• Bio-terrorism detection at Salt Lake
Olympics 2002
44
Data Mining with Privacy
• Data mining is about finding patterns in large sets of data to gain insights
or make predictions, not about tracking individual people.
• Protecting Privacy: Here's how privacy can be maintained while using
data mining:
• Replacing Personal Data: Instead of using sensitive personal details like
names or addresses, these are replaced with anonymous identifiers. This
means the data can be used without revealing who it belongs to.
• Randomized Outputs: Sometimes, data mining systems are designed to
provide outputs (results) that are slightly randomized. This helps to ensure
that the results can't be used to figure out personal details about the people
in the data.
• Multi-party Computation: This is a method where data is distributed
across different locations or parties. No single party has access to all the
information. They can work together to perform calculations or analyses
without actually sharing the sensitive data they each hold.
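
Two of these ideas can be sketched briefly: replacing names with anonymous identifiers, and a multi-party sum in which no party reveals its own value. The keyed hash (HMAC-SHA256), the additive secret-sharing scheme, the key and the sample values are all assumptions chosen for illustration; the slide does not specify particular techniques.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"hypothetical-secret-key"   # would be kept private in practice
MODULUS = 2**61 - 1                        # arithmetic is done modulo this value

def pseudonymize(name: str) -> str:
    """Replace a name with a stable anonymous identifier (keyed hash)."""
    digest = hmac.new(SECRET_KEY, name.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]

def split_into_shares(value: int, parties: int) -> list:
    """Split a value into random shares that add up to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares) -> int:
    return sum(shares) % MODULUS

# Each of three parties holds one private value and shares it out.
private_values = [10, 20, 30]
all_shares = [split_into_shares(v, parties=3) for v in private_values]

# Party i only ever sees the i-th share of every value and adds them up.
partial_sums = [sum(column) % MODULUS for column in zip(*all_shares)]
joint_total = reconstruct(partial_sums)

print(pseudonymize("Alice"), joint_total)
```

Each individual share looks like a random number, so a single party learns nothing about the others' values, yet the combined partial sums give the correct total.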
45
Summary
• Data mining: discovering interesting patterns from large
amounts of data
• KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation,
and knowledge presentation
• Mining can be performed in a variety of information
repositories
• Data mining functionalities: association, classification,
clustering, outlier and trend analysis, etc.

46
