DM Lec1

The document discusses data mining and provides definitions and examples of key concepts like association rules, clustering, and classification. It defines data mining as the extraction of useful patterns from large amounts of data and explains why it is needed due to the huge growth of digital data.

DATA MINING

Introduction
Lec 1

Mohammed
What is data mining?
• After years of data mining there is still no unique
answer to this question.

• A tentative definition:

Data mining is the use of efficient techniques for
the analysis of very large collections of data and the
extraction of useful and possibly unexpected
patterns in data.
What is data mining?
• Data mining (knowledge discovery in databases):
• Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) information or patterns from data in large
databases

• Alternative names and their “inside stories”:
• Data mining: a misnomer?
• Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, business
intelligence, etc.
What is (not) Data Mining?
● What is not Data Mining?

– Look up phone number in phone directory


– Query a Web search engine for information about
“Amazon”
Database vs Data Mining
• Database
• Find all credit applicants with last name of Smith.
• Identify customers who have purchased more than $10,000
worth of goods in the last month.
• Find all customers who have purchased milk.

• Data Mining
• Find all credit applicants who are poor credit risks.
(classification)
• Identify customers with similar buying habits. (Clustering)
• Find all items which are frequently purchased with milk.
(association rules)
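
The contrast above can be sketched in a few lines of Python. The purchase records below are hypothetical and exist only for illustration. The first computation is a database-style query (an exact filter), while the second summarises a pattern across many records, which is closer to what a data mining task asks for.

```python
from collections import Counter

# Hypothetical purchase records: (customer, items bought in one visit).
purchases = [
    ("Smith", {"milk", "bread"}),
    ("Jones", {"milk", "bread", "eggs"}),
    ("Lee", {"beer", "chips"}),
    ("Smith", {"milk", "eggs", "bread"}),
]

# Database-style query: an exact filter with a fully specified condition.
milk_buyers = sorted({name for name, items in purchases if "milk" in items})

# Mining-style question: which items tend to appear together with milk?
# The answer is a pattern summarised over many records, not a list of rows.
bought_with_milk = Counter(
    item
    for _, items in purchases
    if "milk" in items
    for item in items
    if item != "milk"
)

print(milk_buyers)
print(bought_with_milk.most_common(2))
```

Counting co-occurrences is only the simplest possible starting point; it says nothing yet about how strong or reliable such a pattern is.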
A Bit of History
•We are drowning in data, but starving for knowledge.
(John Naisbitt, 1982)

•It has been estimated that the amount of information
in the world doubles every 20 months.
(Frawley, Piatetsky-Shapiro, Matheus, 1992)
We are Drowning in Data...

Wikipedia (en, text only)


≈ 20 GB of data

James Webb
Telescope
≈57 GB/day
≈21 TB/year
We are Drowning in Data...

Facebook
≈12 TB/day added
(as of Mar. 2010 )

Google
≈20 PB/day processed
(Jan. 2010 )
We are Drowning in Data...
...but starving for knowledge!

Rate at which data are produced

Rate at which data can be understood


manual interpretation is hardly feasible!
Motivation:
• Data explosion problem
• Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
• Solution: Data warehousing and data mining
• Data warehousing and on-line analytical processing
• Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases


Why Mine Data?
• Commercial Viewpoint:
• Lots of data is being collected
and warehoused
• Web data, e-commerce
• purchases at department/
grocery stores
• Bank/Credit Card
transactions

• Scientific Viewpoint:
• Data collected and stored at
enormous speeds (GB/hour)
• remote sensors on a satellite
• telescopes scanning the skies
• microarrays generating gene
expression data
• scientific simulations
generating terabytes of data
Data is power!
• “The data is the computer”
• Large amounts of data can be more powerful than
complex algorithms and models
• Google has solved many Natural Language Processing problems,
simply by looking at the data
• Example: misspellings, synonyms
• Data is power!
• Today, the collected data is one of the biggest assets of an
online company
• Query logs of Google
• The friendship and updates of Facebook
• Tweets and follows of Twitter
• Amazon transactions
Data is power!
• Competitive Pressure is Strong
• Provide better, customized services for an edge (e.g. in Customer
Relationship Management)

• Traditional techniques infeasible for raw data


• Data mining may help scientists
•Really, really huge amounts of raw data!!
•In the digital age, terabytes of data are generated
every second
•Mobile devices, digital photographs, web
documents.
•Facebook updates, Tweets, Blogs, User-
generated content
•Transactions, sensor data, surveillance data
•Queries, clicks, browsing
•Cheap storage has made it possible to
maintain this data
Data, Information, Knowledge, and
Wisdom
Example (Cholera)
• Cholera disease
• From beginning of 19th century
• ~100,000 deaths per year
– until today!
• For a long time,
there was little knowledge
– on ways of infection
– on causes of the disease
Example (CoViD-19)
• Data mining can help us understand
– pathways and chains of infection
– critical preconditions of patients
• previous diseases
• medications
• genetic preconditions
– effectiveness of prevention strategies
What is Data Mining again?
• “Data mining is the discovery of models for data”
(Rajaraman, Ullman)
Origins of Data Mining
• Draws ideas from machine learning, statistics, and
database systems.
• Traditional techniques may be unsuitable due to
– large amount of data
– high dimensionality of data
– heterogeneous, distributed nature of data
Data Mining: Classification Schemes
• Decisions in data mining
• Kinds of databases to be mined
• Kinds of knowledge to be discovered
• Kinds of techniques utilized
• Kinds of applications adapted
• Data mining tasks
• Descriptive data mining
• Predictive data mining

Decisions in Data Mining
• Databases to be mined
• Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy,
WWW, etc.
• Knowledge to be mined
• Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Data Mining Tasks
• Prediction Tasks
• Use some variables to predict unknown or future values of other
variables
• Description Tasks
• Find human-interpretable patterns that describe the data.

Common data mining tasks


• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
Data Mining Models and Tasks
ASSOCIATION RULES
Frequent Itemsets and Association
Rules
• Given a set of records, each of which contains some number
of items from a given collection:
• Identify sets of items (itemsets) occurring frequently
together
• Produce dependency rules which will predict occurrence
of an item based on occurrences of other items.

Itemsets Discovered:
{Milk,Coke}
{Diaper, Milk}

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Tan, M. Steinbach and V. Kumar, Introduction to Data Mining
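
As a minimal sketch of how "occurring frequently together" and "dependency rules" can be quantified, the code below computes support and confidence, two measures commonly used for this purpose. The slide itself does not define them, and the five transactions are hypothetical illustration data, so treat both as assumptions.

```python
# Hypothetical market-basket transactions, one set of items per record.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that also
    contain `consequent`. Assumes the antecedent occurs at least once."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


for itemset in ({"Milk", "Coke"}, {"Diaper", "Milk"}):
    print(sorted(itemset), f"support={support(itemset, transactions):.2f}")

for lhs, rhs in (({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})):
    print(sorted(lhs), "-->", sorted(rhs),
          f"confidence={confidence(lhs, rhs, transactions):.2f}")
```

An itemset would count as "frequent" when its support exceeds a chosen threshold, and a rule would be reported when its confidence is also high enough; both thresholds are left to the analyst.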


Frequent Itemsets: Applications
• Text mining: finding associated phrases in text
• There are lots of documents that contain the phrases
“association rules”, “data mining” and “efficient
algorithm”

• Recommendations:
• Users who buy this item often buy that item as well
• Users who watched James Bond movies also watched
Jason Bourne movies.

• Recommendations make use of item and user similarity


Association Rule Discovery:
Application
• Supermarket shelf management.
• Goal: To identify items that are bought together by
sufficiently many customers.
• Approach: Process the point-of-sale data collected
with barcode scanners to find dependencies among
items.
• A classic rule:
• If a customer buys diapers and milk, then they are very likely to
buy beer.
• So, don’t be surprised if you find six-packs stacked next to
diapers!



CLUSTERING
Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
• Data points in one cluster are more similar to one
another.
• Data points in separate clusters are less similar to
one another.
• Similarity Measures?
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
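
The definition above can be illustrated with a small sketch. It uses Euclidean distance as the similarity measure and k-means as the clustering procedure; k-means is chosen here only because it is a familiar algorithm, since the slides do not name one. The points and the initial centroids are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical 2-D data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
print(centroids)
```

The result depends on the starting centroids and on the number of clusters, both of which are fixed by hand in this sketch.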



Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.

Intracluster distances are minimized; intercluster distances are maximized.



Clustering: Application 2
• Document Clustering:
• Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
• Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
• Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
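
The approach above can be sketched as follows. The slide only asks for "a similarity measure based on the frequencies of different terms"; cosine similarity over raw term counts is one common choice and is an assumption here, as are the three example documents. The sketch stops at the similarity step; feeding these similarities into a clustering algorithm is not shown.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated, lower-cased term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data",
    "d3": "the telescope scans the sky",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(f"d1 vs d2: {sim_12:.2f}")
print(f"d1 vs d3: {sim_13:.2f}")
```

Documents that share many terms score close to 1 and documents with no terms in common score 0, so d1 and d2 would end up in the same group before either joins d3.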



CLASSIFICATION
Classification: Definition
• Given a collection of records (training set )
• Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for the class attribute as a function of
the values of the other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
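
A minimal sketch of this workflow follows: split the data, build a model from the training set, and measure accuracy on the test set. The records are hypothetical, and a 1-nearest-neighbour rule stands in for "the model" purely for brevity; the slides do not prescribe a particular classifier. The split is systematic rather than random so that the example is reproducible.

```python
import math

# Hypothetical records: (income, debt) in $1000s, plus the class attribute.
records = [
    ((60, 5), "good"), ((75, 10), "good"), ((80, 2), "good"), ((65, 8), "good"),
    ((20, 30), "poor"), ((25, 40), "poor"), ((30, 35), "poor"), ((15, 25), "poor"),
]

# Hold out every fourth record as the test set; the rest is the training set.
test_set = records[::4]
training_set = [r for i, r in enumerate(records) if i % 4 != 0]


def predict(attributes, training_set):
    """1-nearest-neighbour: return the class of the closest training record."""
    _, label = min(training_set, key=lambda r: math.dist(attributes, r[0]))
    return label


# Accuracy: the fraction of test records whose predicted class is correct.
correct = sum(predict(attrs, training_set) == label for attrs, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy on {len(test_set)} test records: {accuracy:.2f}")
```

The test records play no part in building the model, which is what makes the measured accuracy an estimate of performance on previously unseen records.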
Classification Example
[Figure: a training set whose columns are marked categorical, categorical, continuous and class is used to learn a classifier model, which is then applied to a test set.]



Classification: Application 1
• Ad Click Prediction
• Goal: Predict if a user that visits a web page will click
on a displayed ad. Use it to target users with high
click probability.
• Approach:
• Collect data for users over a period of time and record who
clicks and who does not. The {click, no click} information
forms the class attribute.
• Use the history of the user (web pages browsed, queries
issued) as the features.
• Learn a classifier model and test on new users.
Classification: Application 2
• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
• Use credit card transactions and information about the account
holder as attributes.
• When does a customer buy, what do they buy, how often do they pay on time,
etc.
• Label past transactions as fraud or fair transactions. This forms the
class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions
on an account.



ANY QUESTIONS?
