Datamining 1

Uploaded by

castiron1998

Introduction

DATA MINING

Why Data Mining?
Necessity, who is the mother of invention. – Plato

 We are drowning in data, but starving for knowledge!

 The explosive growth of data: from terabytes to petabytes

 Major sources of abundant data

 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …

 Society and everyone: news, digital cameras, YouTube

Why Data Mining?
 Data mining turns a large collection of data into knowledge

 A search engine (e.g., Google) receives hundreds of millions of queries every day
 Each query can be viewed as a transaction in which the user describes her or his information need
 Some patterns found in user search queries can disclose invaluable knowledge that cannot be obtained by reading individual data items alone

Data Mining

Searching for knowledge (interesting patterns) in data.

What Is Data Mining?

 Data mining (knowledge discovery from data)

 Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer?

 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”? Not necessarily; the following are not:
 Simple search and query processing
 (Deductive) expert systems

Data Mining Applications

Data Mining for Financial Data Analysis

 Design and construction of data warehouses

 Loan payment prediction and customer credit
policy analysis
 Classification and clustering of customers for
targeted marketing
 Detection of money laundering and other financial
crimes

Knowledge Discovery (KDD) Process
 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process diagram. Stages, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]

Knowledge Discovery (KDD) Process
 Data cleaning (to remove noise and inconsistent data)
 Data integration (where multiple data sources may be
combined)
 Data selection (where data relevant to the analysis task are
retrieved from the database)
 Data transformation (where data are transformed and
consolidated into forms appropriate for mining by performing
summary or aggregation operations)
 Data mining (an essential process where intelligent methods
are applied to extract data patterns)
 Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on interestingness measures)
 Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined
knowledge to users)
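
The seven steps above can be sketched end to end on a toy example. This is only a minimal illustration under stated assumptions, not a production pipeline: the two record lists, the field names (`customer`, `item`, `amount`), and the 50% threshold are all hypothetical and chosen only to make each step visible.

```python
from collections import Counter

# Hypothetical raw data from two sources (all names and values are illustrative).
store_sales = [
    {"customer": "c1", "item": "computer", "amount": 900.0},
    {"customer": "c2", "item": "software", "amount": None},  # noisy record: missing amount
    {"customer": "c1", "item": "software", "amount": 120.0},
]
web_sales = [
    {"customer": "c3", "item": "computer", "amount": 950.0},
    {"customer": "c3", "item": "software", "amount": 100.0},
    {"customer": "c4", "item": "printer", "amount": 200.0},
]

# 1. Data cleaning: drop records with missing values.
def clean(records):
    return [r for r in records if all(v is not None for v in r.values())]

# 2. Data integration: combine the cleaned sources.
integrated = clean(store_sales) + clean(web_sales)

# 3. Data selection: keep only the fields relevant to the task.
selected = [(r["customer"], r["item"]) for r in integrated]

# 4. Data transformation: consolidate into one basket (set of items) per customer.
baskets = {}
for customer, item in selected:
    baskets.setdefault(customer, set()).add(item)

# 5. Data mining: count how many baskets contain each item.
item_counts = Counter(item for items in baskets.values() for item in items)

# 6. Pattern evaluation: keep items that appear in at least half of the baskets.
min_support = 0.5
patterns = {
    item: count / len(baskets)
    for item, count in item_counts.items()
    if count / len(baskets) >= min_support
}

# 7. Knowledge presentation: report the patterns, most frequent first.
for item, support in sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{item}: support = {support:.0%}")
```

In this toy data, the record with a missing amount is removed during cleaning, so only three customers' baskets reach the mining step.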
Data Warehouses
 A data warehouse is a repository of information
collected from multiple sources, stored under a unified
schema, and usually residing at a single site.
 It is usually modeled by a multidimensional data
structure, called a data cube
 In a data cube, each dimension corresponds to an
attribute or a set of attributes in the schema
 Each cell stores the value of some aggregate measure,
such as a count
 A data cube provides a multidimensional view of data
and allows the pre-computation and fast access of
summarized data
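
The idea of pre-computing aggregates per combination of dimensions can be sketched as follows. This is a minimal illustration, assuming a hypothetical sales table with three dimensions (`item`, `city`, `quarter`) and a count measure; real data warehouses use dedicated storage and far more efficient cube computation.

```python
from collections import Counter

# Hypothetical sales records: (item, city, quarter). Values are illustrative only.
sales = [
    ("computer", "Chicago", "Q1"),
    ("computer", "Chicago", "Q1"),
    ("computer", "Toronto", "Q1"),
    ("software", "Chicago", "Q2"),
    ("software", "Toronto", "Q2"),
]

DIMENSIONS = ("item", "city", "quarter")

def build_cuboid(records, dims):
    """Aggregate a count measure over the chosen dimensions (given by name)."""
    idx = [DIMENSIONS.index(d) for d in dims]
    return Counter(tuple(r[i] for i in idx) for r in records)

# Pre-compute one cuboid per subset of dimensions we expect to query.
cube = {
    ("item", "city", "quarter"): build_cuboid(sales, ("item", "city", "quarter")),
    ("item", "city"): build_cuboid(sales, ("item", "city")),
    ("item",): build_cuboid(sales, ("item",)),
}

# Fast access: summarized counts are looked up, not recomputed.
print(cube[("item", "city")][("computer", "Chicago")])
print(cube[("item",)][("software",)])
```

Each key of `cube` is one level of summarization; moving from `("item",)` to `("item", "city")` corresponds to drilling down on the city dimension.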
Data Warehouses

Data Mining: On What Kinds of Data?

 Database-oriented data sets and applications

 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web

Data Mining Functionalities
 Data mining functionalities are used to specify the
kinds of patterns to be found in data mining tasks

 In general, such tasks can be classified into two categories:
 Descriptive: characterizes properties of the data in a target data set
 Predictive: performs induction on the current data in order to make predictions

Generalization

 Information integration and data warehouse construction

 Data cleaning, transformation, integration, and
multidimensional data model

 Multidimensional concept description: characterization and discrimination
 Generalize, summarize, and contrast data characteristics

Example: Data Characterization
 A customer relationship manager at
“ABCElectronics” may order the following data
mining task: Summarize the characteristics of
customers who spend more than $5000 a year at
“ABCElectronics”.
 The result is a general profile of these customers,
such as that they are 40 to 50 years old, employed,
and have excellent credit ratings.
 The data mining system should allow the customer
relationship manager to drill down on any
dimension, such as on occupation to view these
customers according to their type of employment
Example: Data Discrimination
 A customer relationship manager at “ABCElectronics” may want to compare two groups of customers: those who shop for computer products regularly (e.g., more than twice a month) and those who rarely shop for such products (e.g., fewer than three times a year)
 The resulting description provides a general comparative profile of these customers, such as that 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education, whereas 60% of the customers who infrequently buy such products are either seniors or youths and have no university degree.
Mining Frequent Patterns, Association
and Correlation Analysis
 Frequent patterns, such as frequent item sets, are patterns that occur frequently in data.
 A frequent item set typically refers to a set of items that often appear together in a transactional data set
 For example, milk and bread, which are frequently bought together in grocery stores by many customers
 What items are frequently purchased together in your Walmart?
 A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern
 Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
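
As a rough illustration of finding items that are frequently bought together, the sketch below counts every pair of items per transaction by brute force and keeps the pairs that reach a minimum count. The transactions and the function name `frequent_pairs` are hypothetical, and this is plain pair counting rather than an optimized frequent-pattern mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical grocery transactions (illustrative data only).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]

def frequent_pairs(transactions, min_count):
    """Return item pairs that appear together in at least min_count transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

print(frequent_pairs(transactions, min_count=2))
```

With this data, bread and milk appear together in three transactions and bread and butter in two; every other pair appears only once and is filtered out.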
Association and Correlation Analysis

 Suppose that, as a marketing manager at “ABCElectronics”, you want to know which items are frequently purchased together
 An example of such a rule:
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

 A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well
 A 1% support means that 1% of all the transactions under analysis show that computer and software are purchased together
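
The two measures can be computed directly from transaction counts: support is the fraction of all transactions that contain both sides of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The sketch below assumes a hypothetical, non-empty list of 100 transactions constructed so that the numbers match the rule above; the function name `rule_metrics` is illustrative.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions containing the left-hand side of the rule.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions containing both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical data: 100 transactions, 2 with a computer, 1 of those also with software.
transactions = (
    [{"computer", "software"}]
    + [{"computer"}]
    + [{"printer"}] * 98
)

support, confidence = rule_metrics(transactions, {"computer"}, {"software"})
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

Here 1 of 100 transactions contains both items (support 1%), and 1 of the 2 transactions with a computer also contains software (confidence 50%).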
Question

 A data mining system may find association rules as follows:
age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support = 2%, confidence = 60%]

 What does the above association rule indicate?

Answer
 The rule indicates that of all the customers under
study, 2% are 20 to 29 years old with an income of
$40,000 to $49,000 and have purchased a laptop
(computer)

 There is a 60% probability that a customer in this age and income group will purchase a laptop.

Classification
 Classification and label prediction
 Construct models (functions) based on some training
examples
 Describe and distinguish classes or concepts for future
prediction
 E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
 Predict some unknown class labels
 Typical methods
 Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
 Typical applications: Credit card fraud detection, direct
Some Classification Tools

Classification and Regression

 Suppose that, as a sales manager, you want to classify a large set of items in the store based on three kinds of responses to a sales campaign: good response, mild response, and no response.
 You want to derive a model for each of these three classes based on the descriptive features of the items, such as price, brand, place made, type, and category
 Suppose instead that, rather than predicting categorical response labels for each store item, you would like to predict the amount of revenue that each item will generate during an upcoming sale, based on the previous sales data
 This is an example of regression, because the predicted value is numeric rather than categorical
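
The contrast between the two tasks can be sketched with a single feature. In this minimal illustration, the prices, response labels, and revenue figures are hypothetical, a 1-nearest-neighbor rule stands in for a classifier, and a least-squares line stands in for a regression model; neither is presented as the method a real system would use.

```python
# Hypothetical training data for six store items (all values are illustrative).
prices = [10.0, 20.0, 50.0, 60.0, 120.0, 130.0]
responses = ["good", "good", "mild", "mild", "none", "none"]   # categorical target
revenues = [1900.0, 1800.0, 1500.0, 1400.0, 800.0, 700.0]      # numeric target

def nearest_neighbor_label(xs, labels, new_x):
    """Classification: return the label of the training item with the closest price."""
    closest = min(range(len(xs)), key=lambda i: abs(xs[i] - new_x))
    return labels[closest]

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

new_price = 55.0

# Classification predicts one of the three response classes.
predicted_class = nearest_neighbor_label(prices, responses, new_price)

# Regression predicts a numeric revenue value.
slope, intercept = fit_line(prices, revenues)
predicted_revenue = slope * new_price + intercept

print(predicted_class, predicted_revenue)
```

The same input (a price) yields a class label in the first case and a number in the second, which is the distinction the slide draws.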

Cluster Analysis
 Unsupervised learning (i.e., the class label is unknown)
 Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
 Principle: maximize intraclass similarity and minimize interclass similarity
 Many methods and applications
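
One of the many clustering methods, k-means, can be sketched in a few lines. This is a minimal illustration, assuming hypothetical house locations as 2-D points and two hand-picked starting centroids; practical implementations choose initial centroids more carefully and stop when assignments no longer change.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical house locations (illustrative coordinates only).
houses = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

centroids, clusters = kmeans(houses, centroids=[houses[0], houses[3]])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge only from the distances between points, so houses within a cluster are close to each other and far from the other cluster.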

Outlier Analysis

 Outlier analysis
 Outlier: A data object that does not comply with the general behavior of
the data
 Noise or exception? ― One person’s garbage could be another person’s
treasure
 Methods: by-product of clustering or regression analysis, …
 Useful in fraud detection and rare-event analysis

 Example: Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of unusually large amounts for a given account number in comparison to regular charges incurred by the same account.
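
The credit card example can be sketched by comparing each charge with the typical charge on the same account. The sketch below uses a modified z-score based on the median and the median absolute deviation (MAD), with the commonly used constant 0.6745 and cutoff 3.5; this scoring rule, the account names, and the amounts are assumptions made for illustration, not a method stated in the slides.

```python
from statistics import median

def unusual_charges(charges, threshold=3.5):
    """Return charges that are unusually large relative to the account's own history."""
    med = median(charges)
    # Median absolute deviation: a spread estimate that the outlier itself barely affects.
    mad = median([abs(c - med) for c in charges])
    if mad == 0:
        return []  # all charges (nearly) identical: nothing to compare against
    return [c for c in charges if 0.6745 * (c - med) / mad > threshold]

# Hypothetical charge histories per account (illustrative data only).
accounts = {
    "acct-1": [25.0, 40.0, 32.0, 28.0, 35.0, 30.0, 2500.0],
    "acct-2": [900.0, 1100.0, 1000.0, 950.0, 1050.0],
}

flagged = {acct: unusual_charges(charges) for acct, charges in accounts.items()}
print(flagged)
```

Because each account is scored against its own regular charges, a 1,000 charge is normal for the second account while 2,500 stands out on the first.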

Technologies Used

Technologies Used
 Statistics

 Data mining has an inherent connection with statistics.

 Statistics studies the collection, analysis, interpretation or explanation, and presentation of data

 Statistical models are widely used to model data and data classes

Technologies Used

 Machine Learning

 Machine learning investigates how computers can learn (or improve their performance) based on data

 For example, a typical machine learning problem is to program a computer so that it can automatically recognize handwritten postal codes on mail after learning from a set of examples

Technologies Used

 Information Retrieval
 Information retrieval is the science of searching for documents or for information in documents

 Documents can be text or multimedia, and may reside on the Web

Major Issues
 Mining various and new kinds of knowledge

 Mining knowledge in multidimensional space

 Data mining—an interdisciplinary effort

 Handling uncertainty, noise, or incompleteness of data

 Pattern evaluation and pattern- or constraint-guided mining
