
Chapter 1 Data Mining Lecture Note

The document provides an introduction to data mining and warehousing. It discusses why data mining has become important due to the massive growth of data from various sources. Data mining involves automated analysis of large datasets to discover useful patterns and knowledge that would otherwise remain unknown. It describes the key steps in the knowledge discovery process including data cleaning, integration, selection, transformation, mining, pattern evaluation and presentation. The document also outlines different types of data mining techniques like classification, clustering, association rule mining and their applications.

Uploaded by

naol


Data Mining and Warehousing

Chapter One
Introduction
Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data: Business, science and society
 The computing power is available and is affordable
 DM commercial products and machine learning algorithms are available
 The competitive pressure is very strong
• How to gain competitive advantage?
• How to control the volatile market?
• How to satisfy customers' (prosumers') needs?
 “Necessity is the mother of invention”: data mining is the automated analysis of
massive data sets. We are drowning in data, but starving for knowledge!



 We are data rich, but information poor.

What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
 What is not data mining?
 Simple search and query processing
 Expert systems or small statistical programs

Knowledge Discovery (KDD) Process

 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
KDD Process: Several Key Steps
 Learning the application domain
 Relevant prior knowledge and goals of application
 Identifying a target data set
 Data processing
 Data cleaning (remove noise and inconsistent data)
 Data integration (multiple data sources may be combined)
 Data selection (data relevant to the analysis task are retrieved from the database)
 Data transformation (data are consolidated into forms appropriate for mining)
 Data mining (an essential process where intelligent methods are applied to extract
data patterns)
 Pattern evaluation (identify the truly interesting patterns)
 Knowledge presentation (mined knowledge is presented to the user with
representation techniques)

 Use of discovered knowledge
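The steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the function names, and the thresholds are all hypothetical, and the "mining" step is reduced to simple counting.

```python
from collections import Counter

# Two hypothetical sources holding purchase records (illustrative data only).
store_a = [
    {"customer": "c1", "item": "Bread", "amount": 20.0},
    {"customer": "c2", "item": "bread ", "amount": 25.0},
    {"customer": "c3", "item": "Milk", "amount": None},  # noisy record
]
store_b = [
    {"customer": "c4", "item": "Bread", "amount": 30.0},
    {"customer": "c5", "item": "Butter", "amount": 45.0},
]

def clean(records):
    """Data cleaning: drop records with missing amounts, normalize item names."""
    return [
        {**r, "item": r["item"].strip().lower()}
        for r in records
        if r["amount"] is not None
    ]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, min_amount):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Data transformation: reduce each record to the form the miner needs."""
    return [r["item"] for r in records]

def mine(items):
    """Data mining (here just counting how often each item occurs)."""
    return Counter(items)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that meet the threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}

data = integrate(clean(store_a), clean(store_b))
patterns = mine(transform(select(data, min_amount=20.0)))
interesting = evaluate(patterns, min_count=2)
print(interesting)  # {'bread': 3}
```

Each function corresponds to one step, so a real system could replace any stage (for example, a proper mining algorithm in place of `mine`) without touching the others.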

Why the focus shifts to “Knowledge”
 We are living in a dynamic, complex environment,
which is characterized by:
 Competitors
 Very strong competition
 Market
 Volatility of the market
 The business landscape is changing rapidly and non-linearly
 Customers/Consumers
 Customers have reached the level of prosumers
 Prosumers are more educated consumers who provide feedback
regarding the products/services they need
 Professionals
 The high turnover rate of professionals
 Diminishing individual experience
Data Mining Applications
 Market Analysis
 Targeted marketing/customer profiling
 Find clusters of ‘model’ customers who share the
same characteristics: interests, income level, spending
habits, etc.
 Determine customer purchasing habits over time
 Cross-market analysis
 Association/correlation between product sales
 Prediction based on the association information
 Provide summary information
 Various multidimensional summary reports
Data Mining Applications
 Corporate Analysis and Risk Management
 Finance Planning and Asset Evaluation
 Cash flow analysis and prediction
 Trend analysis, time series etc.
 Resource planning
 Summarize and compare resources and spending
 Competition
 Monitor competitors and market directions
 Group customers into classes and apply a class-based pricing procedure
 Set pricing strategy in a highly competitive market
 Fraud detection/Network intrusion detection
 Which types of transactions are likely to be fraudulent, given the
demographics and transactional history of a particular customer?
Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web

Database Processing vs. Data Mining Processing

Query Examples
 Database
 Find all credit applicants with first name ‘Alex’.
 Identify customers who have made purchases worth more than Birr
10,000 in the last month.
 Find all customers who have purchased Bread.
 Data Mining
 Find all credit applicants who pose no credit risk.
(classification)
 Identify customers with similar buying habits.
(clustering)
 Find all items which are frequently purchased with
Bread. (association rules)
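The contrast between the two kinds of question can be sketched in a few lines. The transactions and the 0.6 threshold below are hypothetical; the point is only that the query returns facts stored explicitly, while the mining-style question derives a pattern that is stored nowhere.

```python
from collections import Counter

# Hypothetical transactions: customer id -> set of items bought together.
transactions = {
    "c1": {"Bread", "Butter", "Milk"},
    "c2": {"Bread", "Butter"},
    "c3": {"Milk", "Eggs"},
    "c4": {"Bread", "Butter", "Eggs"},
}

# Database-style query: an exact lookup whose answer is stored explicitly.
bread_buyers = sorted(c for c, items in transactions.items() if "Bread" in items)

# Mining-style question: which items co-occur with Bread often enough?
co_counts = Counter(
    item
    for items in transactions.values()
    if "Bread" in items
    for item in items - {"Bread"}
)
min_fraction = 0.6  # assumed threshold, chosen only for this example
frequent_with_bread = sorted(
    item for item, n in co_counts.items() if n / len(bread_buyers) >= min_fraction
)

print(bread_buyers)         # ['c1', 'c2', 'c4']
print(frequent_with_bread)  # ['Butter']
```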
Data Mining Functionalities
What kind of patterns can be mined?
 Data mining functionalities are used to specify the kind
of patterns to be found in data mining tasks.
 In general, data mining tasks can be classified into two
categories: descriptive and predictive.
 Descriptive mining tasks characterize the general
properties of the data in the database.
 Predictive mining tasks perform inference on the
current data in order to make predictions.
 Users may have no idea regarding what kinds of
patterns in their data may be interesting, and hence may
like to search for several different kinds of patterns in
parallel.
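As a rough illustration of the two categories, the sketch below first describes a small data set and then predicts an unseen value from it. The sales figures are made up, and a least-squares line is used here only as one simple stand-in for a predictive method.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative only).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

# Descriptive task: characterize general properties of the data.
summary = {"min": min(sales), "max": max(sales), "mean": mean(sales)}

# Predictive task: fit a least-squares line and infer a value for month 6.
mx, my = mean(months), mean(sales)
slope = sum((x - mx) * (y - my) for x, y in zip(months, sales)) / sum(
    (x - mx) ** 2 for x in months
)
intercept = my - slope * mx
forecast_month_6 = intercept + slope * 6

print(summary)
print(forecast_month_6)
```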
Data Mining Functionalities
Association and Correlation Analysis
 Frequent patterns (or frequent itemsets)
 What items are frequently purchased together in your
Walmart?
 Association, correlation vs. causality
 A typical association rule
 Diaper → Beer [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly correlated?
 How to mine such patterns and rules efficiently in large
datasets?
 How to use such patterns for classification, clustering, and
other applications?
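A minimal sketch of how the two numbers attached to a rule are computed: support is the fraction of transactions containing both sides of the rule, and confidence is the fraction of transactions containing the left side that also contain the right side. The baskets are hypothetical. Lift (confidence divided by the overall frequency of the right side) is not mentioned in the slides; it is added here as one common way to check whether an association also reflects a positive correlation.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    cons = sum(1 for t in transactions if consequent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    # Lift > 1 suggests the two sides occur together more often than by chance.
    lift = confidence / (cons / n) if cons else 0.0
    return support, confidence, lift

# Hypothetical market baskets (illustrative only).
baskets = [
    {"Diaper", "Beer", "Milk"},
    {"Diaper", "Beer"},
    {"Diaper", "Bread"},
    {"Beer", "Bread"},
    {"Milk", "Bread"},
]
support, confidence, lift = rule_metrics(baskets, {"Diaper"}, {"Beer"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

In this toy data the rule has support 2/5 and confidence 2/3. A rule can have high confidence and still a lift close to 1, which is why association alone does not establish correlation, let alone causality.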
Data Mining Functionalities
Classification and Prediction
 Classification
 The process of finding a model that describes and distinguishes the
data classes or concepts, for the purpose of being able to use the model
to predict the class of objects whose class label is unknown.
 The derived model is based on the analysis of a set of training data
(data objects whose class label is known).

 Prediction
 Predict missing or unavailable numerical data values
 Typical methods
 Decision trees, naïve Bayesian classification, support vector machines, neural
networks, rule-based classification, pattern-based classification, logistic
regression, …
 Typical applications:
 Credit card fraud detection, direct marketing, classifying stars, diseases, web
pages, …
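The train-then-predict idea can be sketched with a nearest-centroid classifier. This method is not one of those listed above; it is used here only because it fits in a few lines. The training data, the feature meaning, and the labels are hypothetical.

```python
from collections import defaultdict
from math import dist

def train(samples):
    """Build the model: one mean vector (centroid) per class label."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(model, features):
    """Assign the label of the closest centroid to an unlabeled object."""
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical training data: (income, debt) in thousands -> known class label.
training = [
    ((50.0, 5.0), "low risk"),
    ((60.0, 10.0), "low risk"),
    ((20.0, 30.0), "high risk"),
    ((25.0, 40.0), "high risk"),
]
model = train(training)
label = predict(model, (55.0, 8.0))
print(label)  # low risk
```

The model is derived only from objects whose class label is known, and is then applied to an object whose label is unknown, which mirrors the definition of classification given above.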
Data Mining Functionalities
Cluster Analysis and Outlier Analysis
Cluster Analysis
 Unsupervised learning (i.e., Class label is unknown)
 Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
 Principle: Maximizing intra-class similarity and minimizing
interclass similarity
Outlier Analysis
 Outlier: a data object that does not comply with the general behavior
of the data
 Noise or exception? One person’s garbage could be another
person’s treasure
 Methods: by-product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis
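A small sketch of clustering with outlier detection as a by-product, using plain k-means on one-dimensional data. The prices, the starting centroids, and the distance threshold of 200 are all assumptions made for this example.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else old for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical house prices (in thousands); 950 is far from both groups.
prices = [100, 110, 120, 300, 310, 320, 950]
centroids, clusters = kmeans_1d(prices, centroids=[100.0, 300.0])

# By-product of clustering: objects far from their own centroid are flagged.
max_distance = 200  # assumed threshold
outliers = [
    v
    for centre, members in zip(centroids, clusters)
    for v in members
    if abs(v - centre) > max_distance
]

print(centroids)  # [110.0, 470.0]
print(outliers)   # [950]
```

Note that the extreme value pulls its cluster's centroid toward itself (470 rather than 310), which is one reason outliers are often handled before or alongside clustering. Whether 950 is noise or a valuable exception depends on the application.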
Are All of the Patterns Interesting?
 Data mining may generate thousands of patterns: Not all of them
are interesting
 A pattern is interesting if it is
 easily understood by humans
 valid on new or test data with some degree of certainty
 potentially useful
 novel
 A pattern is also interesting if it validates a hypothesis that the user seeks to confirm
 An interesting pattern represents knowledge!

Are All of the Patterns Interesting?
 Objective measures
 Based on statistics and structures of patterns, e.g., support,
confidence, etc. (Rules that do not satisfy a threshold are
considered uninteresting.)
 Subjective measures
 Reflect the needs and interests of a particular user.
 e.g., a marketing manager is only interested in characteristics of customers
who shop frequently.
 Based on the user’s beliefs about the data.
 e.g., patterns are interesting if they are unexpected, or can be used for strategic
planning, etc.

 Objective and subjective measures need to be combined.
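One way to picture the combination is a two-stage filter: an objective test on support and confidence, followed by a subjective test of whether the rule is unexpected for a given user. The rules, their measure values, the thresholds, and the user's expectations below are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    antecedent: str
    consequent: str
    support: float
    confidence: float

# Hypothetical mined rules with made-up measure values.
rules = [
    Rule("Diaper", "Beer", 0.005, 0.75),
    Rule("Bread", "Butter", 0.20, 0.80),
    Rule("Milk", "Eggs", 0.001, 0.90),
    Rule("Bread", "Milk", 0.15, 0.30),
]

def objectively_interesting(rule, min_support=0.004, min_confidence=0.5):
    """Objective measure: the rule must satisfy both thresholds."""
    return rule.support >= min_support and rule.confidence >= min_confidence

def subjectively_interesting(rule, expected_pairs):
    """Subjective measure: the rule is unexpected for this particular user."""
    return (rule.antecedent, rule.consequent) not in expected_pairs

# What this (hypothetical) user already believes about the data.
expected = {("Bread", "Butter")}

selected = [
    r
    for r in rules
    if objectively_interesting(r) and subjectively_interesting(r, expected)
]
print([(r.antecedent, r.consequent) for r in selected])  # [('Diaper', 'Beer')]
```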

Data Mining: Confluence of Multiple Disciplines

[Figure: Data mining at the center, drawing on machine learning, pattern recognition, statistics, visualization, high-performance computing, database technology, algorithms, and information retrieval]
Data Mining: Confluence of Multiple Disciplines
 Statistics: studies the collection, analysis, interpretation or
explanation, and presentation of data.
 Machine learning investigates how computers can learn (or
improve their performance) based on data. A main research area is
for computer programs to automatically learn to recognize complex
patterns and make intelligent decisions based on data.
 Database systems and data warehouses: database
systems research focuses on the creation, maintenance, and use
of databases for organizations and end-users. A data warehouse
integrates data originating from multiple sources and various
timeframes.
 Information retrieval (IR) is the science of searching for
documents or information in documents. Documents can be text
or multimedia, and may reside on the Web.
Multi-Dimensional View of Data Mining
 Data to be mined
 Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
 Knowledge to be mined
 Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
Major Issues in Data Mining
The major issues in data mining are classified into five groups:
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results
Major Issues in Data Mining (2)
 Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining

Data Mining System Classification
 A data mining system can be classified according to
the following criteria:
 Database Technology
 Statistics
 Machine Learning
 Information Science
 Visualization
 Apart from these, a data mining system can also be
classified based on the kind of (a) databases mined,
(b) knowledge mined, (c) techniques utilized, and (d)
applications adapted.
Data Mining System Classification
Classification Based on the Databases Mined
 We can classify a data mining system according to the kind of
databases mined. Database systems can be classified according
to different criteria such as data models, types of data, etc.
 The data mining system can then be classified accordingly.
 For example, if we classify a database according to the data
model, then we may have a relational, transactional, object-
relational, or data warehouse mining system.
Classification Based on the Techniques Utilized
 We can classify a data mining system according to the kind of
techniques used. We can describe these techniques according
to the degree of user interaction involved or the methods of
analysis employed.
Data Mining System Classification
Classification Based on the Kind of Knowledge Mined
 We can classify a data mining system according to the kind of
knowledge mined. It means the data mining system is
classified on the basis of functionalities such as:
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Clustering
 Outlier Analysis
 Evolution Analysis
Data Mining System Classification
Classification Based on the Applications Adapted
 We can classify a data mining system according to
the applications adapted. These applications are as
follows:
 Finance
 Telecommunications
 DNA
 Stock Markets
 E-mail

Architecture of Data Mining
 A typical data mining system may have the following major components

Architecture of Data Mining
Knowledge Base:
 This is the domain knowledge that is used to guide the search
or evaluate the interestingness of resulting patterns.
 Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different
levels of abstraction.
 Knowledge such as user beliefs, which can be used to assess a
pattern’s interestingness based on its unexpectedness, may
also be included.
 Other examples of domain knowledge are additional
interestingness constraints or thresholds, and metadata (e.g.,
describing data from multiple heterogeneous sources).

Architecture of Data Mining
Data Mining Engine:
 This is essential to the data mining system and ideally consists
of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.
Pattern Evaluation Module:
 This component typically employs interestingness measures
and interacts with the data mining modules so as to focus the
search toward interesting patterns.
 It may use interestingness thresholds to filter out discovered
patterns. Alternatively, the pattern evaluation module may be
integrated with the mining module, depending on the
implementation of the data mining method used.
Architecture of Data Mining
User interface:
 This module communicates between users and the data
mining system, allowing the user to interact with the system
by specifying a data mining query or task, providing
information to help focus the search, and performing
exploratory data mining based on the intermediate data
mining results.
 In addition, this component allows the user to browse
database and data warehouse schemas or data structures,
evaluate mined patterns, and visualize the patterns in different
forms.
