Basic Concepts Data Mining (Lecture 02) - 1

Data mining

Uploaded by

Muhammad Hammad
Introduction to Data Mining

Introduction
• Why Data Mining?
• What Is Data Mining?
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technologies Are Used?
• What Kinds of Applications Are Targeted?
• Summary

Why Data Mining?
• Data vs. information:
  • Data: recorded facts
  • Information: patterns underlying the data
• The explosive growth of data:
  • Data collection and data availability
    • Automated data collection tools, database systems, the Web
  • Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: bioinformatics, …
    • Society and everyone: news, digital cameras, YouTube


• We are drowning in data, but starving for knowledge!
• We are data rich, but information poor.

What is Data Mining?
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.
• Data mining: searching for knowledge (interesting patterns) in your data.

• Data mining (knowledge discovery from data)
  • Refers to extracting or “mining” knowledge from large amounts of data.
  • Extraction of interesting (implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
• Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Knowledge Discovery Process
• Data mining can be viewed as simply an essential step in the process of knowledge discovery.
• This is the view typical of the database systems and data warehousing communities.
• Data mining plays an essential role in the knowledge discovery process.

• The knowledge discovery process consists of the following steps:
  1. Data cleaning (to remove noise and inconsistent data)
  2. Data integration (where multiple data sources may be combined)
  3. Data selection (where data relevant to the analysis task are retrieved from the database)
  4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
  5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
  6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on interestingness measures)
  7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
• Steps 1 through 4 are different forms of data preprocessing.
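
The steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the field names, the function names, and the 0.7 support threshold are all assumptions made for the example, and the "mining" step is reduced to counting how often each item appears.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "item": "milk", "price": 2.0},
    {"customer": "c1", "item": "bread", "price": 1.5},
    {"customer": "c2", "item": "milk", "price": 2.0},
    {"customer": "c2", "item": "bread", "price": None},  # incomplete record
]
store_b = [
    {"customer": "c3", "item": "milk", "price": 2.1},
    {"customer": "c3", "item": "bread", "price": 1.4},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: aggregate items into one basket per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Data mining: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support):
    """Pattern evaluation: keep items whose support meets the threshold."""
    supports = {item: c / n_baskets for item, c in counts.items()}
    return {item: s for item, s in supports.items() if s >= min_support}

baskets = transform(select(integrate(clean(store_a), clean(store_b)),
                           ["customer", "item"]))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.7)

# Knowledge presentation: print the surviving patterns.
for item, support in sorted(patterns.items()):
    print(f"{item}: support={support:.2f}")
```

In this toy data, the incomplete bread record for customer c2 is dropped during cleaning, so bread appears in only two of three baskets and falls below the threshold.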

Evolution of Database Technology
• 1960s:
  • Data collection, database creation, IMS and network DBMS
• 1970s:
  • Relational data model, relational DBMS implementation
• 1980s:
  • RDBMS, advanced data models (extended-relational, OO, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
  • Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global information systems
  • Modern GIS applications include address matching, location analysis or site selection, development of evacuation plans, weather forecasting, environmental study, and natural hazards study.

What Kind of Data Can be Mined?
• Database-oriented data sets and applications
  • Relational databases, data warehouses, transactional databases
• Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data (data that vary over time), sequence data (including bio-sequences such as DNA sequences)
  • Structured data, graphs, social networks
  • Heterogeneous databases and legacy databases
  • Spatial (geographic) data and spatiotemporal data
  • Multimedia databases
  • Text databases
  • The World Wide Web

Database Data
• Database data: a database management system (DBMS) consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
• The software programs provide mechanisms
  • for defining database structures and data storage;
  • for specifying and managing shared or distributed data access;
  • for ensuring consistency and security of the information stored despite system crashes or attempts at unauthorized access.


• An example AllElectronics relational database

Data Warehouse
• Data warehouse: a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
• Data in a data warehouse are organized around major subjects (e.g., customer, item, supplier, and activity).
• The data are stored to provide information from a historical perspective, such as in the past 6 to 12 months, and are typically summarized.

• Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.

Transactional Data
• Transactional data: each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page.
  • A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction, such as the items purchased in the transaction.
• Transactions can be stored in a table, with one record per transaction.
  • Because most relational database systems do not support nested relational structures, the transactional database is usually either stored in a flat file or unfolded into a standard relation.
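
As a rough illustration of why nesting matters, the sketch below holds two transactions with nested item lists and writes them out in two non-nested layouts: one text line per transaction (a flat file) and one row per purchased item (an unfolded relation). The transaction IDs, item IDs, tab-separated line format, and function names are all assumptions made for this example.

```python
# Hypothetical transactions: one record per transaction, with a nested item list.
transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I2", "I8"],
}

def to_flat_file_lines(transactions):
    """Flat-file layout: one text line per transaction (trans ID, then its items)."""
    return [f"{tid}\t{','.join(items)}" for tid, items in transactions.items()]

def unfold(transactions):
    """Unfolded layout: one (trans_ID, item_ID) row per purchased item, no nesting."""
    return [(tid, item) for tid, items in transactions.items() for item in items]

for line in to_flat_file_lines(transactions):
    print(line)
for row in unfold(transactions):
    print(row)
```

The unfolded layout fits an ordinary relational table because every row has the same two atomic columns, at the cost of repeating the transaction ID once per item.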


Fragment of a transactional database for sales at AllElectronics.

What Kind of Patterns Can Be Mined?
• Data mining functionalities:
  • Characterization and discrimination
  • Mining of frequent patterns, associations, and correlations
  • Classification and regression
  • Clustering analysis
  • Outlier analysis
• Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.

Concept/Class Description
• Characterization: summarization of the general characteristics or features of a target class of data.
• The output of data characterization can be presented in various forms.
  • E.g., pie charts, bar charts, curves, multidimensional data cubes, etc.
• Example:
  • A customer relationship manager at AllElectronics may order the following data mining task: summarize the characteristics of customers who spend more than $5000 a year at AllElectronics. The result is a general profile of these customers, such as that they are 40 to 50 years old, employed, and have excellent credit ratings.
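
A characterization task like the one in the example could be sketched as a filter followed by a summary. The customer records, field names, and the particular summary statistics below are invented for illustration; real characterization methods are more general than this.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records (all values invented for illustration).
customers = [
    {"age": 45, "employed": True,  "credit": "excellent", "yearly_spend": 7200},
    {"age": 48, "employed": True,  "credit": "excellent", "yearly_spend": 5600},
    {"age": 23, "employed": False, "credit": "fair",      "yearly_spend": 900},
    {"age": 41, "employed": True,  "credit": "good",      "yearly_spend": 6100},
    {"age": 67, "employed": False, "credit": "good",      "yearly_spend": 1500},
]

def characterize(records, threshold):
    """Summarize the target class: customers spending more than `threshold` a year."""
    target = [r for r in records if r["yearly_spend"] > threshold]
    if not target:
        return {"count": 0}
    ages = [r["age"] for r in target]
    return {
        "count": len(target),
        "age_range": (min(ages), max(ages)),
        "mean_age": mean(ages),
        "employed_share": sum(r["employed"] for r in target) / len(target),
        "most_common_credit": Counter(r["credit"] for r in target).most_common(1)[0][0],
    }

profile = characterize(customers, threshold=5000)
print(profile)
```

The resulting dictionary plays the role of the "general profile" in the example: an age range, the share of employed customers, and the most common credit rating among the big spenders.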

• Discrimination: comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
• The forms of output presentation are similar to those for characteristic descriptions.
• Example:
  • A customer relationship manager at AllElectronics may want to compare two groups of customers: those who shop for computer products regularly (e.g., more than twice a month) and those who rarely shop for such products (e.g., less than three times a year). The resulting description provides a general comparative profile of these customers, such as that 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education, whereas 60% of the customers who infrequently buy such products are either seniors or youths, and have no university degree.
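
Discrimination can be sketched as computing the same feature shares for a target class and a contrasting class, then placing them side by side. The records, field names, and the two features compared below are assumptions for the example; the group boundaries follow the example's "more than twice a month" and "less than three times a year".

```python
# Hypothetical customers: yearly purchases of computer products, age, and degree.
customers = [
    {"purchases_per_year": 30, "age": 25, "degree": True},
    {"purchases_per_year": 40, "age": 35, "degree": True},
    {"purchases_per_year": 26, "age": 52, "degree": False},
    {"purchases_per_year": 1,  "age": 70, "degree": False},
    {"purchases_per_year": 2,  "age": 17, "degree": False},
    {"purchases_per_year": 0,  "age": 33, "degree": True},
]

def share(records, predicate):
    """Fraction of records that satisfy the predicate."""
    return sum(1 for r in records if predicate(r)) / len(records)

# Target class: more than twice a month. Contrasting class: under three times a year.
frequent = [r for r in customers if r["purchases_per_year"] > 24]
rare = [r for r in customers if r["purchases_per_year"] < 3]

features = {
    "aged 20 to 40": lambda r: 20 <= r["age"] <= 40,
    "has a degree": lambda r: r["degree"],
}

# For each feature: (share among frequent buyers, share among rare buyers).
comparison = {name: (share(frequent, p), share(rare, p)) for name, p in features.items()}

for name, (freq_share, rare_share) in comparison.items():
    print(f"{name}: frequent={freq_share:.0%}, rare={rare_share:.0%}")
```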

Frequent Patterns, Association and Correlation Analysis

• Mining frequent patterns: frequent patterns are patterns that occur frequently in data.
• Kinds of frequent patterns:
  • Frequent itemsets: sets of items that frequently appear together in a transactional data set, such as milk and bread.
  • Frequent sequential patterns: for example, the pattern that customers tend to purchase first a PC, followed by a scanner, and then a printer.
• Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
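
To make "frequently appear together" concrete, the sketch below counts every pair of items across a handful of baskets and keeps the pairs whose support reaches a threshold. It is a brute-force enumeration rather than an efficient algorithm such as Apriori, and the baskets, the function name, and the 0.4 threshold are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size whose support is at least min_support."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}

pairs = frequent_itemsets(transactions, size=2, min_support=0.4)
for itemset, support in sorted(pairs.items()):
    print(itemset, f"support={support:.0%}")
```

Here milk and bread occur together in three of the five baskets, so that pair is reported along with bread and butter.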


• An example of an association rule:
  [rule shown as an image in the original slide]
  where X is a variable representing a customer.
• This association rule involves a single attribute or predicate (i.e., buys) that repeats. Rules of this kind are referred to as single-dimensional association rules.
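
A single-dimensional rule can be illustrated with a rule of the form buys(X, a) ⇒ buys(X, b), scored by its support and confidence. The items "computer" and "software", the purchase data, and the function name are hypothetical choices for this sketch and are not taken from the slide.

```python
# Hypothetical purchase histories: customer -> set of items bought.
purchases = {
    "c1": {"computer", "software"},
    "c2": {"computer"},
    "c3": {"computer", "software", "printer"},
    "c4": {"printer"},
    "c5": {"software"},
}

def rule_metrics(purchases, antecedent, consequent):
    """Support and confidence of buys(X, antecedent) => buys(X, consequent)."""
    n = len(purchases)
    ante = sum(1 for items in purchases.values() if antecedent in items)
    both = sum(1 for items in purchases.values()
               if antecedent in items and consequent in items)
    support = both / n                            # share of all customers buying both
    confidence = both / ante if ante else 0.0     # share of antecedent buyers who also buy consequent
    return support, confidence

support, confidence = rule_metrics(purchases, "computer", "software")
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

Two of the five customers buy both items, and two of the three computer buyers also buy software, so the rule has 40% support and about 67% confidence in this toy data.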


• We may find association rules like:
  [rule shown as an image in the original slide]
• This is an association between more than one attribute (i.e., age, income, and buys).
• This is a multidimensional association rule.
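
A multidimensional rule can be scored the same way once the antecedent and consequent are written as conditions over several attributes. In the sketch below, the age and income ranges, the item "laptop", and the customer records are all hypothetical values chosen for illustration.

```python
# Hypothetical customer records with three attributes: age, income, buys.
customers = [
    {"age": 24, "income": 45000, "buys": "laptop"},
    {"age": 27, "income": 42000, "buys": "laptop"},
    {"age": 29, "income": 48000, "buys": "phone"},
    {"age": 51, "income": 45000, "buys": "laptop"},
    {"age": 22, "income": 20000, "buys": "phone"},
]

def rule_metrics(records, antecedent, consequent):
    """Support and confidence of a rule whose two sides are arbitrary predicates."""
    matches_ante = [r for r in records if antecedent(r)]
    matches_both = [r for r in matches_ante if consequent(r)]
    support = len(matches_both) / len(records)
    confidence = len(matches_both) / len(matches_ante) if matches_ante else 0.0
    return support, confidence

# Rule sketch: age in 20..29 AND income in 40K..49K  =>  buys a laptop.
support, confidence = rule_metrics(
    customers,
    antecedent=lambda r: 20 <= r["age"] <= 29 and 40000 <= r["income"] <= 49000,
    consequent=lambda r: r["buys"] == "laptop",
)
print(f"support={support:.0%}, confidence={confidence:.0%}")
```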

Classification
• Classification and label prediction
  • Construct models (functions) based on some training examples
  • Describe and distinguish classes or concepts for future prediction
    • E.g., classify countries based on climate, or classify cars based on gas mileage
  • Predict some unknown class labels
• Typical methods
  • Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
• Typical applications
  • Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …


A classification model can be represented in various forms: (a) IF-THEN rules, (b) a decision tree, or (c) a neural network.
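
As a small illustration of building a model from training examples, the sketch below learns a one-level decision tree (a "decision stump") on a single feature and then applies it as an IF-THEN rule. The mileage values, the labels "low" and "high", and the assumption that higher mileage means "high" are all invented for this example.

```python
# Hypothetical training examples: (gas mileage in mpg, class label).
training = [(12, "low"), (15, "low"), (18, "low"),
            (27, "high"), (31, "high"), (35, "high")]

def learn_stump(examples):
    """Pick the threshold on the single feature that misclassifies the fewest examples."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds: midpoints between consecutive feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(("high" if x > threshold else "low") != label
                   for x, label in examples)

    return min(candidates, key=errors)

threshold = learn_stump(training)

def classify(mpg):
    """The learned model as a rule: IF mileage > threshold THEN 'high' ELSE 'low'."""
    return "high" if mpg > threshold else "low"

print(threshold, classify(30), classify(14))
```

The same learned threshold can be read either as a tree with one split or as a pair of IF-THEN rules, which is the sense in which one model has several representations.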

Clustering
• Unsupervised learning (i.e., the class label is unknown)
• Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
• Principle: maximize intra-class similarity and minimize inter-class similarity
• Many methods and applications
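
One of the many clustering methods is k-means, sketched below in plain Python. The slides do not name a specific method, so the choice of k-means is an assumption for illustration, as are the points, the starting centroids, and the fixed iteration count.

```python
import math

# Hypothetical 2-D points (e.g., house locations) forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster's points; keep the old one if it is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Start from one point in each visible group (an arbitrary choice for this sketch).
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Points end up grouped with their nearest centroid, so members of a cluster are close to one another and far from the other cluster, which is the similarity principle stated above.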


A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters.

• The output takes the form of a diagram that shows how the instances fall into clusters.
• Different cases:
  • Simple 2-D representation: associates a cluster number with each instance
  • Venn diagram: allows one instance to belong to more than one cluster
  • Probabilistic assignment: associates instances with clusters probabilistically
  • Dendrogram: produces a hierarchical structure of clusters (dendron is the Greek word for tree)
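
The four output forms can be pictured as four different data structures over the same instances. The instance names, cluster numbers, probabilities, and helper functions below are made up for illustration only.

```python
# Simple representation: exactly one cluster number per instance.
hard = {"a": 1, "b": 1, "c": 2, "d": 3}

# Overlapping (Venn-style) representation: an instance may sit in several clusters.
overlapping = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}

# Probabilistic assignment: a membership probability per cluster for each instance.
probabilistic = {
    "a": {1: 0.7, 2: 0.2, 3: 0.1},
    "b": {1: 0.5, 2: 0.4, 3: 0.1},
    "c": {1: 0.1, 2: 0.8, 3: 0.1},
    "d": {1: 0.1, 2: 0.1, 3: 0.8},
}

# Dendrogram: a hierarchy of clusters, written here as nested tuples.
dendrogram = ((("a", "b"), "c"), "d")

def to_hard(probabilistic):
    """Collapse probabilistic memberships to the single most likely cluster."""
    return {inst: max(probs, key=probs.get) for inst, probs in probabilistic.items()}

def leaves(tree):
    """List the instances under a dendrogram node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

print(to_hard(probabilistic))
print(leaves(dendrogram))
```

Instance "b" shows the difference between the forms: it has one label in the simple representation, belongs to two clusters in the overlapping one, and splits its membership 0.5 to 0.4 in the probabilistic one.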

Outlier Analysis
• Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • Noise or exception?
  • Methods: clustering or regression analysis, …
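
A regression-based approach might be sketched as follows: fit a straight line by least squares, then flag points whose residual is unusually large. The data, the function names, and the cutoff of two standard deviations are assumptions for this example, not a prescribed method.

```python
from statistics import mean, pstdev

# Hypothetical (x, y) observations; the point at x=5 departs from the linear trend.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0), (5, 25.0), (6, 12.1), (7, 13.8)]

def fit_line(data):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx = mean(x for x, _ in data)
    my = mean(y for _, y in data)
    b = (sum((x - mx) * (y - my) for x, y in data)
         / sum((x - mx) ** 2 for x, _ in data))
    return my - b * mx, b

def residual_outliers(data, k=2.0):
    """Flag points whose residual exceeds k standard deviations of all residuals."""
    a, b = fit_line(data)
    residuals = [y - (a + b * x) for x, y in data]
    spread = pstdev(residuals)
    return [point for point, r in zip(data, residuals) if abs(r) > k * spread]

print(residual_outliers(data))
```

Whether a flagged point is noise to discard or an exception worth studying is a judgment the code cannot make; it only identifies the candidates.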

Are All Patterns Interesting?
• Data mining may generate thousands of patterns; not all of them are interesting.
• What makes a pattern interesting?
  • Easily understood by humans
  • Valid on new or test data
  • Novel, potentially useful
  • Validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
  • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  • Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, etc.

What Technologies Are Used?


What Kinds of Applications Are Targeted?
• Web page analysis: from web page classification and clustering to the PageRank and HITS algorithms
• Recommender systems
• Basket data analysis
• Biological and medical data analysis: classification, cluster analysis, biological sequence analysis, biological network analysis

Summary
• Data mining: discovering interesting patterns and knowledge from massive amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
