Datamining 1

Uploaded by

castiron1998

Introduction

DATA MINING

Why Data Mining?
Necessity, who is the mother of invention. – Plato

• We are drowning in data, but starving for knowledge!

• The explosive growth of data: from terabytes to petabytes

• Major sources of abundant data
  - Business: Web, e-commerce, transactions, stocks, …
  - Science: remote sensing, bioinformatics, scientific simulation, …
  - Society and everyone: news, digital cameras, YouTube

Why Data Mining?
• Data mining turns a large collection of data into knowledge
  - A search engine (e.g., Google) receives hundreds of millions of queries every day
  - Each query can be viewed as a transaction in which the user describes her or his information need
  - Some patterns found in user search queries can disclose invaluable knowledge that cannot be obtained by reading individual data items alone

Data Mining

Searching for knowledge (interesting patterns) in data.

What Is Data Mining?

• Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
• Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
  - Simple search and query processing
  - (Deductive) expert systems

Data Mining Applications

Data Mining for Financial Data Analysis

• Design and construction of data warehouses
• Loan payment prediction and customer credit policy analysis
• Classification and clustering of customers for targeted marketing
• Detection of money laundering and other financial crimes

Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process

[Figure: KDD process flow, from bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Knowledge Discovery (KDD) Process
• Data cleaning (to remove noise and inconsistent data)
• Data integration (where multiple data sources may be combined)
• Data selection (where data relevant to the analysis task are retrieved from the database)
• Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
• Data mining (an essential process where intelligent methods are applied to extract data patterns)
• Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
• Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
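The seven steps can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the two record lists, the field names, and the "share of revenue" pattern are all hypothetical, and each function stands in for what would be a much larger component in a real system.

```python
from collections import defaultdict

# Hypothetical raw sales records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "item": "milk", "amount": 2.5},
    {"customer": "c2", "item": "bread", "amount": None},
]
store_b = [
    {"customer": "c3", "item": "milk", "amount": 3.0},
    {"customer": "c1", "item": "bread", "amount": 1.5},
]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: aggregate the total amount per item."""
    totals = defaultdict(float)
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Data mining (toy): each item's share of total revenue."""
    grand_total = sum(totals.values())
    return {item: amount / grand_total for item, amount in totals.items()}

def evaluate(patterns, min_share):
    """Pattern evaluation: keep patterns that pass an interestingness threshold."""
    return {item: s for item, s in patterns.items() if s >= min_share}

def present(patterns):
    """Knowledge presentation: render the surviving patterns as text."""
    return [f"{item}: {share:.0%} of revenue" for item, share in sorted(patterns.items())]

def run_pipeline(min_share=0.5):
    data = integrate(clean(store_a), clean(store_b))
    data = select(data, ["item", "amount"])
    return present(evaluate(mine(transform(data)), min_share))

report = run_pipeline()  # ["milk: 79% of revenue"]
```

The threshold `min_share` plays the role of an interestingness measure here; real systems use measures such as support and confidence.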

Data Warehouses
• A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
• It is usually modeled by a multidimensional data structure, called a data cube
  - In a data cube, each dimension corresponds to an attribute or a set of attributes in the schema
  - Each cell stores the value of some aggregate measure, such as a count
• A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data
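A minimal sketch of the idea, assuming three hypothetical dimensions (item type, city, quarter) and count as the aggregate measure. Every cell, including the rolled-up ones, is precomputed so that a summary lookup is a single dictionary access.

```python
from collections import Counter
from itertools import product

# Hypothetical sales facts: (item_type, city, quarter)
facts = [
    ("computer", "Chicago", "Q1"),
    ("computer", "Chicago", "Q2"),
    ("phone", "Chicago", "Q1"),
    ("computer", "Toronto", "Q1"),
]

ALL = "*"  # marker meaning "aggregated over this dimension"

def build_cube(rows):
    """Precompute the count for every combination of dimension values and ALL."""
    cube = Counter()
    for row in rows:
        # Each dimension either keeps its value or is rolled up to ALL.
        for cell in product(*[(value, ALL) for value in row]):
            cube[cell] += 1
    return cube

cube = build_cube(facts)

computers_total = cube[("computer", ALL, ALL)]   # 3
chicago_q1 = cube[(ALL, "Chicago", "Q1")]        # 2
grand_total = cube[(ALL, ALL, ALL)]              # 4
```

Precomputing all 2^d roll-ups per fact is practical only for a handful of dimensions; it is used here just to make the "pre-computation and fast access" point concrete.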
Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web

Data Mining Functionalities
• Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks
• In general, such tasks can be classified into two categories:
  - Descriptive: characterizes properties of the data in a target data set
  - Predictive: performs induction on the current data in order to make predictions

Generalization

• Information integration and data warehouse construction
  - Data cleaning, transformation, integration, and multidimensional data model
• Multidimensional concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics

Example: Data Characterization
• A customer relationship manager at “ABCElectronics” may order the following data mining task: Summarize the characteristics of customers who spend more than $5000 a year at “ABCElectronics”.
• The result is a general profile of these customers, such as that they are 40 to 50 years old, employed, and have excellent credit ratings.
• The data mining system should allow the customer relationship manager to drill down on any dimension, such as on occupation, to view these customers according to their type of employment
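The task above can be sketched with a few lines of summarization code. The customer records, field names, and the choice of summary statistics are hypothetical; a real system would run this against a warehouse rather than an in-memory list.

```python
from collections import Counter

# Hypothetical customer records.
customers = [
    {"age": 42, "occupation": "engineer", "credit": "excellent", "annual_spend": 6200},
    {"age": 47, "occupation": "teacher", "credit": "excellent", "annual_spend": 5400},
    {"age": 45, "occupation": "engineer", "credit": "good", "annual_spend": 7100},
    {"age": 23, "occupation": "student", "credit": "fair", "annual_spend": 800},
]

def target_class(rows, min_spend):
    """Select the customers to be characterized."""
    return [r for r in rows if r["annual_spend"] > min_spend]

def characterize(rows, min_spend):
    """Summarize the target class as a small general profile."""
    target = target_class(rows, min_spend)
    if not target:
        return {"count": 0}
    ages = [r["age"] for r in target]
    credit_counts = Counter(r["credit"] for r in target)
    return {
        "count": len(target),
        "age_range": (min(ages), max(ages)),
        "most_common_credit": credit_counts.most_common(1)[0][0],
    }

def drill_down(rows, min_spend, dimension):
    """Break the target class down by one dimension, e.g. occupation."""
    return Counter(r[dimension] for r in target_class(rows, min_spend))

profile = characterize(customers, 5000)
by_occupation = drill_down(customers, 5000, "occupation")
```

Here `profile` reports three customers aged 42 to 47 with mostly excellent credit, and `by_occupation` splits them into two engineers and one teacher.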

Example: Data Discrimination
• A customer relationship manager at “ABCElectronics” may want to compare two groups of customers: those who shop for computer products regularly (e.g., more than twice a month) and those who rarely shop for such products (e.g., less than three times a year)
• The resulting description provides a general comparative profile of these customers, such as that 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education, whereas 60% of the customers who infrequently buy such products are either seniors or youths, and have no university degree.
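A comparison of this kind amounts to computing the same statistic over a target class and a contrasting class. The sketch below assumes hypothetical records with a yearly purchase count; the thresholds follow the slide (more than twice a month is taken as more than 24 a year).

```python
# Hypothetical customer records.
customers = [
    {"purchases_per_year": 30, "age": 25, "university": True},
    {"purchases_per_year": 26, "age": 34, "university": True},
    {"purchases_per_year": 40, "age": 52, "university": False},
    {"purchases_per_year": 1, "age": 68, "university": False},
    {"purchases_per_year": 2, "age": 17, "university": False},
    {"purchases_per_year": 0, "age": 45, "university": True},
]

def share(rows, predicate):
    """Fraction of rows that satisfy predicate (0.0 for an empty group)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if predicate(r)) / len(rows)

def young_graduate(c):
    """Between 20 and 40 years old with a university education."""
    return 20 <= c["age"] <= 40 and c["university"]

# Target class and contrasting class.
frequent = [c for c in customers if c["purchases_per_year"] > 24]
infrequent = [c for c in customers if c["purchases_per_year"] < 3]

comparison = {
    "frequent": share(frequent, young_graduate),
    "infrequent": share(infrequent, young_graduate),
}
```

With this toy data, two of the three frequent shoppers are young graduates and none of the infrequent shoppers are.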

Mining Frequent Patterns, Association and Correlation Analysis
• Frequent patterns (or frequent item sets): patterns that occur frequently in data.
• A frequent item set typically refers to a set of items that often appear together in a transactional data set
  - For example, milk and bread, which are frequently bought together in grocery stores by many customers
  - What items are frequently purchased together in your Walmart?
  - A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern
• Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
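As a rough illustration of what "frequently bought together" means in practice, the sketch below counts every pair of items per basket and keeps the pairs that reach a minimum count. The baskets and the `min_count` threshold are hypothetical, and this brute-force pair count is a simplification, not a full frequent-itemset algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical grocery transactions (one set of items per basket).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def frequent_pairs(baskets, min_count):
    """Return item pairs that occur together in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

pairs = frequent_pairs(transactions, min_count=2)
# {("bread", "milk"): 3, ("bread", "butter"): 2}
```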
Association and Correlation Analysis

• Suppose that, as a marketing manager at “ABCElectronics”, you want to know which items are frequently purchased together
• An example of such a rule:
  buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
• A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well
• A 1% support means that 1% of all the transactions under analysis show that computer and software are purchased together
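Both measures are simple ratios over the transaction set, which the following sketch makes explicit. The four baskets are hypothetical and were chosen so the confidence matches the 50% in the rule above; the support here is 25% rather than 1% because the toy data set is tiny.

```python
# Hypothetical transactions.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software"},
]

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

rule_support = support(baskets, {"computer", "software"})          # 0.25
rule_confidence = confidence(baskets, {"computer"}, {"software"})  # 0.5
```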

Question
• A data mining system may find association rules as follows:
  age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support = 2%, confidence = 60%]
• What does the above association rule indicate?

Answer
• The rule indicates that, of all the customers under study, 2% are 20 to 29 years old with an income of $40,000 to $49,000 and have purchased a laptop (computer)
• There is a 60% probability that a customer in this age and income group will purchase a laptop.

Classification
• Classification and label prediction
  - Construct models (functions) based on some training examples
  - Describe and distinguish classes or concepts for future prediction
    - E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  - Predict some unknown class labels
• Typical methods
  - Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
• Typical applications: Credit card fraud detection, direct
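To make "construct a model from training examples, then predict unknown labels" concrete, here is a sketch of a decision stump, i.e. a decision tree with a single split, applied to the gas-mileage example. The training data and the labels "low" and "high" are hypothetical, and the function assumes exactly two classes and at least two distinct mileage values.

```python
# Hypothetical training examples: (miles per gallon, efficiency class).
training = [
    (15, "low"), (18, "low"), (22, "low"),
    (30, "high"), (35, "high"), (40, "high"),
]

def train_stump(examples):
    """Find the threshold and label assignment with the fewest training errors."""
    values = sorted({x for x, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    first, second = sorted({y for _, y in examples})
    best = None
    for threshold in candidates:
        # Try both ways of assigning the two labels to the two sides.
        for below, above in ((first, second), (second, first)):
            errors = sum(
                (below if x <= threshold else above) != y for x, y in examples
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return threshold, below, above

def predict(model, x):
    """Apply a trained stump to a new value."""
    threshold, below, above = model
    return below if x <= threshold else above

model = train_stump(training)
label_for_25 = predict(model, 25)  # "low"
label_for_33 = predict(model, 33)  # "high"
```

On this data the stump splits at 26 miles per gallon and classifies every training example correctly.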
Some Classification Tools

Classification and Regression

• Suppose as a sales manager you want to classify a large set of items in the store, based on three kinds of responses to a sales campaign: good response, mild response, and no response.
• You want to derive a model for each of these three classes based on the descriptive features of the items, such as price, brand, place made, type, and category
• Suppose instead that, rather than predicting categorical response labels for each store item, you would like to predict the amount of revenue that each item will generate during an upcoming sale, based on the previous sales data
• This is an example of regression
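The difference from classification is that the predicted value is a number rather than a label. A minimal sketch, assuming a single hypothetical feature (revenue in the previous sale) and ordinary least squares for a straight line; the figures are invented for illustration.

```python
# Hypothetical per-item data: revenue in the previous sale and in the following one.
previous_revenue = [100.0, 200.0, 300.0, 400.0]
next_revenue = [120.0, 210.0, 330.0, 420.0]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(previous_revenue, next_revenue)

def predict_revenue(x):
    """Predicted revenue for an item whose previous revenue was x."""
    return slope * x + intercept

forecast = predict_revenue(500.0)  # approximately 525
```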

Cluster Analysis
• Unsupervised learning (i.e., class label is unknown)
• Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
• Principle: maximizing intra-class similarity and minimizing interclass similarity
• Many methods and applications
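One of the many methods is k-means, sketched below on hypothetical house coordinates. It is chosen here only as a familiar example of the principle; the initial centroids are simply the first k points so that the run is deterministic, which is a simplification of the usual random initialization.

```python
# Hypothetical house locations as (x, y) coordinates.
houses = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, iterations=10):
    """Plain k-means with the first k points as initial centroids."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(houses, k=2)
```

With this data the houses near (1, 1) end up in one cluster and those near (9, 9) in the other, so similarity is high within each cluster and low between them.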

Outlier Analysis

• Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - Noise or exception? One person’s garbage could be another person’s treasure
  - Methods: by-product of clustering or regression analysis, …
  - Useful in fraud detection, rare events analysis
• Example: Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of unusually large amounts for a given account number in comparison to regular charges incurred by the same account.
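The credit card example can be sketched with a very simple per-account rule: flag any charge that is many times larger than that account's median charge. The account data and the factor of 5 are hypothetical, and this heuristic stands in for the statistical or clustering-based methods a real system would use.

```python
from statistics import median

# Hypothetical charge histories per account.
charges_by_account = {
    "acct-1": [20.0, 35.0, 25.0, 30.0, 900.0],
    "acct-2": [400.0, 450.0, 500.0],
}

def unusual_charges(charges, factor=5.0):
    """Charges more than `factor` times the account's own median charge."""
    typical = median(charges)
    return [c for c in charges if c > factor * typical]

# Keep only the accounts that have at least one flagged charge.
flagged = {}
for account, charges in charges_by_account.items():
    suspicious = unusual_charges(charges)
    if suspicious:
        flagged[account] = suspicious
```

Only the 900.0 charge on `acct-1` is flagged; the charges on `acct-2` are large in absolute terms but regular for that account, which is the point of comparing against the same account's history.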

Technologies Used
• Statistics
  - Data mining has an inherent connection with statistics.
  - Statistics studies the collection, analysis, interpretation or explanation, and presentation of data
  - Statistical models are widely used to model data and data classes

Technologies Used

• Machine Learning
  - Machine learning investigates how computers can learn (or improve their performance) based on data
  - For example, a typical machine learning problem is to program a computer so that it can automatically recognize handwritten postal codes on mail after learning from a set of examples

Technologies Used

• Information Retrieval
  - Information retrieval is the science of searching for documents or information in documents
  - Documents can be text or multimedia, and may reside on the Web

Major Issues
• Mining various and new kinds of knowledge

• Mining knowledge in multidimensional space

• Data mining: an interdisciplinary effort

• Handling uncertainty, noise, or incompleteness of data

• Pattern evaluation and pattern- or constraint-guided mining
