DM 1

This document provides an introduction to data mining for business decision making. It discusses why data mining is important due to the massive growth of data from various sources. It defines data mining as the process of discovering patterns and predictions from large amounts of data. The document outlines the typical steps involved in knowledge discovery from data including data selection, cleaning, mining, evaluation and interpretation. It also discusses how data mining fits within business intelligence applications and helps transform data into knowledge.
Data Mining for Business Decision

Prof. (Dr.) T. Muthukumar


M.Sc; M.C.A; M.B.A; M.Phil; Ph.D.
Professor – Business Analytics & Associate Dean (Academic)
XIME - Bangalore
Agenda
• Introduction, Data and Pre-Processing
• Prediction models
• Descriptive models
• Other methods
Recommended Reference Books
Introduction of Data Mining
• Why Data Mining?

• What Is Data Mining?


• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technologies Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• Large number of records (cases) (10^8 to 10^12 bytes)
– One thousand (10^3) bytes = 1 kilobyte (KB)
– One million (10^6) bytes = 1 megabyte (MB)
– One billion (10^9) bytes = 1 gigabyte (GB)
– One trillion (10^12) bytes = 1 terabyte (TB)
• High-dimensional data (variables)
– 10 to 10^4 attributes
• Only a small portion, typically 5% to 10%, of the collected data is ever analyzed.
Scientific Viewpoint
• Data collected and stored at enormous speeds
(Gbyte/hour)
– remote sensor on a satellite
– telescope scanning the skies
– scientific simulations generating terabytes of data
• Classical modeling techniques are infeasible
• Data reduction
• Cataloging, classifying, segmenting data
• Helps scientists in Hypothesis Formation
Evolution of Sciences: New Data Science Era
• Before 1600: Empirical science
• 1600-1950s: Theoretical science
– Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
• 1950s-1990s: Computational science
– Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
– Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
• 1990-now: Data science
– The flood of data from new scientific instruments and simulations
– The ability to economically store and manage petabytes of data online
– The Internet and computing Grid that makes all these archives universally accessible
– Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes
– Data mining is a major new challenge!
Why Data Mining?
• We are drowning in data, but starving for
knowledge!
• “Necessity is the mother of invention”—Data
mining—Automated analysis of massive data sets
Data pyramid (top to bottom)
• Wisdom = Knowledge + experience
• Knowledge = Information + rules
• Information = Data + context
• Data
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”? The following are not:
– Simple search and query processing
– (Deductive) expert systems
What Is Data Mining?
• The automated extraction of predictive information from large databases
– Automated
– Extraction
– Predictive
– In large databases
Business Intelligence
“A broad category of applications and
technologies for gathering, storing,
analyzing, sharing and providing access to
data to help enterprise users make better
business decisions.”
– Gartner
Relationships and Acronyms...
What Does Data Mining Do?
• Explores your data
• Finds patterns
• Performs predictions
DM and BI
• BI is geared at an end user, such as a business owner, knowledge worker etc.
• DM is an IT technology generally geared towards a more advanced user – today
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Knowledge Discovery Process
[Figure: Raw Data → Integration → Data Warehouse → Selection & Cleaning → Target Data → Transformation → Transformed Data → Data Mining → Patterns and Rules → Interpretation & Evaluation → Knowledge → Understanding]
Data Mining in Business Intelligence
Layers from top to bottom; the potential to support business decisions increases toward the top (typical role in parentheses):
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
KDD Process: A Typical View from ML and Statistics
Input Data → Data Pre-Processing → Data Mining → Post-Processing
• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
• This is a view from typical machine learning and statistics communities
The Evolution of Data Analysis
• Data Collection (1960s)
– Business question: "What was my total revenue in the last five years?"
– Enabling technologies: computers, tapes, disks
– Product providers: IBM, CDC
– Characteristics: retrospective, static data delivery
• Data Access (1980s)
– Business question: "What were unit sales in New England last March?"
– Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
– Product providers: Oracle, Sybase, Informix, IBM, Microsoft
– Characteristics: retrospective, dynamic data delivery at record level
• Data Warehousing & Decision Support (1990s)
– Business question: "What were unit sales in New England last March? Drill down to Boston."
– Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
– Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
– Characteristics: retrospective, dynamic data delivery at multiple levels
• Data Mining (Emerging Today)
– Business question: "What’s likely to happen to Boston unit sales next month? Why?"
– Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
– Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
– Characteristics: prospective, proactive information delivery
When Is DM Useful?
• Data rich world
• Large data (dimensionality and size)
– Image data (size)
– Gene chip data (dimensionality)
• Little knowledge about data (exploratory data analysis)
– What if we have some knowledge?
Data Mining Versus Statistical Analysis
• Data Mining
– Originally developed to act as expert systems to solve problems
– Less interested in the mechanics of the technique
– If it makes sense then let’s use it
– Does not require assumptions to be made about data
– Can find patterns in very large amounts of data
– Requires understanding of data and business problem
• Data Analysis
– Tests for statistical correctness of models
• Are the statistical assumptions of the models correct? E.g., is the R-Square good?
– Hypothesis testing
• Is the relationship significant? Use a t-test to validate significance
– Tends to rely on sampling
– Techniques are not optimised for large amounts of data
– Requires strong statistical skills
Data Mining versus OLAP
• OLAP - On-line Analytical Processing
– Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening
Data Mining vs. Database
• A DB user knows what they are looking for.
• A DM user might or might not know what they are looking for.
• A DB answer to a query is 100% accurate, if the data are correct.
• DM aims to get an answer that is as accurate as possible.
• DB data are retrieved as stored.
• DM data need to be cleaned (somewhat) before producing results.
• DB results are a subset of the data.
• DM results are an analysis of the data.
• The meaningfulness of the results is not a concern for a database, whereas it is the main issue in data mining.
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD) is the process of finding useful information and patterns in the data.
• Data Mining is the use of algorithms to find the useful information in the KDD process.
• KDD process is:
» Data cleaning & integration (data pre-processing)
» Creating a common data repository for all sources, such as a data warehouse
» Data mining
» Visualization of the generated results
Data mining is not
• Brute-force crunching of bulk data
• “Blind” application of algorithms
• Going to find relationships where none exist
• Presenting data in different ways
• A database intensive task
• A difficult to understand technology requiring an advanced degree in computer science
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structure data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases & the World-Wide Web
Data Mining Algorithms
[Figure: taxonomy of data mining algorithms]
• Online Analytical Processing: SQL, Query Tools
• Discovery Driven Methods
– Description: Visualization, Clustering, Association, Sequential Analysis
– Prediction: Classification, Regressions, Decision Trees, Neural Networks
Data Mining Function: (1) Generalization
• Information integration and data warehouse construction
– Data cleaning, transformation, integration, and multidimensional data model
• Data cube technology
– Scalable methods for computing (i.e., materializing) multidimensional aggregates
– OLAP (online analytical processing)
• Multidimensional concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
Concept Description
• Characterization: provides a concise and succinct summarization of the given collection of data
• Discrimination: provides descriptions comparing two or more collections of data.
Concept description: Characterization
[Figure: Initial Relation → Generalized Relation]
Data Mining Function: (2) Association and Correlation Analysis
• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in your Walmart?
• Association, correlation vs. causality
– A typical association rule
– Are strongly associated items also strongly correlated?
• How to mine such patterns and rules efficiently in large datasets?
Association rule
• Association (correlation and causality)
– age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]
• Association rule mining
– Finding frequent patterns, associations, correlations among sets of items or objects in transaction databases, relational databases, and other information repositories
– Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database
• Motivation: finding regularities in data
– What products were often purchased together?
Example: Association rule

Transaction-id   Items bought
10               a1, a2, a3
20               a1, a3
30               a1, a4
40               a2, a5, a6

• Itemset A1, A2 = {a1, …, ak}
• Find all the rules A1 ⇒ A2 with minimum confidence and support
– support, s: probability that a transaction contains A1 ∪ A2
– confidence, c: conditional probability that a transaction having A1 also contains A2
• Let min_support = 50%, min_conf = 50%:
– a1 ⇒ a3 (50%, 66.7%)
– a3 ⇒ a1 (50%, 100%)
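The support and confidence figures in this example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `support` and `confidence` and the list `TRANSACTIONS` are illustrative choices, and the data are the four transactions from the example table.

```python
from typing import FrozenSet, Iterable, List

# The four transactions from the example table (ids 10, 20, 30, 40).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"a1", "a2", "a3"}),
    frozenset({"a1", "a3"}),
    frozenset({"a1", "a4"}),
    frozenset({"a2", "a5", "a6"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    both = frozenset(lhs) | frozenset(rhs)
    return support(both, transactions) / support(lhs, transactions)


if __name__ == "__main__":
    for lhs, rhs in (({"a1"}, {"a3"}), ({"a3"}, {"a1"})):
        s = support(lhs | rhs, TRANSACTIONS)
        c = confidence(lhs, rhs, TRANSACTIONS)
        print(f"{sorted(lhs)} => {sorted(rhs)}: support={s:.1%}, confidence={c:.1%}")
```

Both rules share the same support because support depends only on the combined itemset {a1, a3}; the confidences differ because a1 appears in three transactions while a3 appears in two.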
Data Mining Function: (3) Classification
• Classification and label prediction
– Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on (mileage)
– Predict some unknown class labels
• Typical methods
– Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
• Typical applications:
– Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
Classification (1): Model Construction
[Figure: Training Data → Classification Algorithms → Classifier (Model)]
Example rule produced by the classifier:
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classification (2): Prediction Using the Model
[Figure: the Classifier is checked against Testing Data and then applied to Unseen Data]
Unseen data: (Jeff, Professor, 4)
Tenured?
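The prediction step can be illustrated by coding the rule from the model-construction slide and applying it to the unseen record. This is a small sketch under stated assumptions: the function name `is_tenured` is hypothetical, and returning 'no' when the rule does not fire is an assumption, since the slide only gives the 'yes' branch.

```python
def is_tenured(rank: str, years: int) -> str:
    """Apply the slide's rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.

    Returning 'no' otherwise is an assumption; the slide states only the 'yes' case.
    """
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"


if __name__ == "__main__":
    name, rank, years = ("Jeff", "Professor", 4)  # the unseen record from the slide
    print(f"{name}: tenured = {is_tenured(rank, years)}")
```

For (Jeff, Professor, 4) the first condition already holds, so the rule answers 'yes' even though the years condition is not met.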
Classification Techniques
• Decision Tree Induction
• Bayesian Classification
• Neural Networks
• Genetic Algorithms
• Fuzzy Set and Logic
Data Mining Function: (4) Cluster Analysis
• Unsupervised learning (i.e., class label is unknown)
• Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
• Principle: Maximizing intra-class similarity & minimizing interclass similarity
• Many methods and applications
Clustering
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Clustering
– Grouping a set of data objects into clusters based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
• Example
– Land use: Identification of areas of similar land use in an earth observation database
– City-planning: Identifying groups of houses according to their house type, value, and geographical location
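One common way to put the clustering principle into practice is k-means, which the slides do not name; it is used here only as an illustration. The sketch below clusters one-dimensional values. The function name `kmeans_1d`, the house values, and the starting centroids are all made-up assumptions.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(values: Sequence[float], centroids: Sequence[float],
              iterations: int = 10) -> Tuple[List[float], List[List[float]]]:
    """Plain k-means on 1-D data, starting from the given centroids."""
    centers = list(centroids)
    clusters: List[List[float]] = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:  # converged: assignments can no longer change
            break
        centers = updated
    return centers, clusters


if __name__ == "__main__":
    # Hypothetical house values (in thousands), chosen only for illustration.
    house_values = [100, 110, 120, 400, 410, 420]
    centers, groups = kmeans_1d(house_values, centroids=[100, 420])
    print(centers, groups)
```

Values inside each resulting group are close to one another and far from the other group, which is the intra-class versus interclass similarity principle stated in the slide.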
Data Mining Function: (5) Outlier Analysis
• Outlier analysis
– Outlier: A data object that does not comply with the general behavior of the data
– Noise or exception? One person’s garbage could be another person’s treasure
– Methods: by-product of clustering or regression analysis, …
– Useful in fraud detection, rare events analysis
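The slide lists clustering and regression as sources of outlier detection. As a simpler stand-in, and not one of the methods named above, the sketch below flags values whose z-score exceeds a threshold. The function name `find_outliers`, the threshold of 2.0, and the sample amounts are illustrative assumptions.

```python
import statistics
from typing import List, Sequence


def find_outliers(values: Sequence[float], threshold: float = 2.0) -> List[float]:
    """Return values whose absolute z-score is greater than `threshold`.

    Uses the population standard deviation; returns [] when all values are equal.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


if __name__ == "__main__":
    # Hypothetical transaction amounts; the last one is far from the rest.
    amounts = [10, 12, 11, 13, 12, 95]
    print(find_outliers(amounts))
```

Whether a flagged value is noise or a meaningful exception, such as a fraudulent transaction, still has to be judged in context.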
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
• Sequence, trend and evolution analysis
– Trend, time-series, and deviation analysis: e.g., regression and value prediction
– Sequential pattern mining
• e.g., first buy digital camera, then buy large SD memory cards
– Periodicity analysis
– Biological sequence analysis
• Mining data streams
– Ordered, time-varying, potentially infinite, data streams
Regression
• Regression is similar to classification
– First, construct a model
– Second, use the model to predict unknown values
• Methods
– Linear and multiple regression
– Non-linear regression
• Regression is different from classification
– Classification predicts categorical class labels
– Regression models continuous-valued functions
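The two steps, construct a model and then use it to predict an unknown value, can be sketched for simple linear regression with ordinary least squares. The function names `fit_line` and `predict` and the sample points are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Step 1: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x: float, slope: float, intercept: float) -> float:
    """Step 2: use the fitted model to predict a continuous value."""
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical points that lie exactly on y = 2x + 1.
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(slope, intercept, predict(5, slope, intercept))
```

The output of `predict` is a number on a continuous scale, whereas a classifier would return one of a fixed set of class labels.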
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing]
Why Confluence of Multiple Disciplines?
• Tremendous amount of data
– Algorithms must be highly scalable to handle data volumes such as terabytes
• High dimensionality of data
– Micro-array may have tens of thousands of dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structure data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
– Software programs, scientific simulations
• New and sophisticated applications
Applications of Data Mining
• Web page analysis: from web page classification, clustering algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Major Issues in Data Mining (1)
• Mining Methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional space
– Data mining: An interdisciplinary effort
– Boosting the power of discovery in a networked environment
– Handling noise, uncertainty, and incompleteness of data
– Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results
Major Issues in Data Mining (2)
• Efficiency and Scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
– Handling complex types of data
– Mining dynamic, networked, and global data repositories
• Data mining and society
– Social impacts of data mining
– Privacy-preserving data mining
– Invisible data mining
A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.
• ACM Transactions on KDD (2007).
Conferences and Journals on Data Mining
• KDD Conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– European Conf. on Machine Learning and Principles and practices of Knowledge Discovery and Data Mining (ECML-PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
– Int. Conf. on Web Search and Data Mining (WSDM)
• Other related conferences
– DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
– Web and IR conferences: WWW, SIGIR, WSDM
– ML conferences: ICML, NIPS
– PR conferences: CVPR
• Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD
Where to Find References? DBLP, Google
• Data mining and KDD (SIGKDD: CDROM)
– Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
– Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology, CD ROM)
– Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
– Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• Web and IR
– Conferences: SIGIR, WWW, CIKM, etc.
– Journals: WWW: Internet and Web Information Systems
• Statistics
– Conferences: Joint Stat. Meeting, etc.
– Journals: Annals of Statistics, etc.
• Visualization
– Conference proceedings: CHI, ACM-SIGGraph, etc.
– Journals: IEEE Trans. visualization and computer graphics, etc.
Summary
• Data mining: Discovering interesting patterns and knowledge from massive amounts of data
• A natural evolution of science and information technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
• Data mining technologies and applications
• Major issues in data mining
Data Mining for Business Decision
And now, discussion
Data Mining in Association Rule
• Contact:
• Prof. (Dr.) T. Muthukumar
[email protected]
• (0-9871969455)
