
Data Mining (DM)

This document provides an overview of data mining and introduces key concepts. It discusses recommended textbooks on the topic and outlines the lecture, including defining what data mining is, why it has become more popular, potential applications, and the basic knowledge discovery process. Data mining is described as the process of discovering useful patterns from large datasets through methods at the intersection of machine learning, statistics, and databases. The knowledge discovery process involves data cleaning, integration, selection, mining, and evaluation.

Uploaded by

Khai Duong

Data Mining (DM)

Lecture #1
Data Mining Overview

Textbooks
• Galit Shmueli, Nitin R. Patel and Peter C. Bruce. (2010) Data Mining for
Business Intelligence: Concepts, Techniques, and Applications in Microsoft
Office Excel with XLMiner, Wiley (Second edition) (978-0-470-52682-8) (or later
edition) Online available
• A B M Shawkat Ali and Saleh A. Wasimi. (2007) Data Mining: Methods and
Techniques, Thomson (978-0-17-013676-1) (or later edition)
• Jiawei Han, Micheline Kamber and Jian Pei. (2012) Data Mining: Concepts and
Techniques, third edition, Morgan Kaufmann. ISBN: 978-0123814791. (or later
edition) Online available
• Ian H. Witten and Eibe Frank. (2005) Data Mining: Practical Machine Learning
Tools and Techniques, second edition, Morgan Kaufmann. ISBN: 0-12-088407-0
(or later edition) Online available
• Pang-Ning Tan, Michael Steinbach and Vipin Kumar. (2006) Introduction to Data
Mining, Pearson. (or later edition) Online available

Subject Information CP3403/CP5634
Recommended text Supplementary text

Lecture Overview

• What is data mining?
• Why data mining?
• Where is data mining used?
• Data mining processes
• Data mining dimensions/multi-disciplines
• Data mining techniques
• Challenges of data mining
• Data mining software

CP3300 CP5605 CP5634 4


Pre-Reading List
Pre-Lecture work (reading) for students:

• Textbook 1 (Data Mining for Business Intelligence: Concepts,
Techniques, and Applications in Microsoft Office Excel with XLMiner
(Solver Analytic) – Galit Shmueli, Nitin R. Patel and Peter C. Bruce) –
Chapters 1 and 2
• Textbook 2 (Data Mining: Methods and Techniques - A B M Shawkat
Ali and Saleh A. Wasimi ) – Chapter-1

Cartoons

source from: capgemini consulting


Key Idea … Learning from experience?

• Who do you think would make a better dinner for you,
and why?

• Who do you think will offer sound financial
knowledge/advice for you?
Experience & Learning

• Gaining knowledge from data:
understanding, learning, intelligence, and
prediction
• What do we need to understand and learn?
• How do we learn …?

DATA


What is Data Mining?

• Data is hard to understand
• We want small, easy-to-understand, useful pieces of
knowledge
• Data Mining will find those useful pieces

Terse, unreadable Easy to understand


What is Data Mining?

• Data mining is the nontrivial process of identifying
valid, novel, potentially useful and ultimately
understandable patterns in large datasets
(Fayyad, Piatetsky-Shapiro and Smyth, 1996)

• Nontrivial: More than simple computations.
• Valid: Discovered patterns are general enough to apply
to unseen data with some accuracy.
• Novel: Patterns are unexpected, not obvious.
• Potentially useful: Leads to effective actions.
• Understandable: Simple, interpretable.
What is Data Mining?
• It is about finding patterns from LARGE data.
• It is about learning (inference) from LARGE data.
• It is about exploratory data analysis.

Small data: traditional analysis tools
Massive data (eBay, Amazon, Walmart, Visa, Flickr, etc.): DM tools
What is Data Mining?
• Data mining is a process of discovering patterns in
large data sets involving methods at the intersection
of machine learning, statistics, and database
systems.[1] Data mining is an interdisciplinary subfield
of computer science and statistics with an overall goal to
extract information (with intelligent methods) from a data
set and transform the information into a comprehensible
structure for further use.

Many experts agree that data mining should not be fully
automatic: human intervention and interpretation are
essential.


What is Data Mining?

• Alternative names
– Knowledge discovery (mining)
in databases (KDD)
– knowledge extraction
– pattern mining
– exploratory data analysis
– inductive learning
– business intelligence
– etc.



Data Query VS. Data Mining
Data Query
• A list of customers who used MasterCard to buy medicine from a pharmacy.
• A list of employees who will reach retiring age next year.
• A list of residents in a locality who became diabetic before reaching the age of 50.
• Find all customers who have purchased diapers.

Data Mining
• Develop a profile of MasterCard holders who will take advantage of the forthcoming sale promotion of the pharmacy.
• Develop a list of employees who are likely to avail themselves of the voluntary early retirement scheme when they reach the retiring age.
• Construct some rules about the lifestyle of residents of a locality which may reduce the occurrence of diabetes at an early age.
• Find all items which are frequently purchased with diapers.
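
The diaper example can be made concrete with a minimal Python sketch. The transaction data and the names `transactions`, `customers_who_bought` and `items_bought_with` are hypothetical, invented purely for illustration. The query returns the records that match a stated condition, while the (very simplified) mining step surfaces a co-occurrence pattern that was not asked for by name.

```python
from collections import Counter

# Hypothetical toy transactions: (customer, set of items bought together).
transactions = [
    ("ann", {"diapers", "beer", "milk"}),
    ("bob", {"diapers", "beer"}),
    ("cat", {"bread", "milk"}),
    ("dan", {"diapers", "wipes", "beer"}),
]


def customers_who_bought(item, data):
    """Data query: return the customers whose basket contains `item`."""
    return sorted(customer for customer, basket in data if item in basket)


def items_bought_with(item, data):
    """Data mining (very simplified): count what co-occurs with `item`."""
    counts = Counter()
    for _, basket in data:
        if item in basket:
            counts.update(basket - {item})
    return counts.most_common()


print(customers_who_bought("diapers", transactions))
print(items_bought_with("diapers", transactions))
```

Real frequent-pattern miners also prune by minimum support and scale to far larger data; this sketch only shows the difference in intent between the two columns.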
What is DM?
• Induction
– Joanne is cool!
– Jai is cool!
– Thus, all JCU lecturers are cool!
• Deduction
– Joanne is cool!
– Cool means hot.
– Thus, Joanne is hot.
• Abduction
– All Koreans are cool.
– Joanne is cool!
– Thus, Joanne is a Korean.

Why DM?

• Expected or unexpected
• Generalisation or subsetting (searching)
• Inductive or deductive learning
• Exploratory (data orientated) or
confirmatory (model orientated)



Why has DM become more popular?

• data explosion – from mega/terabytes to peta/exa/zetta/yottabytes
• data complexity – numerical, textual, multimedia
• data ownership due to Web 2.0 – everybody is a data owner
• availability of technology – faster and cheaper
• availability of new algorithms and ideas to be more
competitive

How big is a zettabyte?

1 zettabyte = 1,000,000,000,000,000,000,000 bytes

Scale (smallest to largest): byte, kilobyte, megabyte, gigabyte, terabyte, petabyte, exabyte, zettabyte
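
As a quick check of the scale, the sketch below prints the number of bytes in each unit. It assumes decimal (SI) units, where each step is a factor of 1,000. Binary units based on 1,024 would give slightly different figures. The names `UNITS` and `bytes_in` are illustrative only.

```python
# Decimal (SI) byte units, each 1,000 times the previous one (an assumption).
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte", "yottabyte"]


def bytes_in(unit):
    """Return the number of bytes in one `unit`, using powers of 1,000."""
    return 1000 ** UNITS.index(unit)


for unit in UNITS:
    print(f"1 {unit:<9} = {bytes_in(unit):,} bytes")
```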
How to handle big data?

(Figure labels: Processing power, Algorithms, Storage, Data mining)
Where does the data come from?
Potential DM Applications
• Data analysis and decision support
  – Customer analysis and management
    • Target marketing, customer relationship management (CRM), market
      basket analysis, cross selling, market segmentation
    • Amazon, Walmart, etc.
• Web analysis
  – Web mining, web personalisation, spam filter, text mining
• Spatial data mining
  – Hot spot analysis, cause-effect analysis, spatial reasoning
• Biological data mining
  – Microarray analysis, DNA analysis
• …
Potential DM Applications



Data mining in place…
Knowledge Discovery (KDD) Process

• Data mining: core of the knowledge discovery process

(Figure, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Data Mining → Pattern Evaluation)
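
The stages named in the KDD figure can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records, the field names (`id`, `age`, `spend`), the age split at 40 and the `min_gap` threshold are all hypothetical, and the "mining" step is a simple comparison of group means, not a real mining algorithm.

```python
# Hypothetical records from two sources; None marks a missing value.
store_a = [{"id": 1, "age": 34, "spend": 120.0},
           {"id": 2, "age": None, "spend": 80.0}]
store_b = [{"id": 3, "age": 51, "spend": 300.0},
           {"id": 4, "age": 29, "spend": 95.0}]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]


def mine(records):
    """Data mining (toy): compare mean spend of older vs. younger customers."""
    older = [r["spend"] for r in records if r["age"] >= 40]
    younger = [r["spend"] for r in records if r["age"] < 40]
    return {"older": sum(older) / len(older),
            "younger": sum(younger) / len(younger)}


def evaluate(pattern, min_gap=50.0):
    """Pattern evaluation: keep the pattern only if the gap is large enough."""
    return abs(pattern["older"] - pattern["younger"]) >= min_gap


data = select(integrate(clean(store_a), clean(store_b)), ["age", "spend"])
pattern = mine(data)
print(pattern, "interesting:", evaluate(pattern))
```

In practice each stage is far more involved (and the process is iterative), but the order of the calls mirrors the order of the stages in the figure.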


KDD Process – key steps
1. Goal identification (need for domain knowledge)

2. Collation of data
• Data visualisation
• Data collection, preprocessing, reduction and transformation

3. Model selection
• Classification/regression/ARM/clustering etc.?
• Evaluate model -> interesting insights/knowledge? (the ‘DM’ part)

4. Actionable insights
• Present insights
• Return on investment
• Fine-tune model on operations/new data
Data Mining Processes

(Figure: the data mining process drawn as an iterative PDCA (Plan, Do, Check, Act) cycle. Stage labels: Problem identification; Collation of data; Data preprocessing; Data processing; Choosing an algorithm; Model construction and evaluation; Interpretation of the discovered knowledge; Taking action.)
Data Mining Processes

CRISP-DM
(Source: http://www.crisp-dm.org/Process/index.htm)
DM – Different Functionalities
Increasing potential to support business decisions (layers listed from the top of the pyramid to the bottom, with the role responsible in brackets):

• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst/Scientist)
• Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst/Scientist)
• Data Preprocessing, Data Warehouses & Data Lake (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
DM – Confluence of Multiple Disciplines

Source: Data Mining: Introductory and Advanced Topics, by Dunham, Prentice Hall.
DM Techniques (Strategies)

• Descriptive mining
– Clustering: identifying a set of aggregations with similar
characteristics that summarise/describe the data
– Characterization: generalising the data to find compact descriptions
– Deviation detection: finding outliers that deviate from aggregations
• Predictive mining
– Classification: assigning data to one of a set of predefined classes
– Trend detection: detecting changes and trends
– Association: finding interesting dependencies among attributes



Descriptive Mining

Concept description: Characterization and discrimination
• Generalize, summarize, and contrast data characteristics
• Example: Good vs. bad students

Good                                Bad
Attending lectures and practicals   Sleeping in lectures
Listen to lecturers                 Listen to music
Asking intriguing questions         Asking for extensions


Descriptive DM - Clustering

Clustering
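
To give a feel for how clustering groups similar records, here is a minimal sketch of k-means on one-dimensional data. k-means is only one of many clustering methods and is chosen here as an assumption, since the slide does not name an algorithm. The ages and starting centroids are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small k-means for one-dimensional data with given start centroids."""
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


ages = [21, 23, 25, 48, 50, 52]  # hypothetical customer ages
centroids, clusters = kmeans_1d(ages, centroids=[20.0, 60.0])
print(centroids)
print(clusters)
```

No class labels are supplied: the two groups emerge from the data alone, which is what makes clustering a descriptive technique.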
Descriptive DM Example

• Outlier analysis
– Outlier: Data object that does not comply with
the general behavior of the data
– Noise or exception?
– Useful in fraud detection, rare events analysis

Suspect

Normal
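
One simple way to flag a "suspect" value is to measure how many standard deviations it lies from the mean (a z-score). The sketch below assumes this approach. The amounts and the threshold of 2.0 are hypothetical choices for illustration, and real fraud detection uses considerably richer methods.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts; one is unusually large.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
print(find_outliers(amounts))
```

Whether a flagged value is noise or a genuine exception still needs human judgement, as the slide's "Noise or exception?" question suggests.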
Predictive DM - Classification

Classification
Apple: round & red Banana: long & yellow
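
A minimal sketch of classification, assuming a 1-nearest-neighbour classifier (one possible technique among many). The slide describes fruit by shape and colour. For simplicity this example uses shape only, encoded as hypothetical length and width measurements in centimetres.

```python
# Hypothetical labelled training examples: (length_cm, width_cm) -> class label.
training = [
    ((7.0, 7.5), "apple"),
    ((8.0, 8.0), "apple"),
    ((18.0, 3.5), "banana"),
    ((20.0, 4.0), "banana"),
]


def classify(sample, examples):
    """1-nearest-neighbour: return the label of the closest training example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(examples, key=lambda ex: squared_distance(sample, ex[0]))
    return label


print(classify((7.5, 7.0), training))   # a roughly round fruit
print(classify((19.0, 3.0), training))  # a long, thin fruit
```

Unlike clustering, the classes ("apple", "banana") are predefined and the model learns from labelled examples.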



Predictive DM – Trend Detection

Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory
– Periodicity analysis

(Figure: plot of income against education with the fitted line Income = education * x)
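
The relationship Income = education * x can be estimated by least squares. For a line through the origin, the slope is x = Σ(education · income) / Σ(education²). The sketch below applies that formula to hypothetical data (years of education and income in thousands); the function name `fit_through_origin` is illustrative.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for the model y = slope * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical data: years of education and annual income (in thousands).
education = [10, 12, 16, 18]
income = [30, 36, 48, 54]

x = fit_through_origin(education, income)
print(x)        # slope of the fitted line
print(x * 14)   # predicted income for 14 years of education
```

Real regression analysis would normally include an intercept and check how well the line fits; the no-intercept form is used here only because it matches the equation on the slide.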
Predictive DM - ARM

Association Rules Mining (Frequent Pattern Mining)



Example: Clustering, ARM, Classification

______________ could be used by an insurance
company to group important customers according
to age, types of policies purchased, duration of
membership, and prior claims history.

A leading supermarket chain had 100,000 point-of-sale
transactions last month. Using __________, it
observed that 25,000 of these transactions
include both banana and bread and 8,000
transactions include three items – banana, bread
and honey.

Using _________, a bank wishes to determine the credit risk
of a credit card applicant. The application is either
approved or rejected.
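
The counts in the supermarket scenario can be turned into two measures commonly used in association rule mining, support and confidence. These terms are not defined in the slides and are introduced here as an assumption about how the figures would typically be read. The variable names are illustrative.

```python
# Counts taken from the supermarket scenario above.
total_transactions = 100_000
banana_bread = 25_000        # transactions with both banana and bread
banana_bread_honey = 8_000   # transactions with banana, bread and honey

# Support: fraction of all transactions that contain the itemset.
support_bb = banana_bread / total_transactions
support_bbh = banana_bread_honey / total_transactions

# Confidence of the rule {banana, bread} -> {honey}: of the transactions
# containing banana and bread, the fraction that also contain honey.
confidence = banana_bread_honey / banana_bread

print(f"support(banana, bread)             = {support_bb:.2f}")
print(f"support(banana, bread, honey)      = {support_bbh:.2f}")
print(f"confidence(banana, bread -> honey) = {confidence:.2f}")
```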
DM Challenge - Bottleneck

(Figure labels: Processing power, Algorithms, Storage, Data mining)
DM Big challenges

• Volume: scale of data
• Variety: different forms of data; 90% unstructured (text, audio, movie, images)
• Velocity: analysis of streaming data, real-time decision making in emergent situations
• Veracity: 1/3 business leaders do not trust the info they use to make decisions; incorrectness, uncertainties, garbage-in-gem-out
DM – Major Practical Issues
• Mining methodology
– Handling missing, noisy and incomplete data
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Pattern evaluation: the interestingness problem
– Performance: efficiency, effectiveness, and scalability
• User interaction
– Expression and visualization of data mining results
– Incorporation of background knowledge
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• Applications and social impacts
– Protection of data security, integrity, unauthorized use, confidentiality and privacy



Dark Side of DM

source from: www.toondoo.com


DM Issues:
Are all the “discovered” patterns interesting?

• Data mining may generate thousands of patterns, but not all of
them are interesting!
– Human-centered, query-based, focused mining
– Constrained mining
– Meta-mining (mining from mined patterns)

• Find all the interesting patterns: Completeness
– Can a data mining system find all the interesting patterns? Do we need
to find all of the interesting patterns?

• Search for only interesting patterns: Optimization
– Can a data mining system find only the interesting patterns?


DM Issues:
Data Mining and Privacy

Discuss present privacy issues:
• On WhatsApp.

How many have switched over to Telegram? Do you feel
comfortable with WhatsApp?
How do you think WhatsApp/Facebook can use your data
commercially?


What is This Subject About?
Conclusion.
