Data Mining (DM)
Lecture #1
Data Mining Overview
Textbooks
• Galit Shmueli, Nitin R. Patel and Peter C. Bruce. (2010) Data Mining for
Business Intelligence: Concepts, Techniques, and Applications in Microsoft
Office Excel with XLMiner, second edition, Wiley. ISBN: 978-0-470-52682-8. (or later
edition) Available online
• A B M Shawkat Ali and Saleh A. Wasimi. (2007) Data Mining: Methods and
Techniques, Thomson. ISBN: 978-0-17-013676-1. (or later edition)
• Jiawei Han, Micheline Kamber and Jian Pei. (2012) Data Mining: Concepts and
Techniques, third edition, Morgan Kaufmann. ISBN: 978-0123814791. (or later
edition) Available online
• Ian H. Witten and Eibe Frank. (2005) Data Mining: Practical Machine Learning
Tools and Techniques, second edition, Morgan Kaufmann. ISBN: 0-12-088407-0.
(or later edition) Available online
• Pang-Ning Tan, Michael Steinbach and Vipin Kumar. (2006) Introduction to Data
Mining, Pearson. (or later edition) Available online
Subject Information CP3403/CP5634
Recommended text Supplementary text
Lecture Overview
Cartoons
DM tools
What is Data Mining?
• Data mining is a process of discovering patterns in
large data sets involving methods at the intersection
of machine learning, statistics, and database
systems. Data mining is an interdisciplinary subfield
of computer science and statistics with an overall goal to
extract information (with intelligent methods) from a data
set and transform the information into a comprehensible
structure for further use.
• Alternative names
– knowledge discovery (mining) in databases (KDD)
– knowledge extraction
– pattern mining
– exploratory data analysis
– inductive learning
– business intelligence
– etc.
Why DM?
• Expected or unexpected
• Generalisation or subsetting (searching)
• Inductive or deductive learning
• Exploratory (data orientated) or
confirmatory (model orientated)
How big is a zettabyte?
Units from largest to smallest (each step is a factor of 1,000):
• Zettabyte = 1,000,000,000,000,000,000,000 bytes
• Exabyte
• Petabyte
• Terabyte
• Gigabyte
• Megabyte
• Kilobyte
• Byte
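The scale of these units can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal (SI) prefixes in which each unit is 1,000 times the one below it; the names `UNITS` and `bytes_in` are illustrative and not part of the lecture material.

```python
# Decimal (SI) storage units, smallest to largest.
# Assumption: each step up is a factor of 1,000 (not 1,024).
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]

def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit` under the SI definition."""
    return 1000 ** UNITS.index(unit)

# Print a table of all units with thousands separators.
for unit in UNITS:
    print(f"1 {unit:<9} = {bytes_in(unit):>29,} bytes")
```

Under this assumption a zettabyte is 10^21 bytes, that is, a 1 followed by 21 zeros.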
How to handle big data?
[Figure: data mining at the centre of processing power, algorithms, and storage]
Where do the data come from?
Potential DM Applications
• Data analysis and decision support
– Customer analysis and management
• Target marketing, customer relationship management (CRM), market
basket analysis, cross selling, market segmentation
• Amazon, Walmart, etc.
• Web analysis
– Web mining, web personalisation, spam filter, text mining
• Spatial data mining
– Hot spot analysis, cause-effect analysis, spatial reasoning
• Biological data mining
– Microarray analysis, DNA analysis
CP3300 CP5605 CP5634 21
• …
Potential DM Applications
Data Cleaning
Data Integration
2. Collation of data
• Data visualisation
• Data collection, preprocessing, reduction and transformation
3. Model selection
• Classification, regression, association rule mining (ARM), clustering, etc.?
• Evaluate the model: does it yield interesting insights or knowledge? (the ‘DM’ part)
4. Actionable insights
• Present insights
• Return on investment
• Fine-tune model on operations/ new data
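Steps 3 and 4 depend on evaluating a model on data it has not seen. The sketch below shows a simple holdout evaluation, assuming a toy one-feature data set and a hypothetical threshold model; `train_test_split`, `fit_threshold`, and `accuracy` are illustrative names, not functions from any tool used in this subject.

```python
import random
from statistics import mean

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def fit_threshold(train):
    """Toy model: the midpoint between the two class means is the cut-off."""
    low = mean(x for x, label in train if label == "low")
    high = mean(x for x, label in train if label == "high")
    cut = (low + high) / 2
    return lambda x: "high" if x >= cut else "low"

def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(x) == label for x, label in rows) / len(rows)

# Hypothetical data: values 0-19, labelled "high" from 10 upwards.
data = [(x, "high" if x >= 10 else "low") for x in range(20)]
train, test = train_test_split(data)
model = fit_threshold(train)
print(f"held-out accuracy: {accuracy(model, test):.2f}")
```

The model is fitted on the training rows only, so the printed accuracy estimates how it would behave on new data.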
Data Mining Processes
PDCA model (Plan, Do, Check, Act), applied iteratively. Stages shown in the figure:
• Problem identification
• Collation of data
• Data preprocessing
• Choosing an algorithm
• Model construction and evaluation
• Interpretation of the discovered knowledge
• Taking action
Data Mining Processes
CRISP-DM
(Source: http://www.crisp-dm.org/Process/index.htm)
DM – Different Functionalities
[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top]
Layers, top to bottom:
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst/Scientist)
• Data Exploration: Statistical Summary, Querying, and Reporting
Source: Data Mining: Introductory and Advanced Topics, by Dunham, Prentice Hall.
DM Techniques (Strategies)
• Descriptive mining
– Clustering: grouping records into clusters with similar
characteristics that summarise/describe the data
– Characterization: generalising the data to find compact descriptions
– Deviation detection: finding outliers that deviate from those clusters
• Predictive mining
– Classification: assigning data to one of a set of predefined classes
– Trend detection: detecting changes and trends
– Association: finding interesting dependencies among attributes
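Clustering, the first descriptive technique listed above, can be illustrated with k-means. This is a minimal sketch, assuming k-means (Lloyd's algorithm) as the clustering method and a hypothetical set of 2-D points; the lecture does not prescribe a particular algorithm, and `k_means` is an illustrative name.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Plain k-means (Lloyd's algorithm); the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

No class labels are supplied; the groups emerge from the distances alone, which is what distinguishes clustering from classification.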
Good Bad
Clustering
Descriptive DM Example
• Outlier analysis
– Outlier: Data object that does not comply with
the general behavior of the data
– Noise or exception?
– Useful in fraud detection, rare events analysis
Suspect
Normal
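One common way to flag suspect values is a z-score rule. This is a minimal sketch, assuming hypothetical transaction amounts and a cut-off of 2 standard deviations; both the rule and the threshold are illustrative choices, not the method prescribed by the lecture.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    centre = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - centre) / spread > threshold]

# Hypothetical transaction amounts: seven typical values and one suspect value.
amounts = [52, 48, 50, 51, 49, 47, 53, 500]
print(flag_outliers(amounts))
```

Whether a flagged value is noise to discard or an exception worth investigating (for example, possible fraud) still needs domain judgement.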
Predictive DM - Classification
Classification
Apple: round & red
Banana: long & yellow
education
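The apple/banana example can be written as a tiny classifier. This is a minimal sketch, assuming a 1-nearest-neighbour rule over two categorical features (shape, colour); the `TRAINING` table and the function names are illustrative.

```python
# Labelled training examples: (shape, colour) -> class.
TRAINING = [
    (("round", "red"), "apple"),
    (("long", "yellow"), "banana"),
]

def mismatches(a, b):
    """Count the feature positions where two examples differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(features):
    """1-nearest-neighbour: return the label of the closest training example."""
    _, label = min(TRAINING, key=lambda example: mismatches(features, example[0]))
    return label

print(classify(("round", "red")))   # matches the apple example exactly
print(classify(("long", "green")))  # unseen colour, but the shape matches banana
```

The classes are predefined (apple, banana); the classifier only decides which one a new object belongs to.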
Predictive DM - ARM
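Association rule mining (ARM) is usually introduced through the support and confidence of a rule such as bread -> milk. This is a minimal sketch, assuming a hypothetical list of shopping baskets; `support` and `confidence` follow the standard definitions, but the data and names are illustrative.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

print(f"support(bread, milk)     = {support({'bread', 'milk'}):.2f}")
print(f"confidence(bread -> milk) = {confidence({'bread'}, {'milk'}):.2f}")
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain milk.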
DM Big challenges
• Volume: scale of data
• Variety: different forms of data; 90% unstructured (text, audio, movie, images)
• Velocity: analysis of streaming data, real-time decision making in emergent situations
• Veracity: 1/3 of business leaders do not trust the info they use to make decisions; incorrectness, uncertainties, garbage-in-gem-out
DM – Major Practical Issues
• Mining methodology
– Handling missing, noisy, and incomplete data
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Pattern evaluation: the interestingness problem
– Performance: efficiency, effectiveness, and scalability
• User interaction
– Expression and visualization of data mining results
– Incorporation of background knowledge
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• Applications and social impacts
– Protection of data security, integrity, confidentiality, and privacy; prevention of unauthorized use
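Handling missing data, the first methodology issue above, often starts with simple imputation. This is a minimal sketch, assuming missing values are marked as `None` and are replaced by the mean of the observed values; mean imputation is only one of several possible strategies, and `impute_mean` is an illustrative name.

```python
from statistics import mean

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical attribute with two missing entries.
ages = [23, None, 31, 27, None]
print(impute_mean(ages))
```

Imputation keeps every record usable, but it also shrinks the variance of the attribute, so it should be applied with care.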