Data Mining: What Is Data Mining?

This document discusses data mining and knowledge discovery. It describes data mining as a process comprising domain understanding, data selection, data cleaning, preprocessing, pattern discovery, interpretation, and reporting. It lists illustrative applications such as equipment fault prediction, process control, and fraud detection, distinguishes data mining from related technologies such as data warehousing and OLAP, and surveys learning systems and knowledge representation forms, including decision rules, decision trees, and neural networks.

Data Mining

Andrew Kusiak
Intelligent Systems Laboratory
2139 Seamans Center
The University of Iowa
Iowa City, IA 52242-1527
[email protected]
http://www.icaen.uiowa.edu/~ankusiak
Tel. 319-335 5934
Fax. 319-335 5669

What is Data Mining?
• Domain understanding
• Data selection
• Data cleaning, e.g., data duplication, missing data
• Preprocessing, e.g., integration of different files
• Pattern (knowledge) discovery
• Interpretation (e.g., visualization)
• Reporting
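
The data cleaning step can be illustrated with a minimal sketch. The records, field names, and the policy of dropping incomplete rows are all assumptions made for illustration; the slides name only the two problems (data duplication and missing data), not a specific remedy.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "param": 1.02, "test": "High"},
    {"id": 2, "param": 2.03, "test": "Low"},
    {"id": 2, "param": 2.03, "test": "Low"},   # exact duplicate of the previous row
    {"id": 3, "param": None, "test": "High"},  # missing measurement
]


def clean(rows):
    """Drop exact duplicate rows, then drop rows with any missing value."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # Assumed policy: discard incomplete rows rather than impute values.
    return [row for row in unique if all(v is not None for v in row.values())]


cleaned = clean(records)
print([row["id"] for row in cleaned])
```

Dropping incomplete rows is the simplest choice; imputing a value from the remaining data is a common alternative when records are scarce.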

Data Mining “Architecture”

Illustrative Applications
• Prediction of equipment faults
• Determining a stock level
• Process control
• Fraud detection
• Genetics
• Disease staging and diagnosis
• Decision making

Pharmaceutical Industry
An individual object (e.g., product, patient, drug) orientation
vs
A population of objects (products, patients, drugs) orientation

Pharmaceutical Industry
• Selection of “Patient suitable” medication
– Adverse drug effects minimized
– Drug effectiveness maximized
– New markets for “seemingly ineffective” drugs
• “Medication bundle”
– Life-time treatments
• Design and virtual testing of new drugs

What is Knowledge Discovery?
[Figure: data set, e.g., Excel, Access, Data Warehouse]

Data Mining is Not
• Data warehousing
• SQL / Ad hoc queries / reporting
• Software agents
• Online Analytical Processing (OLAP)
• Data visualization

Learning Systems (1/2)
• Classical statistical methods (e.g., discriminant analysis)
• Modern statistical techniques (e.g., k-nearest neighbor, Bayes theorem)
• Neural networks
• Support vector machines
• Decision tree algorithms
• Decision rule algorithms
• Learning classifier systems

Learning Systems (2/2)
• Association rule algorithms
• Text mining algorithms
• Meta-learning algorithms
• Inductive learning programming
• Sequence learning

Regression Models
• Simple linear regression = Linear combination of inputs
• Logistic regression = Logistic function of a linear combination of inputs
- Classic “perceptron”

Neural Networks
• Based on biology
• Inputs transformed via a network of simple processors
• Processor combines (weighted) inputs and produces an output value
• Obvious questions: What transformation function do you use and how are the weights determined?
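
The regression and neural network descriptions share one computation: a weighted sum of inputs passed through a transformation function. The sketch below shows a single such processor using the logistic function; the weights, bias, and inputs are hypothetical values chosen for illustration, and how weights are determined (training) is left out.

```python
import math


def linear_combination(weights, bias, inputs):
    """Simple linear regression form: bias plus a weighted sum of the inputs."""
    return bias + sum(w * x for w, x in zip(weights, inputs))


def logistic(z):
    """Logistic transformation function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def processor(weights, bias, inputs):
    """One processor: logistic function of a linear combination of inputs."""
    return logistic(linear_combination(weights, bias, inputs))


# Hypothetical weights and inputs, not taken from the slides.
weights = [0.5, -0.25]
bias = 0.0
output = processor(weights, bias, [2.0, 4.0])
print(output)
```

Here the weighted sum is 0.5 * 2.0 - 0.25 * 4.0 = 0, so the processor outputs logistic(0) = 0.5. A network chains many such processors, feeding the outputs of one layer to the next.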

Neural Networks
• Feed-forward - Regression analogy
• Multi-layer NN - Nonlinear regression analogy

Types of Decision Trees
• CHAID: Chi-Square Automatic Interaction Detection
- Kass (1980)
- n-way splits
- Categorical variables
• CART: Classification and Regression Trees
- Breiman, Friedman, Olshen, and Stone (1984)
- Binary splits
- Continuous variables
• C4.5
- Quinlan (1993)
- Also used for rule induction

Text Mining
• Mining unstructured data (free-form text) is a challenge for data mining
• Usual solution is to impose structure on the data and then process using standard techniques, e.g.,
- Simple heuristics (e.g., unusual words)
- Domain expertise
- Linguistic analysis
• Presentation is critical

Yet Another Classification
• Supervised
- Regression models
- k-Nearest-Neighbor
- Neural networks
- Rule induction
- Decision trees
• Unsupervised
- k-means clustering
- Self-organizing maps

Supervised Learning Algorithms
• kNN
- Quick and easy
- Models tend to be very large
• Neural Networks
- Difficult to interpret
- Training can be time consuming
• Rule Induction
- Understandable
- Need to limit calculations
• Decision Trees
- Understandable
- Relatively fast
- Easy to translate into SQL queries

Knowledge Representation Forms
• Decision rules
• Trees (graphs)
• Patterns (matrices)

DM: Product Quality Example
Training data set

Product ID   Process param_1   Test_1   Process param_2   Test_2   Quality D
1            1.02              Red      2.98              High     Good_Quality
2            2.03              Black    1.04              Low      Poor_Quality
3            0.99              Blue     3.04              High     Good_Quality
4            2.03              Blue     3.11              High     Good_Quality
5            0.03              Orange   0.96              Low      Poor_Quality
6            0.04              Blue     1.04              Medium   Poor_Quality
7            0.99              Orange   1.04              Medium   Good_Quality
8            1.02              Red      0.94              Low      Poor_Quality

Decision Rules
Rule 1. IF (Process_parameter_1 < 0.515) THEN (D = Poor_Quality);
[2, 2, 50.00%, 100.00%][2, 0][5, 6]
Rule 2. IF (Test_2 = Low) THEN (D = Poor_Quality);
[3, 3, 75.00%, 100.00%][3, 0][2, 5, 8]
Rule 3. IF (Process_parameter_2 >= 2.01) THEN (D = Good_Quality);
[3, 3, 75.00%, 100.00%][0, 3][1, 3, 4]
Rule 4. IF (Process_parameter_1 >= 0.515) & (Test_1 = Orange) THEN (D = Good_Quality);
[1, 1, 25.00%, 100.00%][0, 1][7]
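
The four rules of the Product Quality example can be checked against the training data set with a short sketch. The field names (`param_1`, `test_1`, and so on) and the helper `supporting_objects` are naming choices made here, not part of the slides; the data values and rule conditions are copied from the example.

```python
# Training data set from the Product Quality example (eight products).
FIELDS = ("id", "param_1", "test_1", "param_2", "test_2", "D")
ROWS = [
    (1, 1.02, "Red", 2.98, "High", "Good_Quality"),
    (2, 2.03, "Black", 1.04, "Low", "Poor_Quality"),
    (3, 0.99, "Blue", 3.04, "High", "Good_Quality"),
    (4, 2.03, "Blue", 3.11, "High", "Good_Quality"),
    (5, 0.03, "Orange", 0.96, "Low", "Poor_Quality"),
    (6, 0.04, "Blue", 1.04, "Medium", "Poor_Quality"),
    (7, 0.99, "Orange", 1.04, "Medium", "Good_Quality"),
    (8, 1.02, "Red", 0.94, "Low", "Poor_Quality"),
]
training = [dict(zip(FIELDS, row)) for row in ROWS]

# Each rule: (name, condition on one object, decision).
rules = [
    ("Rule 1", lambda o: o["param_1"] < 0.515, "Poor_Quality"),
    ("Rule 2", lambda o: o["test_2"] == "Low", "Poor_Quality"),
    ("Rule 3", lambda o: o["param_2"] >= 2.01, "Good_Quality"),
    ("Rule 4", lambda o: o["param_1"] >= 0.515 and o["test_1"] == "Orange",
     "Good_Quality"),
]


def supporting_objects(condition, decision, objects):
    """IDs of objects that satisfy the rule's conditions and its decision."""
    return [o["id"] for o in objects if condition(o) and o["D"] == decision]


results = {name: supporting_objects(cond, dec, training)
           for name, cond, dec in rules}
for name, ids in results.items():
    print(name, ids)
```

The IDs found for each rule match the last bracket listed after it: [5, 6], [2, 5, 8], [1, 3, 4], and [7].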

Decision Rule Metrics
Rule 12
IF (Flow = 6) AND (Pressure = 7)
THEN (Efficiency = 81);
[13, 8, 4.19%, 61.54%] [1, 8, 4]
First bracket: Support, Strength, Relative strength, Confidence
Second bracket: No of supporting objects
Supporting objects:
[ { 524 },
{ 527, 528, 529, 530, 531, 533, 535, 536 },
{ 525, 526, 532, 534 }]

Definitions
• Support = Number of objects satisfying conditions of the rule
• Strength = Number of objects satisfying conditions and the decision of the rule
• Relative strength = Number of objects satisfying conditions and decision of the rule / The number of objects in the class
• Confidence = Strength/Support
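
The four rule metrics follow directly from their definitions. The sketch below computes them for Rule 2 of the Product Quality example (IF Test_2 = Low THEN D = Poor_Quality) and then recomputes the confidence of Rule 12 from its reported counts. The function name `rule_metrics` and the reduced two-field data layout are assumptions for illustration; the sketch also assumes the rule matches at least one object and the decision class is non-empty.

```python
# Test_2 value and decision D for the eight products of the
# Product Quality example (other attributes omitted for brevity).
objects = [
    {"test_2": "High", "D": "Good_Quality"},
    {"test_2": "Low", "D": "Poor_Quality"},
    {"test_2": "High", "D": "Good_Quality"},
    {"test_2": "High", "D": "Good_Quality"},
    {"test_2": "Low", "D": "Poor_Quality"},
    {"test_2": "Medium", "D": "Poor_Quality"},
    {"test_2": "Medium", "D": "Good_Quality"},
    {"test_2": "Low", "D": "Poor_Quality"},
]


def rule_metrics(objects, condition, decision):
    """Support, strength, relative strength [%], and confidence [%] of a rule."""
    matching = [o for o in objects if condition(o)]
    support = len(matching)                                      # conditions hold
    strength = sum(1 for o in matching if o["D"] == decision)    # conditions and decision hold
    class_size = sum(1 for o in objects if o["D"] == decision)   # objects in the class
    return {
        "support": support,
        "strength": strength,
        "relative_strength": 100.0 * strength / class_size,
        "confidence": 100.0 * strength / support,
    }


rule_2 = rule_metrics(objects, lambda o: o["test_2"] == "Low", "Poor_Quality")
print(rule_2)

# Rule 12 reports support 13 and strength 8, so confidence = 8 / 13.
rule_12_confidence = round(100.0 * 8 / 13, 2)
print(rule_12_confidence)
```

For Rule 2 this reproduces the reported [3, 3, 75.00%, 100.00%], and 8 / 13 gives the 61.54% confidence reported for Rule 12.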

Classification Accuracy
Test: Leaving-one-out

Confusion Matrix
               Poor_Quality   Good_Quality   None
Poor_Quality   3              1              0
Good_Quality   1              3              0

Average Accuracy [%]
               Correct   Incorrect   None
Total          75.00     25.00       0.00
Poor_Quality   75.00     25.00       0.00
Good_Quality   75.00     25.00       0.00

Decision rules
Rule 113
IF (B_Master >= 1634.26)
AND (B_Temp in (1601.2, 1660.22])
AND (B_Pressure in [17.05, 18.45))
AND (A_point = 0.255) AND (Average_O2 = 77)
THEN (Eff = 87) OR (Eff = 88);
[6, 6, 23.08%, 100.00%][0, 0, 0, 0, 0, 0, 0, 3, 3, 0]
[{2164, 2167, 2168}, {2163, 2165, 2166}]
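
The Average Accuracy figures reported under Classification Accuracy can be recomputed from the confusion matrix. This sketch assumes that rows are actual classes and columns are predicted classes (the slide does not say which is which; the matrix is symmetric, so the result is the same either way). It does not perform the leave-one-out test itself, only the arithmetic on its outcome.

```python
# Confusion matrix from the leave-one-out test.
# Assumption: outer key = actual class, inner key = predicted class;
# "None" counts objects that no rule classified.
confusion = {
    "Poor_Quality": {"Poor_Quality": 3, "Good_Quality": 1, "None": 0},
    "Good_Quality": {"Poor_Quality": 1, "Good_Quality": 3, "None": 0},
}


def accuracy_row(actual, counts):
    """Percent (correct, incorrect, none) for one actual class."""
    total = sum(counts.values())
    correct = counts[actual]
    none = counts["None"]
    incorrect = total - correct - none
    return tuple(100.0 * n / total for n in (correct, incorrect, none))


per_class = {c: accuracy_row(c, counts) for c, counts in confusion.items()}

# Overall figures pool the counts of all classes.
n_total = sum(sum(counts.values()) for counts in confusion.values())
n_correct = sum(confusion[c][c] for c in confusion)
n_none = sum(counts["None"] for counts in confusion.values())
overall = tuple(100.0 * n / n_total
                for n in (n_correct, n_total - n_correct - n_none, n_none))

print("Total", overall)
for name, row in per_class.items():
    print(name, row)
```

Six of the eight products are classified correctly, which gives the 75.00% total accuracy shown in the table.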

Decision rules
Rule 12
IF (Ave_Middle_Bed = 0) AND (PA_Fan_Flow = 18) THEN (Efficiency = 71);
[16, 10, 10.31%, 62.50%] [1, 1, 2, 10, 2]
[{ 682 }, { 681 }, { 933, 936 },
{ 875, 876, 877, 878, 879, 880, 934, 935, 1000, 1001 },
{ 881, 882 }]

Decision Rule vs Decision Tree Algorithms

F1   F2   F3   F4   D
0    0    0    1    One
0    0    1    1    Two
0    1    1    1    Three
1    1    1    1    Four

Decision Tree

F1   F2   F3   F4   D
0    0    0    1    One
0    0    1    1    Two
0    1    1    1    Three
1    1    1    1    Four

F2 = 0
    F3 = 0: 0001 (One)
    F3 = 1: 0011 (Two)
F2 = 1
    F1 = 0: 0111 (Three)
    F1 = 1: 1111 (Four)
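
The decision tree for the four-object table (root F2, then F3 on the 0 branch and F1 on the 1 branch) can be written as a nested structure and walked from the root to a leaf. The tuple-and-dict encoding and the function name `classify` are choices made for this sketch.

```python
# Decision tree from the slide: each internal node is (feature, branches),
# each leaf is the decision D as a string.
tree = ("F2", {
    0: ("F3", {0: "One", 1: "Two"}),
    1: ("F1", {0: "Three", 1: "Four"}),
})

# The four objects of the example table.
table = [
    {"F1": 0, "F2": 0, "F3": 0, "F4": 1, "D": "One"},
    {"F1": 0, "F2": 0, "F3": 1, "F4": 1, "D": "Two"},
    {"F1": 0, "F2": 1, "F3": 1, "F4": 1, "D": "Three"},
    {"F1": 1, "F2": 1, "F3": 1, "F4": 1, "D": "Four"},
]


def classify(node, obj):
    """Follow the branch matching the object's feature value until a leaf."""
    while not isinstance(node, str):
        feature, branches = node
        node = branches[obj[feature]]
    return node


predictions = [classify(tree, obj) for obj in table]
print(predictions)
```

Every object is classified by testing exactly two features, and F4, which has the same value for all four objects, is never tested.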

Decision Rules
Rule 1. IF (F3 = 0) THEN (D = One);
[1, 100.00%, 100.00%][1]
Rule 2. IF (F2 = 0) AND (F3 = 1) THEN (D = Two);
[1, 100.00%, 100.00%][2]
Rule 3. IF (F1 = 0) AND (F2 = 1) THEN (D = Three);
[1, 100.00%, 100.00%][3]
Rule 4. IF (F1 = 1) THEN (D = Four);
[1, 100.00%, 100.00%][4]

Decision Tree vs Rule Tree

F1   F2   F3   F4   D
0    0    0    1    One
0    0    1    1    Two
0    1    1    1    Three
1    1    1    1    Four

Decision Tree
F2 = 0
    F3 = 0: 0001 (One)
    F3 = 1: 0011 (Two)
F2 = 1
    F1 = 0: 0111 (Three)
    F1 = 1: 1111 (Four)

Rule Tree
Nodes F3, F2, and F1, each with branches 0 and 1; leaves 0001 (One), 0011 (Two), 0111 (Three), 1111 (Four)
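
The four decision rules listed for the F1 to F4 table can be checked in the same way as the tree. In this sketch each rule is stored as a dictionary of required feature values plus a decision; that encoding and the helper name `matching_rules` are assumptions for illustration.

```python
# The four objects of the example table.
table = [
    {"F1": 0, "F2": 0, "F3": 0, "F4": 1, "D": "One"},
    {"F1": 0, "F2": 0, "F3": 1, "F4": 1, "D": "Two"},
    {"F1": 0, "F2": 1, "F3": 1, "F4": 1, "D": "Three"},
    {"F1": 1, "F2": 1, "F3": 1, "F4": 1, "D": "Four"},
]

# Rules 1 to 4: (required feature values, decision).
rules = [
    ({"F3": 0}, "One"),
    ({"F2": 0, "F3": 1}, "Two"),
    ({"F1": 0, "F2": 1}, "Three"),
    ({"F1": 1}, "Four"),
]


def matching_rules(obj):
    """Decisions of all rules whose conditions the object satisfies."""
    return [decision for conditions, decision in rules
            if all(obj[f] == v for f, v in conditions.items())]


matches = [matching_rules(obj) for obj in table]
print(matches)
```

Each object is matched by exactly one rule, and Rules 1 and 4 need only a single feature, whereas the tree always tests two features per object.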

Use of Extracted Knowledge

Decision Rule Algorithms
Identify unique features of an object rather than commonality among all objects

Traditional Modeling
• Regression analysis
• Neural network

Data Mining
• Rules
• Decision trees
• Patterns

Data Life Cycle
[Figure: cycle linking data farming, knowledge extraction, decision-making, and result evaluation]

Evolution in Data Mining
• Data Farming
• Cultivating data rather than assuming that it is available

Data Farming
Pull data approach
vs
Push data approach in classical data mining

Data Farming
Define features that
• Maximize classification accuracy
and
• Minimize the data collection cost

Data Mining Standards
• Predictive Model Markup Language (PMML)
- The Data Mining Group (www.dmg.org)
- XML based (DTD)
• Java Data Mining API spec request (JSR-000073)
- Oracle, Sun, IBM, …
- Support for data mining APIs on J2EE platforms
- Build, manage, and score models programmatically
• OLE DB for Data Mining
- Microsoft
- Table based
- Incorporates PMML

Summary
• Data mining algorithms support a new paradigm: Identify what is unique about an object
• DM tools to enter new areas of information analysis

