Data Mining

Andrew Kusiak
Intelligent Systems Laboratory
2139 Seamans Center
The University of Iowa
Iowa City, IA 52242-1527
[email protected]
http://www.icaen.uiowa.edu/~ankusiak
Tel. 319-335-5934
Fax. 319-335-5669

What Is Data Mining?

• Domain understanding
• Data selection
• Data cleaning, e.g., data duplication, missing data
• Preprocessing, e.g., integration of different files
• Pattern (knowledge) discovery
• Interpretation (e.g., visualization)
• Reporting
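The data-cleaning steps above (duplicate removal, missing data) can be sketched with a toy pipeline; the records and field names below are hypothetical, not from the lecture:

```python
# Toy sketch of two cleaning steps named on the slide:
# removing duplicate records and filling missing values.

records = [
    {"id": 1, "temp": 1.02, "quality": "Good"},
    {"id": 1, "temp": 1.02, "quality": "Good"},   # duplicate record
    {"id": 2, "temp": None, "quality": "Poor"},   # missing value
]

# Data cleaning: drop exact duplicates (keyed on all fields)
seen, cleaned = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(r)

# Data cleaning: fill missing numeric values with the column mean
temps = [r["temp"] for r in cleaned if r["temp"] is not None]
mean_temp = sum(temps) / len(temps)
for r in cleaned:
    if r["temp"] is None:
        r["temp"] = mean_temp

print(len(cleaned), cleaned[1]["temp"])  # → 2 1.02
```

In a real project these steps would typically run over database extracts or flat files before any pattern-discovery algorithm is applied.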
Pharmaceutical Industry

An individual object (e.g., product, patient, drug) orientation
vs
a population of objects (products, patients, drugs) orientation

• Selection of “patient-suitable” medication
  – Adverse drug effects minimized
  – Drug effectiveness maximized
  – New markets for “seemingly ineffective” drugs
• “Medication bundle”
  – Life-time treatments
• Design and virtual testing of new drugs
Learning Systems (1/2)

• Classical statistical methods (e.g., discriminant analysis)
• Modern statistical techniques (e.g., k-nearest neighbor, Bayes theorem)
• Neural networks
• Support vector machines
• Decision tree algorithms
• Decision rule algorithms
• Learning classifier systems

Learning Systems (2/2)

• Association rule algorithms
• Text mining algorithms
• Meta-learning algorithms
• Inductive logic programming
• Sequence learning
Neural Networks

• Feed-forward NN – regression analogy
• Multi-layer NN – nonlinear regression analogy

Types of Decision Trees

• CHAID: Chi-Square Automatic Interaction Detection
  – Kass (1980)
  – n-way splits
  – Categorical variables
• CART: Classification and Regression Trees
  – Breiman, Friedman, Olshen, and Stone (1984)
  – Binary splits
  – Continuous variables
• C4.5
  – Quinlan (1993)
  – Also used for rule induction
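CART's binary splits on a continuous variable can be illustrated with a minimal sketch (pure Python; the threshold search is a simplified illustration, using the Process_parameter_1 values from the product-quality example later in the deck):

```python
# Minimal sketch of a CART-style binary split: choose the threshold on a
# continuous variable that minimizes the weighted Gini impurity.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    best = None
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, t)
    return best  # (weighted Gini, threshold)

# Process_parameter_1 values and quality labels (G = good, P = poor)
values = [1.02, 2.03, 0.99, 2.03, 0.03, 0.04, 0.99, 1.02]
labels = ["G", "P", "G", "G", "P", "P", "G", "P"]
print(best_binary_split(values, labels))
```

The best split here separates the two small values (poor-quality objects 5 and 6) from the rest, consistent with the first decision rule induced later in the deck; a full CART implementation would also recurse on each side and place the threshold midway between observed values.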
Supervised Learning Algorithms

• kNN
  – Quick and easy
  – Models tend to be very large
• Neural networks
  – Difficult to interpret
  – Training can be time consuming
• Rule induction
  – Understandable
  – Need to limit calculations
• Decision trees
  – Understandable
  – Relatively fast
  – Easy to translate into SQL queries

Knowledge Representation Forms

• Decision rules
• Trees (graphs)
• Patterns (matrices)
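A minimal kNN sketch shows why it is "quick and easy" yet yields large models: the model is simply the stored training set (the data points here are illustrative):

```python
# Minimal k-nearest-neighbor sketch: there is no training phase;
# the "model" is the training set itself, which is why kNN models
# tend to be very large.
import math
from collections import Counter

train = [((1.02, 2.98), "Good"), ((2.03, 1.04), "Poor"),
         ((0.99, 3.04), "Good"), ((0.03, 0.96), "Poor")]

def knn_predict(x, k=3):
    # sort training objects by distance to x, take a majority vote of k
    nearest = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.0, 3.0)))  # → Good
```

Every prediction scans the whole training set, so classification cost grows with the data, in contrast to a decision tree, which is compact and fast once built.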
DM: Product Quality Example

Training data set

Product ID | Process_param_1 | Test_1 | Process_param_2 | Test_2 | Quality (D)
     1     |      1.02       | Red    |      2.98       | High   | Good_Quality
     2     |      2.03       | Black  |      1.04       | Low    | Poor_Quality
     3     |      0.99       | Blue   |      3.04       | High   | Good_Quality
     4     |      2.03       | Blue   |      3.11       | High   | Good_Quality
     5     |      0.03       | Orange |      0.96       | Low    | Poor_Quality
     6     |      0.04       | Blue   |      1.04       | Medium | Poor_Quality
     7     |      0.99       | Orange |      1.04       | Medium | Good_Quality
     8     |      1.02       | Red    |      0.94       | Low    | Poor_Quality

Decision Rules

Rule 1. IF (Process_parameter_1 < 0.515) THEN (D = Poor_Quality);
[2, 2, 50.00%, 100.00%][2, 0][5, 6]

Rule 2. IF (Test_2 = Low) THEN (D = Poor_Quality);
[3, 3, 75.00%, 100.00%][3, 0][2, 5, 8]

Rule 3. IF (Process_parameter_2 >= 2.01) THEN (D = Good_Quality);
[3, 3, 75.00%, 100.00%][0, 3][1, 3, 4]

Rule 4. IF (Process_parameter_1 >= 0.515) & (Test_1 = Orange) THEN (D = Good_Quality);
[1, 1, 25.00%, 100.00%][0, 1][7]

The University of Iowa Intelligent Systems Laboratory
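As a sanity check, the four rules can be applied back to the training set (a minimal sketch; evaluating the rules in numbered order is an assumption):

```python
# The four induced rules applied as a classifier to the training table.

data = {
    1: (1.02, "Red", 2.98, "High", "Good_Quality"),
    2: (2.03, "Black", 1.04, "Low", "Poor_Quality"),
    3: (0.99, "Blue", 3.04, "High", "Good_Quality"),
    4: (2.03, "Blue", 3.11, "High", "Good_Quality"),
    5: (0.03, "Orange", 0.96, "Low", "Poor_Quality"),
    6: (0.04, "Blue", 1.04, "Medium", "Poor_Quality"),
    7: (0.99, "Orange", 1.04, "Medium", "Good_Quality"),
    8: (1.02, "Red", 0.94, "Low", "Poor_Quality"),
}

def classify(p1, t1, p2, t2):
    if p1 < 0.515:                 # Rule 1
        return "Poor_Quality"
    if t2 == "Low":                # Rule 2
        return "Poor_Quality"
    if p2 >= 2.01:                 # Rule 3
        return "Good_Quality"
    if p1 >= 0.515 and t1 == "Orange":
        return "Good_Quality"      # Rule 4
    return None                    # no rule fires

correct = sum(classify(*row[:4]) == row[4] for row in data.values())
print(correct)  # → 8
```

All eight training objects are re-classified correctly, matching the 100% confidence reported for each rule.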
Decision Rule Metrics

Rule 12
IF (Flow = 6) AND (Pressure = 7) THEN (Efficiency = 81);
[13, 8, 4.19%, 61.54%]   ← Support, Strength, Relative strength, Confidence
[1, 8, 4]                ← Number of supporting objects per class
[{ 524 },
 { 527, 528, 529, 530, 531, 533, 535, 536 },
 { 525, 526, 532, 534 }] ← Supporting objects

Definitions

• Support = number of objects satisfying the conditions of the rule
• Strength = number of objects satisfying the conditions and the decision of the rule
• Relative strength = number of objects satisfying the conditions and the decision of the rule / number of objects in the class
• Confidence = Strength / Support
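The four definitions above can be written directly as code; the sketch below checks them against Rule 2 of the product-quality example, IF (Test_2 = Low) THEN (D = Poor_Quality):

```python
# Rule metrics per the definitions: support, strength,
# relative strength, and confidence.

# (object_id, Test_2, D) triples from the product-quality training table
objects = [
    (1, "High", "Good_Quality"), (2, "Low", "Poor_Quality"),
    (3, "High", "Good_Quality"), (4, "High", "Good_Quality"),
    (5, "Low", "Poor_Quality"), (6, "Medium", "Poor_Quality"),
    (7, "Medium", "Good_Quality"), (8, "Low", "Poor_Quality"),
]

def rule_metrics(objects, condition, decision):
    support = [o for o in objects if condition(o)]
    strength = [o for o in support if o[2] == decision]
    in_class = [o for o in objects if o[2] == decision]
    return (len(support), len(strength),
            len(strength) / len(in_class),   # relative strength
            len(strength) / len(support))    # confidence

m = rule_metrics(objects, lambda o: o[1] == "Low", "Poor_Quality")
print(m)  # → (3, 3, 0.75, 1.0)
```

The result reproduces the bracketed metrics [3, 3, 75.00%, 100.00%] reported for Rule 2.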
Classification Accuracy

Test: leaving-one-out

Confusion matrix
                Poor_Quality  Good_Quality  None
Poor_Quality         3             1          0
Good_Quality         1             3          0

Average accuracy [%]
                Correct  Incorrect  None
Total            75.00     25.00    0.00
Poor_Quality     75.00     25.00    0.00
Good_Quality     75.00     25.00    0.00

Decision Rules

Rule 113
IF (B_Master >= 1634.26)
AND (B_Temp in (1601.2, 1660.22])
AND (B_Pressure in [17.05, 18.45))
AND (A_point = 0.255) AND (Average_O2 = 77)
THEN (Eff = 87) OR (Eff = 88);
[6, 6, 23.08%, 100.00%][0, 0, 0, 0, 0, 0, 0, 3, 3, 0]
[{2164, 2167, 2168}, {2163, 2165, 2166}]
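Leaving-one-out testing can be sketched as follows, here with a 1-nearest-neighbor classifier on the two numeric parameters of the product-quality table (an illustration only; the lecture's 75% figure comes from its own rule-based classifier, though this toy setup happens to produce the same confusion matrix):

```python
# Leaving-one-out evaluation: each object is held out once, a classifier
# is applied using the remaining objects, and a confusion matrix is
# accumulated from (actual, predicted) pairs.
import math

rows = [((1.02, 2.98), "Good"), ((2.03, 1.04), "Poor"),
        ((0.99, 3.04), "Good"), ((2.03, 3.11), "Good"),
        ((0.03, 0.96), "Poor"), ((0.04, 1.04), "Poor"),
        ((0.99, 1.04), "Good"), ((1.02, 0.94), "Poor")]

confusion = {("Good", "Good"): 0, ("Good", "Poor"): 0,
             ("Poor", "Good"): 0, ("Poor", "Poor"): 0}

for i, (x, actual) in enumerate(rows):
    rest = rows[:i] + rows[i + 1:]                       # leave one out
    _, predicted = min(rest, key=lambda r: math.dist(x, r[0]))  # 1-NN
    confusion[(actual, predicted)] += 1

accuracy = (confusion[("Good", "Good")] + confusion[("Poor", "Poor")]) / len(rows)
print(confusion, accuracy)  # accuracy → 0.75
```

With only eight objects, leaving-one-out makes the most of the data: every object serves as a test case exactly once.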
Decision Rules

Rule 12
IF (Ave_Middle_Bed = 0) AND (PA_Fan_Flow = 18) THEN (Efficiency = 71);
[16, 10, 10.31%, 62.50%][1, 1, 2, 10, 2]
[{ 682 }, { 681 }, { 933, 936 },
 { 875, 876, 877, 878, 879, 880, 934, 935, 1000, 1001 },
 { 881, 882 }]

Decision Rule vs Decision Tree Algorithms

F1 F2 F3 F4 | D
 0  0  0  1 | One
 0  0  1  1 | Two
 0  1  1  1 | Three
 1  1  1  1 | Four

Decision tree:

            F2
         0/    \1
        F3      F1
      0/  \1  0/  \1
    One  Two Three Four
   0001 0011  0111 1111
Decision Rules

Rule 1. (F3 = 0) THEN (D = One);   [1, 100.00%, 100.00%][1]
Rule 2. (F2 = 0) AND (F3 = 1) THEN (D = Two);   [1, 100.00%, 100.00%][2]
Rule 3. (F1 = 0) AND (F2 = 1) THEN (D = Three);   [1, 100.00%, 100.00%][3]
Rule 4. (F1 = 1) THEN (D = Four);   [1, 100.00%, 100.00%][4]

Decision Tree vs Rule Tree

F1 F2 F3 F4 | D
 0  0  0  1 | One
 0  0  1  1 | Two
 0  1  1  1 | Three
 1  1  1  1 | Four

Decision tree:

            F2
         0/    \1
        F3      F1
      0/  \1  0/  \1
    One  Two Three Four
   0001 0011  0111 1111

Rule tree (each node tests a single feature):

  F3: 0 → One (0001),   1 → test F2
  F2: 0 → Two (0011),   1 → test F1
  F1: 0 → Three (0111), 1 → Four (1111)

The rules identify the unique features of an object rather than
partitioning the entire population of objects.
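The rule-tree idea above can be coded as a decision list (a sketch: each object is recognized by one distinguishing feature, and unmatched objects fall through to the next rule):

```python
# The four rules as a decision list over the F1–F4 table.

table = {(0, 0, 0, 1): "One", (0, 0, 1, 1): "Two",
         (0, 1, 1, 1): "Three", (1, 1, 1, 1): "Four"}

def rule_classify(f1, f2, f3, f4):
    if f3 == 0:                # Rule 1: only "One" has F3 = 0
        return "One"
    if f2 == 0 and f3 == 1:    # Rule 2
        return "Two"
    if f1 == 0 and f2 == 1:    # Rule 3
        return "Three"
    if f1 == 1:                # Rule 4: only "Four" has F1 = 1
        return "Four"

assert all(rule_classify(*row) == d for row, d in table.items())
print("all four objects classified correctly")
```

Note that an object such as (0, 1, 1, 1) needs at most three single-feature tests here, and each rule reads off one distinguishing feature, whereas the decision tree always routes every object through its fixed sequence of splits.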
Traditional Modeling

• Regression analysis
• Neural network

Data Mining

• Rules
• Decision trees
• Patterns

Data Farming

• Cultivating data rather than assuming that it is available
• Process: data farming → result evaluation → knowledge extraction → decision-making