DDT: Distributed Decision Tree

A. Desai
PhD Scholar, School of Engineering & Applied Sciences, Ahmedabad University

Co-author:
S. Chaudhary
Professor, School of Engineering & Applied Science, Ahmedabad University
Outline

- Introduction
- Need for Research
- Distributed Decision Tree
- Related Work
- Working of Distributed Decision Tree
- Results and Discussion
  - Results
  - Discussion
- Summary
Introduction

- What is Machine Learning (ML)?
  - Decision Tree, Support Vector Machines, Neural Networks, etc.
- Types of problems in ML
  - Classification, regression and clustering
- Problem domains with respect to architecture (single ...
Note: DDT uses a decision tree to solve classification problems using an ensemble of trees and a MapReduce approach.
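The note above summarizes the approach at a high level. A minimal single-machine sketch of that idea is shown below, assuming a scikit-learn environment; the dataset, number of chunks and the plain majority vote are illustrative choices, not the authors' implementation.

```python
# Sketch only: one decision tree per data chunk, combined by majority vote,
# mimicking a map (train per chunk) / reduce (vote) split on a single machine.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_chunks = 4  # in a distributed setting each chunk would sit on a different node
chunks = zip(np.array_split(X_train, n_chunks), np.array_split(y_train, n_chunks))

# "Map" step: train one tree per chunk.
trees = [DecisionTreeClassifier(random_state=0).fit(Xc, yc) for Xc, yc in chunks]

# "Reduce" step: majority vote over the per-tree predictions (binary labels).
votes = np.stack([t.predict(X_test) for t in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("ensemble accuracy:", (y_pred == y_test).mean())
```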
Need for Research

- Data analytics approaches with respect to memory requirement
- Machine Learning and Statistics
- Classical Data Mining
Related Work

- Scalable construction of classification and regression trees
Related Work

- Bagging ensemble of C4.5 trees using MapReduce (a small single-machine analogue is sketched below)
  - Dataset: Breast Cancer dataset
  - Parameters: partition time, map time, reduce time, total time, number of base classifiers and number of nodes
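For context, a compact single-machine analogue of this surveyed approach might look as follows (a sketch, not the surveyed implementation): scikit-learn's bagging meta-estimator wraps an entropy-criterion tree, which only approximates C4.5's gain-ratio splits, and the breast cancer dataset bundled with scikit-learn stands in for the dataset named above.

```python
# Sketch of a bagging ensemble of C4.5-like trees on the breast cancer data.
# The base estimator is passed positionally so this works across scikit-learn versions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = BaggingClassifier(
    DecisionTreeClassifier(criterion="entropy"),  # entropy splits approximate C4.5
    n_estimators=10,   # the "number of base classifiers" parameter from the slide
    random_state=0,
)
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```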
Related Work

- Time-efficient and scalable implementation of Decision Tree using MapReduce
  - Dataset: 1.5-3 million records, synthetic dataset
  - Parameters: execution time versus number of instances, and execution time versus number of nodes
Related Work

- Decision Tree algorithm (ID3) extension using MapReduce
  - Scalable with the size of the data
  - Core: deciding on the split attribute; uses pruning (the split-attribute step is sketched below)
  - Dataset: US Census Bureau dataset (100 GB)
  - Parameter: data size versus running time

Note: in all of the work surveyed, the authors did not compare their implementation with any other similar algorithm.
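The "decide on split attribute" step mentioned above is ID3's core: pick the attribute with the highest information gain. A toy sketch of that computation is shown below; the data and attribute values are made up for illustration and are unrelated to the surveyed work.

```python
# Sketch of ID3's core step: choose the attribute with the highest information gain.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on the given attribute."""
    base = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return base - remainder

# Toy categorical dataset: (outlook, windy) -> play
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"),
        ("overcast", "no"), ("overcast", "yes")]
labels = ["no", "no", "yes", "no", "yes", "yes"]

best = max(range(2), key=lambda i: information_gain(rows, labels, i))
print("split on attribute index:", best)
```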
Working of Distributed Decision Tree

[Workflow diagrams of the Distributed Decision Tree were shown on these slides; the figures are not reproduced here.]
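Since the workflow figures did not survive extraction, only a rough structural sketch is offered here, assuming the tree-per-chunk ensemble described earlier: a map function that fits and serializes a tree for its chunk, a reduce function that collects the trees, and a majority-vote classifier on top. The function names, the pickle-based exchange and the local simulation are all assumptions, not the authors' implementation.

```python
# Structural sketch only: DDT-style training expressed as map/reduce functions
# and simulated locally. Not the authors' implementation.
import pickle
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

def map_train(chunk):
    """Mapper: fit a decision tree on one data chunk and emit it serialized."""
    X_chunk, y_chunk = chunk
    tree = DecisionTreeClassifier(random_state=0).fit(X_chunk, y_chunk)
    return pickle.dumps(tree)

def reduce_collect(serialized_trees):
    """Reducer: gather the per-chunk trees into a single ensemble."""
    return [pickle.loads(blob) for blob in serialized_trees]

def predict_majority(trees, X):
    """Client side: classify by majority vote over the collected trees."""
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # binary labels assumed

X, y = load_breast_cancer(return_X_y=True)
chunks = zip(np.array_split(X, 4), np.array_split(y, 4))  # 4 simulated nodes
ensemble = reduce_collect(map_train(c) for c in chunks)
print("training accuracy:", (predict_majority(ensemble, X) == y).mean())
```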
Results

                                     Accuracy                     Size of Tree                 Number of Leaves
Data-set   #C      #I   #N  #No   DT     BT     DDT    ST      DT     BT      DDT     ST     DT     BT      DDT    ST
bcw         2     699    9    1   98.14  100    96.99  95.85   27     38       3.6    12.5   14     19.5     2.3   6.75
bupa        2     345    6    1   84.64  97.97  65.8   72.75   51     25.75    5.4    13.5   26     13.37    3.2   7.25
crx         2     690    6   10   90.72  100    85.51  86.09   41     102.4    5.6    11.5   29     68.7     3.5   7.5
echo        2     132   11    2   97.3   100    85.13  93.24    3     5.5      3       3      2     3.25     2     2
h-d         5     303   13    1   78.55  100    66.67  68.08   67     89.4    10.4    25.5   34     45.2     5.7  13.25
hv-84       2     435   12    5   97.24  99.08  92.87  95.63   11     19.22    5.4     5      6     10.11    3.2   3
hypo        2    3163    7   19   99.43  99.97  97.88  99.11   13     53.2     6.6     8      7     27.1     3.8   4.5
krkp        2    3196   33    4   99.66  100    92.4   98.97   59     85.6    42.7    40     31     44.8    23.7  21.5
pima        2     768    8    1   84.11  100    77.43  78.26   39     103.8   11      18.5   20     52.4     6     9.75
sonar       2     208   60    1   98.08  100    76.92  81.73   35     22       5.6     9.5   18     11.5     3.3   5.25
Yahoo!      2 1155124   10    1   88.22  89.77  96.47  88.19   45     3130   108.33   26.4   23     6259   54.66  13.7
Average*         N.A.             92.79  99.70  83.76  86.93   34.60  54.49    9.93   17.97  18.70  29.59    5.67 10.25

Note: #C = number of classes, #I = number of instances, #N = number of numeric attributes, #No = number of nominal attributes; * = average over the first 10 datasets only.
Results: Accuracy

[Accuracy comparison chart and two further results charts were shown here; the figures are not reproduced.]
Discussion: Accuracy

- BT is almost perfectly accurate in its predictions (average: 99.7)
- DT is close to the values of DDT and ST, except with ...
Discussion

- DDT and ST outperformed DT and BT
- The implementations of DDT and ST consider a small number of examples from each chunk
- Average size of tree for DT, BT, DDT and ST: 34.60, 54.49, 9.93 and 17.97 respectively
Discussion

- BT takes a very long time to build the model, with an accuracy improvement of just 1% over DT
- Comparing tree sizes, BT and DT differ hugely
- In both cases, DT wins
- DDT and ST results:
  - DDT improves accuracy drastically, with an increase in learning time
  - ST takes advantage of Spark: it builds the model in just a few seconds even on such a large dataset; its accuracy is comparable to DT and BT, and somewhat lower than DDT (a minimal Spark tree sketch follows below)
  - At the same time, the size of the tree and the number of leaves are comparatively lower for ST
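Since the ST figures come from a Spark-based tree learner, a minimal PySpark sketch of training such a tree is shown below; the input path, column names and maxDepth are placeholders, not the authors' pipeline.

```python
# Minimal sketch of an ST-style Spark decision tree (illustrative only; the
# input path and column names are placeholders).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("st-sketch").getOrCreate()

# Placeholder CSV with numeric feature columns f1..f3 and a 0/1 "label" column.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df)

tree = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)
model = tree.fit(train)

print("tree depth:", model.depth, "number of nodes:", model.numNodes)
spark.stop()
```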
Conclusion

- DDT and ST outperformed DT and BT in terms of size of tree and number of leaves, with acceptable classification accuracy
- Average accuracies of DT, BT, DDT and ST over the ten selected datasets are 92.79, 99.70, 83.76 and 86.93 respectively

Reduction achieved by DDT and ST relative to BT and DT:

                               DDT              ST
                           vs BT  vs DT    vs BT  vs DT
Size of tree (sot)          82%    71%      67%    48%
Number of leaves (nol)      81%    70%      65%    45%
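The percentages in the table above can be reproduced from the averages reported in the results table (tree sizes 34.60, 54.49, 9.93, 17.97 and leaf counts 18.70, 29.59, 5.67, 10.25 for DT, BT, DDT, ST), which is how "sot" and "nol" are read here as reductions in size of tree and number of leaves:

```python
# Reproduce the reduction percentages in the conclusion table from the
# averages reported in the results table (size of tree / number of leaves).
size = {"DT": 34.60, "BT": 54.49, "DDT": 9.93, "ST": 17.97}
leaves = {"DT": 18.70, "BT": 29.59, "DDT": 5.67, "ST": 10.25}

for model in ("DDT", "ST"):
    for baseline in ("BT", "DT"):
        sot = 100 * (1 - size[model] / size[baseline])
        nol = 100 * (1 - leaves[model] / leaves[baseline])
        print(f"{model} vs {baseline}: size of tree -{sot:.0f}%, leaves -{nol:.0f}%")
```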
Thank you for your attention!

- Any questions?
Contact Details

- Ankit Desai
  email: [email protected]
- Sanjay Chaudhary
  email: [email protected]