POLYCiNN: Multiclass Binary Inference Engine Using Convolutional Decision Forests
Abstract—Convolutional Neural Networks (CNNs) have achieved significant success in image classification. One of the main reasons that CNNs achieve state-of-the-art accuracy is the use of many multi-scale learnable windowed feature detectors called kernels. Fetching kernel weights from memory and performing the associated multiply and accumulate computations consume a massive amount of energy. This hinders the widespread usage of CNNs, especially in embedded devices. In comparison with CNNs, decision forests are computationally efficient since they are composed of decision trees, which are binary classifiers by nature and can be implemented using AND-OR gates instead of costly multiply and accumulate units. In this paper, we investigate the migration from CNNs to decision forests as one of the promising approaches for reducing both execution time and power consumption while achieving acceptable accuracy. We introduce POLYCiNN, an architecture composed of a stack of decision forests. Each decision forest classifies one of the overlapped sub-images of the original image. Then, all decision forest classifications are fused together to classify the input image. In POLYCiNN, each decision tree is implemented in a single 6-input Look-Up Table and requires no memory access. Therefore, POLYCiNN can be efficiently mapped to simple and densely parallel hardware designs. We validate the performance of POLYCiNN on the benchmark image classification tasks of the MNIST, CIFAR-10 and SVHN datasets.

Index Terms—Deep Learning, Decision Forests, Decision Trees, Hardware Accelerators, FPGAs

I. INTRODUCTION

Convolutional Neural Networks (CNNs) have been overwhelmingly dominant in many computer vision problems, especially image classification [1]. The recent success of CNNs is mainly due to the tremendous development of many deep architectures such as AlexNet [2], GoogleNet [3] and ResNet [1]. These deep CNN architectures are trained to extract representative features from their inputs through several non-linear convolutional layers. Typically, in each convolutional layer, many pre-trained windowed feature detectors called kernels are applied to the layer inputs. One or more fully connected layers then combine the top-level extracted features and produce the classification decision.

Although CNNs achieve state-of-the-art accuracy in many tasks, they have deficiencies that limit their use in embedded applications [1]. A main downside of CNNs is their computational complexity. They typically demand many Multiply and Accumulate (MAC) and memory access operations in both training and inference [5]. Another drawback of CNNs is that they require careful selection of multiple hyper-parameters such as the number of convolutional layers, the number of kernels, the kernel size and the learning rate [1]. This results in a large design space exploration that makes the training process of CNNs time consuming, because several interfering parameters must be explored across many configuration combinations.

Current CNN applications are typically trained and run on clusters of computers with Graphical Processing Units (GPUs). However, the limited throughput of mainstream processors and the high power consumption of GPUs limit their applicability in embedded and edge computing CNN applications [6].

Recently, there has been increased interest in other classifiers that should 1) suit the nature of hardware accelerators by fully utilizing their specific computing resources to maximize parallelism when executing a large number of operations [7] [8]; 2) achieve acceptable classification accuracy [9] [10] [11]; 3) be amenable to finding a robust model for a given task; and 4) be simple to train [12]. Decision Forests (DFs) were introduced as efficient models for classification problems [13]. They operate by constructing a stack of Decision Trees (DTs) and then voting on the most popular output class. Since DTs are binary in nature and can be implemented using AND-OR gates, DFs can be efficiently mapped to simple and densely parallel hardware architectures [14] [15]. Moreover, DFs can be trained quickly and are considered handy classifiers since they do not have many hyper-parameters [16]. However, in contrast to CNNs, DFs do not achieve state-of-the-art accuracy on several applications [8]. CNNs outperform DFs in terms of accuracy because they deploy several convolutional layers with many kernels to extract representative features from raw data. On the other hand, DFs divide the feature space into subspaces based on simple comparison operations on the input data.

The motivation of this paper is based on three observations. The first observation stems from the fact that CNNs achieve state-of-the-art accuracy by sliding many kernels over images. This motivates us to propose convolutional DFs, where DFs are applied over sliding windows (sub-images) of the original image. The second observation is that most Field-Programmable Gate Arrays (FPGAs) fit any function of up to six inputs in a single 6-input Look-Up Table (LUT).
Fig. 1. Overview of the POLYCiNN architecture with w windows and M classes.
of POLYBiNNs, where each POLYBiNN classifies one image window using the extracted LBP feature vector of the original window and the corresponding window of the downsampled image (DI). Fig. 1 shows an example for the CIFAR-10 dataset, where w = 9, with 16×16 windows and a stride of eight pixels, and the downsampled image is 8×8 with nine windows of size 6×6 and a stride of one pixel.

B. Local Binary Pattern feature extraction

The main goal of this layer is to obtain the most relevant information from the inputs and represent that information in a lower-dimensional space. We choose LBP descriptors [24] because they measure the spatial structure of local image texture efficiently and with simple parallel computations that fit the nature of most hardware accelerators such as FPGAs. The LBP descriptor is formed by comparing the intensity of the center pixel to its neighboring pixels within a patch. Neighbor pixels with higher intensity than the center pixel are assigned a value of 1, and 0 otherwise. LBP patch sizes are normally 3×3, 5×5, etc.; however, we restrict the intensity comparison of the center pixel to its four adjacent neighbor pixels (top, right, bottom and left), which reduces memory access cost. This approach is more suitable for hardware implementation since the comparisons are computed row-wise and column-wise, as discussed in Section III.D.

Each pixel is thus represented with a 4-bit string computed by comparing the pixel intensity to the intensities of its four corresponding neighbors. The final feature vector of each window is the histogram of the feature values within the corresponding window. The histogram provides better discrimination of the inputs and reduces the dimensionality of the inputs to 16 (all possible values of a 4-bit string). Fig. 2 shows an example of computing the LBP feature vector of a local image window.

C. POLYCiNN training algorithm

We train each POLYBiNN classifier on its corresponding LBP feature vector and downsampled image window. POLYBiNNs are trained using AdaBoost, an ensemble learning algorithm that creates complex classifiers by combining many weak DTs [22]. We limit the number of nodes of each DT to six in order to implement each DT as a single 6-input LUT. Once all N DTs within the same POLYBiNN have been trained, their outputs are combined to produce M decisions (D1 to DM) with M confidences (C1 to CM). The output Dm is the binary decision for class m in the corresponding POLYBiNN and can be 0 or 1. The output Cm is a 2-bit confidence value for the corresponding binary decision Dm. When the training process of all POLYBiNNs is completed, we merge their outputs using a decision fusion approach to obtain a decision for a given input. For each class m, the final confidence CFm is computed by summing all the corresponding Cm together. We select the class with the highest confidence as the final classification decision. Fig. 3 shows the overall process.

D. Implementing POLYCiNN in hardware

As shown in Fig. 1, POLYCiNN consists of an LBP feature extraction layer, a stack of POLYBiNNs, where each POLYBiNN is composed of an array of DTs followed by a voting circuit, and finally a decision fusion circuit that merges the POLYBiNN classifications. We propose an efficient hardware implementation of the LBP layer, as shown in Fig. 4. The architecture is composed of two arrays of comparators: row comparators and column comparators. The row array of comparators compares the intensity of each pixel with the intensity of its adjacent bottom neighbor in the consecutive row. The column array of comparators compares the intensity of each pixel with the intensity of its adjacent right neighbor in the consecutive column.
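To make the LUT mapping of Section III.C concrete, the sketch below (ours, not the authors' code; the Node structure is a hypothetical stand-in for a trained DT) enumerates all 64 combinations of the six comparison outcomes and tabulates the tree into the contents of one 6-input LUT.

# Hedged sketch: tabulating a decision tree with at most six decision nodes
# into the 64-entry truth table of a 6-input LUT. Node k consumes comparison
# bit k (the outcome of one feature comparison); leaves hold a 0/1 decision.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    bit: Optional[int] = None      # index of the comparison bit (0..5), None for a leaf
    leaf: Optional[int] = None     # 0/1 decision when this node is a leaf
    low: Optional["Node"] = None   # child taken when the comparison bit is 0
    high: Optional["Node"] = None  # child taken when the comparison bit is 1

def evaluate(node: Node, bits: int) -> int:
    """Traverse the tree using the six packed comparison outcomes in `bits`."""
    while node.leaf is None:
        node = node.high if (bits >> node.bit) & 1 else node.low
    return node.leaf

def to_lut(root: Node) -> List[int]:
    """Enumerate all 2^6 input patterns; the result is the LUT initialization."""
    return [evaluate(root, bits) for bits in range(64)]

# Example: a tiny tree that uses three of the six allowed decision nodes.
tree = Node(bit=0,
            low=Node(bit=1, low=Node(leaf=0), high=Node(leaf=1)),
            high=Node(bit=2, low=Node(leaf=1), high=Node(leaf=0)))
lut = to_lut(tree)      # 64 binary entries, i.e. one 6-input LUT
print(len(lut), lut[:8])

Because every DT is restricted to six decision nodes, its output is a Boolean function of at most six comparison bits, which is exactly what a single 6-input LUT stores; the same enumeration yields the LUT initialization values for a hardware design.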
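The decision fusion rule of Section III.C can likewise be sketched in a few lines. The array shapes and the gating of each confidence Cm by its decision Dm are our assumptions; the text only states that the final confidence CFm sums the corresponding Cm and that the class with the highest CFm wins.

# Hedged sketch of decision fusion: each of the w POLYBiNNs emits, for every
# class m, a binary decision Dm and a 2-bit confidence Cm; the final class is
# the one with the largest summed confidence CFm.

import numpy as np

def fuse(decisions: np.ndarray, confidences: np.ndarray) -> int:
    """
    decisions:   (w, M) array of 0/1 values Dm, one row per POLYBiNN.
    confidences: (w, M) array of 2-bit values Cm in {0, 1, 2, 3}.
    Returns the index of the winning class.
    """
    # Assumption: count confidence only toward classes the POLYBiNN voted for.
    cf = (decisions * confidences).sum(axis=0)   # CFm, one total per class
    return int(np.argmax(cf))

# Toy example: w = 3 windows, M = 4 classes.
D = np.array([[1, 0, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]])
C = np.array([[2, 1, 0, 3],
              [1, 0, 2, 3],
              [3, 0, 1, 1]])
print(fuse(D, C))  # class 3 wins with CF3 = 7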
Fig. 2. Local binary pattern encoding process.
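Complementing Fig. 2, the following NumPy sketch (an illustration under our assumptions, including the bit order of the 4-bit code and edge padding at image borders, neither of which is specified above) computes the 4-neighbor LBP code of every pixel using whole-array row-wise and column-wise comparisons, in the spirit of the comparator arrays of Section III.D, and then forms the 16-bin histogram of each 16×16 window with a stride of eight pixels, matching the CIFAR-10 setting of Fig. 1.

# Hedged sketch of the 4-neighbor LBP features of Section III.B: each pixel
# gets a 4-bit code from comparisons with its top, right, bottom and left
# neighbors, and each window is summarized by the 16-bin histogram of codes.

import numpy as np

def lbp4_codes(img: np.ndarray) -> np.ndarray:
    """img: (H, W) grayscale image. Returns (H, W) codes in 0..15.
    Border pixels use replicated (edge-padded) neighbors, an assumption."""
    p = np.pad(img, 1, mode="edge").astype(np.int32)
    top    = (p[:-2, 1:-1] > p[1:-1, 1:-1]).astype(np.uint8)
    right  = (p[1:-1, 2:]  > p[1:-1, 1:-1]).astype(np.uint8)
    bottom = (p[2:,  1:-1] > p[1:-1, 1:-1]).astype(np.uint8)
    left   = (p[1:-1, :-2] > p[1:-1, 1:-1]).astype(np.uint8)
    return (top << 3) | (right << 2) | (bottom << 1) | left   # 4-bit string

def window_histograms(img: np.ndarray, win: int = 16, stride: int = 8) -> np.ndarray:
    """Returns one 16-bin histogram per sliding window (the LBP feature vectors)."""
    codes = lbp4_codes(img)
    feats = []
    for r in range(0, img.shape[0] - win + 1, stride):
        for c in range(0, img.shape[1] - win + 1, stride):
            patch = codes[r:r + win, c:c + win]
            feats.append(np.bincount(patch.ravel(), minlength=16))
    return np.stack(feats)

# A 32x32 CIFAR-10-sized image yields w = 9 windows of 16 histogram bins each.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32))
print(window_histograms(image).shape)  # (9, 16)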
In the second set of experiments, we trained POLYCiNN using the extracted LBP feature vectors. The third set trained POLYCiNN using the extracted LBP feature vectors together with the downsampled image.

For comparison, we reproduced the results of training POLYBiNN [7] on the raw data of the three datasets. We also studied the effect of changing the number of DTs per POLYBiNN on the accuracy. We trained POLYBiNN for classifying the three datasets with 100, 500 and 1000 DTs per class. In the case of POLYCiNN, we trained it with 100, 500 and 1000 DTs per class for each window. For CIFAR-10 and SVHN, we divided each image into nine windows of size 16×16 with a stride of eight pixels. For MNIST, we divided each image into nine windows of size 22×22 with a stride of three pixels.

Fig. 5 shows the accuracy for MNIST, CIFAR-10 and SVHN as a function of the number of DTs in the different experiments. On all three datasets, POLYCiNN outperforms POLYBiNN in terms of classification accuracy when both are trained using raw data. This is because POLYCiNN has the advantage of using sliding DFs. The accuracy curves for CIFAR-10 and SVHN indicate that using LBP features is a powerful approach that can achieve the same performance as using raw data. Moreover, Fig. 5 shows that training POLYCiNN using LBP features and downsampled images achieves higher classification accuracy on all three datasets. This is because each POLYBiNN is trained using the LBP feature vector of its corresponding window together with its neighboring area from the downsampled image. The accuracy of POLYCiNN is sensitive to the number of DTs up to a limit where the classification accuracy saturates, as shown in Fig. 5. Table I compares the accuracy of POLYCiNN on the MNIST, CIFAR-10 and SVHN datasets with prior works. Although POLYCiNN is inferior to state-of-the-art CNNs [2] [3] [4] in terms of accuracy, it obtains better results than other DF approaches [7] [8] [9].

Fig. 5. POLYCiNN accuracy for the CIFAR-10, SVHN and MNIST datasets.

TABLE I
ACCURACY COMPARISON WITH EXISTING DECISION TREE APPROACHES

                 Accuracy (%)
              MNIST    CIFAR-10    SVHN
  [9]         99.10    -           -
  [20]        96.76    -           -
  [8]         99.26    63.37       -
  [7]         97.45    55.12       71.68
  POLYCiNN*   98.23    63.43       76.35

  * Nine windows per image and 1000 DTs per class of each window.

POLYCiNN with the LBP feature extraction layer and downsampled image suits the CIFAR-10 and SVHN datasets more than the MNIST dataset, as shown in Fig. 5. This is because the variability of the CIFAR-10 and SVHN datasets in terms of image translations, scales, rotations, color spaces and geometrical deformations is much greater than that of the MNIST dataset. When a space of high variability is divided into sub-spaces using DFs, the variations become less pronounced and the gains from fusing classifiers over all sub-spaces are notable. On the other hand, when there are few variations, the gains of space division are less notable since there is not much reduction in variability.

B. Hardware implementation

The hardware implementation of POLYCiNN is composed of three main parts: 1) LBP feature extraction, 2) decision forests of decision trees, and 3) the decision fusion circuit. As discussed in Section III.D, the simple computations needed to compute the LBP feature vectors should not hinder the implementation, since these features are computed using arrays of comparators. The delay caused in this part by waiting for neighbor pixels to be loaded is negligible, because the arrays of comparators access memory symmetrically (row by row and column by column). Moreover, this approach to implementing LBP can be parallelized by using many arrays of comparators, which increases throughput at the cost of computational and memory access resources.

Concerning the second part, all DTs have at most six decision nodes. Consequently, each DT is implemented in a single LUT. It should be noted that DTs with more nodes can be used to increase the classification accuracy; however, this requires more LUTs to implement these DTs. The third part, the decision fusion circuit, uses a few accumulators and a set of pipelined comparators that should not restrict the implementation. Future work will focus on experimenting with different implementations of POLYCiNN on FPGAs.

V. CONCLUSION

This paper presented POLYCiNN, a classifier inspired by CNNs and Decision Forest (DF) classifiers. POLYCiNN comprises a stack of DFs, where each DF classifies one of the overlapped image windows. POLYCiNN deploys an efficient LBP feature extraction layer that improves its classification accuracy. We demonstrated that POLYCiNN achieves the same accuracy as prior DF approaches on the MNIST, CIFAR-10 and SVHN datasets. From a hardware perspective, POLYCiNN can be implemented using efficient computational and memory resources. Moreover, it can be configured to suit various hardware accelerators and embedded devices.

VI. ACKNOWLEDGMENTS

The authors would like to thank Imad Benacer and Siva Chidambaram for their insightful comments.
REFERENCES

[1] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning." Nature, May 2015.
[2] A. Krizhevsky, I. Sutskever and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems, 2012.
[3] C. Szegedy et al., "Going Deeper with Convolutions." IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[4] S. Han, H. Mao and W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." arXiv preprint:1510.00149, Oct. 2015.
[5] D. Hunter, H. Yu, M. S. Pukish, J. Kolbusz and B. M. Wilamowski, "Selection of Proper Neural Network Sizes and Architectures—A Comparative Study." IEEE Transactions on Industrial Informatics, May 2012.
[6] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra and G. Boudoukh, "Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?" ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2017.
[7] A. M. Abdelsalam, A. Elsheikh, J. P. David and J. M. P. Langlois, "POLYBiNN: A Scalable and Efficient Combinatorial Inference Engine for Neural Networks on FPGA." IEEE Conference on Design and Architectures for Signal and Image Processing, Oct. 2018.
[8] Z. H. Zhou and J. Feng, "Deep Forest: Towards an Alternative to Deep Neural Networks." arXiv preprint:1702.08835, Feb. 2017.
[9] P. Kontschieder, M. Fiterau, A. Criminisi and S. B. Rota, "Deep Neural Decision Forests." IEEE International Conference on Computer Vision, 2015.
[10] P. Gysel, M. Motamedi and S. Ghiasi, "Hardware-Oriented Approximation of Convolutional Neural Networks." arXiv preprint:1604.03168, Apr. 2016.
[11] K. Abdelouahab, M. Pelcat, J. Serot and F. Berry, "Accelerating CNN Inference on FPGAs: A Survey." arXiv preprint:1806.01683, May 2018.
[12] C. Hettinger, T. Christensen, B. Ehlert, J. Humpherys, T. Jarvis and S. Wade, "Forward Thinking: Building and Training Neural Networks One Layer at a Time." arXiv preprint:1706.02480, Jun. 2017.
[13] L. Breiman, "Random Forests." Machine Learning, Oct. 2001.
[14] S. B. Akers, "Binary Decision Diagrams." IEEE Transactions on Computers, Jun. 1978.
[15] P. T. Tang, "Table-Lookup Algorithms for Elementary Functions and Their Error Analysis." IEEE Symposium on Computer Arithmetic, Jun. 1991.
[16] T. Hastie, R. Tibshirani and J. H. Friedman, "The Elements of Statistical Learning." Springer Series in Statistics, 2009.
[17] Y. Cheng, D. Wang, P. Zhou and T. Zhang, "A Survey of Model Compression and Acceleration for Deep Neural Networks." arXiv preprint:1710.09282, Oct. 2017.
[18] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv and Y. Bengio, "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1." arXiv preprint:1602.02830, Feb. 2016.
[19] E. Wang, J. J. Davis, R. Zhao, H. C. Ng, X. Niu, W. Luk, P. Y. Cheung and G. A. Constantinides, "Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going." arXiv preprint:1901.06955, Jan. 2019.
[20] N. Frosst and G. Hinton, "Distilling a Neural Network into a Soft Decision Tree." arXiv preprint:1711.09784, Nov. 2017.
[21] Q. Zhang, Y. Yang, Y. N. Wu and S. C. Zhu, "Interpreting CNNs via Decision Trees." arXiv preprint:1802.00121, Feb. 2018.
[22] A. M. Abdelsalam, A. Elsheikh, S. Chidambaram, J. P. David and J. M. P. Langlois, "POLYBiNN: Binary Inference Engine for Neural Networks using Decision Trees." Journal of Signal Processing Systems, May 2019.
[23] K. Miller, C. Hettinger, J. Humpherys, T. Jarvis and D. Kartchner, "Forward Thinking: Building Deep Random Forests." arXiv preprint:1705.07366, May 2017.
[24] T. Ojala, M. Pietikainen and T. Maenpaa, "Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns." IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 971-987, Jul. 2002.
[25] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Publishing House of Electronics Industry, Mar. 2002.