Transfer Learning Based Data-Efficient Machine Learning Enabled Classification
Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress
Houbing Song
Embry-Riddle Aeronautical University
Daytona Beach, FL 32114 USA
Email: [email protected]
Abstract—Recently, waste sorting has become more and more important in our daily life. It plays an essential role in the big picture of waste recycling, reducing environmental pollution significantly. Deep learning (DL) methods have been dominating the field of image classification and have been successfully applied to waste sorting tasks, achieving state-of-the-art performance. However, most traditional DL methods require a massive amount of annotated data for the training phase. Unfortunately, there is only one small dataset for waste sorting, TrashNet, created at Stanford. In addition, manually collecting and labeling a massive dataset can be too costly. To address this issue, we apply transfer learning (TL) techniques to construct a robust model from a fairly small set of training data by transferring knowledge from existing deep networks, such as AlexNet, ResNet, and DenseNet. As an innovation, we propose a novel domain loss function, Dual Dynamic Domain Distance (4D), to produce a more accurate domain distance measurement. This paper makes three contributions. First, our model achieves the best performance reported on the TrashNet data. Second, this is the first time that TL has been used for waste sorting. Finally, the proposed 4D domain loss improves the performance of TL for this task. In this paper, we apply two transfer learning methods, DDC and DeepCoral, to the TrashNet dataset, and the DeepCoral-ResNet50 model yields the best performance of 96% test accuracy. More importantly, this work can be easily generalized to other image classification tasks.

Keywords—Deep learning, transfer learning, image classification, waste sorting, data-efficient machine learning

I. INTRODUCTION

We are entering a new era of smart cities, which offers great promise for improved wellbeing and prosperity but poses significant challenges [1]–[3]. Machine learning and data analytics have emerged as essential tools to address the challenges that smart cities are facing [4]–[7].

Rapidly increasing pollution from overpopulation and industrialization is causing serious damage to the natural environment of the Earth. As a consequence, water pollution, air pollution, and deforestation are causing a number of negative effects on our health and the economy, such as an increasing cancer rate, new diseases, extinction of species, and soil contamination. For example, toxic materials can be transferred into human bodies and wildlife from air, water, and food. Moreover, soil contamination can seriously hurt all fields related to agriculture. As shown in the study of [8], the expense of pollution control has been increasing exponentially in the past few decades, and many potential solutions have been proposed. To the best of our knowledge, recycling is widely acknowledged as one of the proven ways to reduce environmental pollution effectively. In general, the benefits of recycling include reducing the waste sent to landfills, reducing greenhouse gas emissions, and saving the resources needed to produce raw materials. Furthermore, accurately sorting the waste from our daily life is the first and most important step in the big picture of recycling. Therefore, finding an effective and efficient way to sort waste is key to the success of the recycling process.

In this paper, our focus is on building a DL model for solid waste sorting, which lands in the field of image classification. Traditional image processing methods use hand-designed features to complete tasks like classification, detection, and segmentation. However, designing features by hand is a very time-consuming and costly process, and it does not always deliver promising performance on complicated tasks. In the recent decade, DL has come to dominate this field, largely freeing us from designing features by hand while improving performance. Additionally, one of the most famous DL models, the convolutional neural network (CNN), has shown its great power in a number of different fields, such as object classification, object detection, and speech recognition. Generally, a deep neural network (DNN) enables the machine to learn how to accomplish the task; in other words, a DNN can be considered a black box with a massive number of parameters, and the goal is to reach the best performance by iteratively adjusting the values of those parameters according to a set of rules. However, most DL methods require a huge set of well-labeled training data to achieve promising performance. In many real-world problems, we do not have a sufficient amount of labeled data for training, or we cannot even find unlabeled training data. Researchers started focusing on transfer learning to address this issue, which allows us to leverage the knowledge stored in other well-trained models. Moreover, there are not many waste sorting datasets that can provide enough training data for deep networks. Therefore, we propose a transfer learning model for this task.

According to [9], there are three common transfer learning settings: inductive transfer learning, transductive transfer learning, and unsupervised transfer learning. In general, a transfer learning task involves multiple domains: one target domain and one or more source domains. In inductive transfer learning, supervised training data is always available in the target domain. In transductive transfer learning, well-labeled data is only available in the source domain. In unsupervised transfer learning, there is no labeled data in either the source domain or the target domain. The setting of the proposed model fits into inductive transfer learning. In addition, we only have a small set of data [10] that contains 2530 images in total, which might not be enough for building a robust waste sorting model. We therefore use domain adaptation techniques to leverage the knowledge stored in deeply trained models, such as AlexNet [11] and ResNet [12], that are trained on the ImageNet dataset. By doing so, we were able to push the testing accuracy to 96% with such a small dataset.

The rest of the paper is organized as follows. Section II presents related work. The dataset is introduced in Section III. We present our proposed methodologies in Section IV. Experimental results are discussed in Section V. Section VI concludes this paper.

[Figure 1. Source Data & Target Data: sample images from the six classes Cardboard, Glass, Metal, Paper, Plastic, and Trash.]

II. RELATED WORK

Previously, many image classification projects have been created; however, not many of them are related to waste sorting. In this section, we introduce a number of projects related to waste sorting and, for better understanding, categorize them into three subfields: traditional methods, conventional DL methods, and transfer learning methods.

A. Traditional Methods

A traditional model, the support vector machine (SVM), is considered one of the best early image classification methods, and compared to DL models it is simpler to build and easier to train. [10] built an SVM model for waste sorting based on a hand-designed feature detector, SIFT. The SIFT descriptor is one of the most powerful feature detectors, and it is invariant to scale, noise, and illumination [13]; thus, it is extremely helpful for waste sorting. Furthermore, the best SVM kernel was found by testing a number of different kernels. It is the radial basis function (RBF) kernel, defined as

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) \quad (1)$$

The best performance achieved by the SVM was 63% testing accuracy.
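For concreteness, the following is a minimal sketch of this kind of classical pipeline, assuming the images have already been encoded into fixed-length feature vectors (the SIFT bag-of-features encoding of [10] is not reproduced here, and the placeholder data below is random); scikit-learn's SVC with an RBF kernel implements (1), with gamma corresponding to 1/(2 sigma^2).

# Sketch: RBF-kernel SVM over precomputed image feature vectors.
# Assumes X (n_samples x n_features) and y (labels) come from a
# feature extractor such as the SIFT encoding used in [10].
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))      # placeholder features
y = rng.integers(0, 6, size=500)     # six TrashNet classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

sigma = 1.0                          # kernel width sigma from (1)
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2)))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))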
B. Conventional DL Methods

As mentioned earlier, one of the greatest advantages of DL methods is that deep networks can learn features automatically instead of relying on features designed by hand. However, DL models require a match between the size of the data and the size of the network; a significant mismatch usually causes over-fitting or under-fitting. [10] built a CNN that can be considered a simplified version of AlexNet [11]. As claimed by the authors, this model only achieved 22% testing accuracy, which is worse than a pure guess. Moreover, [14] selected three successful DL architectures, namely MobileNet [15], DenseNet [16], and Inception [17], to train from scratch. Those models achieved testing accuracies of 84%, 84%, and 89%, respectively. DL models thus achieve better performance than traditional models.

However, there are two main drawbacks to conventional DL methods. First, the selected models are reasonably deep and complicated; training them from scratch is very time-consuming and can over-fit with such a small dataset. Second, one advantage we have is that a number of other datasets contain the object categories that appear in TrashNet, and we could benefit from those samples if the distribution mismatches could be reduced. However, conventional DL methods cannot take advantage of samples from other domains.

C. Transfer Learning Model

To address the drawbacks of conventional methods, numerous transfer learning methods have been proposed. Commonly, the distribution mismatch between the source domain and the target domain is the main issue that prevents us from using samples collected from different domains for training.
[Figure: image counts in the source and target datasets.]

As one of the solutions, fine-tuning is acknowledged to be an effective way to deal with the distribution mismatch. [14] also applied fine-tuning to the selected DL architectures to lift the performance to a new level; as shown in Table I, the authors pushed the best testing accuracy to 95% by combining fine-tuning and data augmentation.
IV. PROPOSED METHODOLOGIES

The target dataset alone is not large enough, so we need to transfer the knowledge learned from the source domain. This paper implements a novel loss function with dynamic weighting and builds four different models: DDC-AlexNet, DDC-ResNet, DeepCoral-AlexNet, and DeepCoral-ResNet.

A. DDC-AlexNet

AlexNet [11] won the ILSVRC 2012 competition, achieving a top-5 test error rate of 15.3% on the ImageNet dataset. The idea of the adaptation layer was proposed by [27], which introduced a modified feedforward neural network, the Domain Adaptive Neural Network (DaNN), with one adaptation layer. Its loss function is constructed from two parts: the general classification loss and an MMD regularizer, where the MMD loss evaluates the distribution mismatch between the source and target domains. However, DaNN is a very shallow and simple model, so its performance is limited. To achieve better performance, we wish to extend the potential of DaNN to deeper networks. As illustrated in Figure 3, Deep Domain Confusion (DDC) [18], an AlexNet-based [11] convolutional neural network (CNN) with one adaptation layer, was proposed to learn a semantically meaningful and domain-invariant representation. Additionally, the MMD metric can also be used to determine the position and the dimensionality of the adaptation layer.

[Figure 3. DDC with AlexNet backend: two weight-shared streams (conv1–conv5, fc6, fc7, fc_adapt, fc8), one for labeled source images and one for unlabeled target images, trained with a classification loss and a domain loss.]

DDC deploys a loss function that contains two terms: the classification loss L_C and the MMD constraint MMD^2. As shown in (2), X_S and X_T represent the datasets from the source domain and the target domain, L_C is computed over the labeled images X_L with labels y, and λ determines how strongly we would like to confuse the domains:

$$\mathcal{L} = \mathcal{L}_C(X_L, y) + \lambda\,\mathrm{MMD}^2(X_S, X_T) \quad (2)$$

The MMD term measures the squared distance between the mean embeddings of the two domains:

$$\mathrm{MMD}^2(X_S, X_T) = \left\| \frac{1}{n_S}\sum_{i=1}^{n_S}\phi\!\left(x_S^i\right) - \frac{1}{n_T}\sum_{j=1}^{n_T}\phi\!\left(x_T^j\right) \right\|_{\mathcal{H}}^2 \quad (3)$$
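As a concrete illustration, the following is a minimal PyTorch sketch of (3) with φ taken as the identity map, i.e., the linear MMD often used in DDC-style implementations; batch_mmd2 is an assumed helper name, and f_s, f_t stand for adaptation-layer activations of a source batch and a target batch.

import torch

def batch_mmd2(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """Squared MMD between two feature batches, as in (3), with
    phi = identity (linear kernel). f_s: (n_S, d), f_t: (n_T, d)."""
    delta = f_s.mean(dim=0) - f_t.mean(dim=0)   # difference of empirical means
    return (delta * delta).sum()                 # squared norm of the difference

# Example: adaptation-layer activations for a source and a target batch.
f_s, f_t = torch.randn(32, 256), torch.randn(48, 256)
print(batch_mmd2(f_s, f_t))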
In the original paper, λ is a fixed coefficient. However, setting a reasonable value for it is not a simple process: a larger value can lead the model to focus too much on reducing the distribution mismatch, while a smaller value might yield poor classification accuracy on the target domain because the distribution mismatch receives too little attention. Therefore, we propose to make λ a dynamic factor. As described in (4), it is a hyperbolic tangent function that scales from 0 to 1. Theoretically, we wish to focus on extracting domain-invariant features in the early stage.

$$\lambda = \tanh(0.02x) \quad (4)$$
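A short sketch of how the dynamic coefficient in (4) could enter the DDC objective (2); here we assume x indexes the training epoch, which the text does not state explicitly, and the loss values are placeholders.

import math
import torch

def dynamic_lambda(epoch: int) -> float:
    # Equation (4): grows smoothly from 0 toward 1 over training.
    return math.tanh(0.02 * epoch)

# Combined objective of (2) for one batch; clf_loss and mmd2 would
# normally come from the classifier head and from batch_mmd2 above.
clf_loss = torch.tensor(1.7)   # placeholder classification loss L_C
mmd2 = torch.tensor(0.4)       # placeholder MMD^2 term from (3)
for epoch in (1, 50, 200):
    loss = clf_loss + dynamic_lambda(epoch) * mmd2
    print(epoch, float(loss))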
B. DDC-ResNet

DDC is a transfer learning architecture that can be easily generalized to other pre-trained DL models. In this paper, we also examine a ResNet-based DDC model, in which the adaptation layer with the dynamic loss function is added after the last average-pooling layer.
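A minimal sketch of such a ResNet-based DDC model, under these assumptions: torchvision's ResNet-50 as the pre-trained backbone, a 256-unit bottleneck as the adaptation layer (its width is our assumption, not stated in the paper), and a 6-way classifier for the TrashNet classes.

import torch
import torch.nn as nn
from torchvision import models

class DDCResNet(nn.Module):
    def __init__(self, num_classes: int = 6, adapt_dim: int = 256):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Everything up to and including the last average-pooling layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.adapt = nn.Linear(2048, adapt_dim)   # adaptation layer after avg-pool
        self.classifier = nn.Linear(adapt_dim, num_classes)

    def forward(self, x):
        f = self.features(x).flatten(1)   # (batch, 2048)
        a = self.adapt(f)                 # domain loss (e.g., MMD) applies here
        return self.classifier(a), a      # logits and adaptation features

logits, feats = DDCResNet()(torch.randn(2, 3, 224, 224))
print(logits.shape, feats.shape)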
C. DeepCoral-AlexNet

Furthermore, [16] introduced another transfer learning framework, DeepCoral, which shares a similar idea with DDC. As shown in Figure 4, it places one adaptation layer after the last fully connected layer and adds a new loss function, the CORAL loss. ℓ_CORAL is defined as the distance between the second-order covariances of the source and target features, as described in (5),

$$\ell_{\mathrm{CORAL}} = \frac{1}{4d^2}\,\|C_S - C_T\|_F^2 \quad (5)$$

where C_S and C_T are the feature covariance matrices, ‖·‖_F^2 is the squared matrix Frobenius norm, and d is the feature dimensionality.

[Figure 4. Deep coral with AlexNet backend: two weight-shared streams (conv1–conv5, fc6–fc8) trained with a classification loss on fc8 and a CORAL loss aligning the source and target features.]
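The following is a minimal PyTorch sketch of (5), following the standard Deep CORAL formulation; coral_loss is an assumed helper name, and the inputs are batch activations of the adaptation layer.

import torch

def coral_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """CORAL loss of (5): squared Frobenius distance between the feature
    covariances of source and target batches, scaled by 1/(4 d^2)."""
    def covariance(f):
        n = f.size(0)
        f = f - f.mean(dim=0, keepdim=True)   # center the features
        return (f.t() @ f) / (n - 1)          # (d, d) sample covariance
    d = f_s.size(1)
    c_s, c_t = covariance(f_s), covariance(f_t)
    return ((c_s - c_t) ** 2).sum() / (4 * d * d)

print(coral_loss(torch.randn(32, 256), torch.randn(48, 256)))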
[Table II. Transfer Learning Performance.]
V. EXPERIMENTAL RESULTS
A. Experimental Setup

As mentioned in Section III, we have 2754 labeled images in the source domain and 2530 labeled images in the target domain. Images in the two domains share the same set of labels but follow different distributions. In the experiment, we split the target dataset into Target-train and Target-test at a ratio of 80/20, and the total number of epochs is set to 200. Additionally, to extend the dataset even further, we apply data augmentation.

[Figure: test accuracy over training for the four models CoralRes, CoralAlex, DDCRes, and DDCAlex.]
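A sketch of the described setup, with assumed specifics: torchvision's ImageFolder over a hypothetical local TrashNet path and a random 80/20 split; the exact augmentation transforms are not recoverable here, so the ones below are illustrative only.

import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),        # illustrative augmentation
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

target = datasets.ImageFolder("trashnet/", transform=train_tf)  # assumed path
n_train = int(0.8 * len(target))                                # 80/20 split
train_set, test_set = random_split(
    target, [n_train, len(target) - n_train],
    generator=torch.Generator().manual_seed(0))

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32)
EPOCHS = 200   # total epochs, as in the experimental setup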
To show that the 4D loss function can improve the performance, we compared DeepCoral-ResNet with the regular loss function against the same model with the dynamic loss function. As we can tell from Figure 7, the dynamic loss function not only converges faster but also gives a smoother curve. More importantly, the concept of the 4D loss can be generalized to other distribution measurements by combining them dynamically.

[Figure 7. Dynamic loss vs. regular loss: test accuracy over 200 epochs for the dynamic and regular loss functions.]
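The surviving text does not give a closed-form definition of 4D, so the following is only one plausible reading of the "dynamical combination" idea: two domain distances (here the MMD and CORAL measures sketched above, passed in as scalars) mixed with a weight that evolves as in (4). This is our illustration, not the paper's verbatim formulation.

import math

def dual_dynamic_domain_distance(mmd2_val: float, coral_val: float,
                                 epoch: int) -> float:
    """Hypothetical 'dual dynamic' combination: the weight shifts from one
    domain distance toward the other over the course of training."""
    w = math.tanh(0.02 * epoch)   # dynamic weight, as in (4)
    return (1.0 - w) * mmd2_val + w * coral_val

for epoch in (1, 100, 200):
    print(epoch, dual_dynamic_domain_distance(0.8, 0.3, epoch))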
VI. CONCLUDING REMARKS
First of all, recycling is an essential process for our Earth. Pollution has caused a number of species extinctions, and the number is still increasing.

Secondly, DL is one of the most powerful approaches for many computer vision tasks. However, most DL methods rely heavily on big data and computational power to deliver state-of-the-art performance. In other words, big data is not only the power of DL but also its limitation. To address this issue, transfer learning has attracted more and more attention in the past few years, and many TL algorithms have proven successful. As Andrew Ng suggested at NIPS 2016, TL will become a main direction of DL in the future.

Finally, in this waste sorting experiment, we first showed that our TL models achieve better performance than all existing models built on TrashNet. Then, the novel domain loss function 4D proposed by us has shown the potential to benefit TL models significantly through a more accurate domain loss measurement. In the future, a few ideas could potentially push the results to an even higher level. First, GAN-based data augmentation might perform better than traditional data augmentation techniques. Then, other metrics that calculate the distance between two different domains could also enhance the performance. Lastly, the models built in this experiment used labeled target data for training; other TL methods do not require labeled target data for training, which might be more helpful for real-world problems that lack adequate labeled data.

ACKNOWLEDGMENT

This research was partially supported through Embry-Riddle Aeronautical University's Faculty Innovative Research in Science and Technology (FIRST) Program.

REFERENCES

[1] H. Song, R. Srinivasan, T. Sookoor, and S. Jeschke, Smart Cities: Foundations, Principles and Applications. Hoboken, NJ: Wiley, 2017.
[2] G. T. Reddy, M. P. K. Reddy, K. Lakshmanna, D. S. Rajput, R. Kaluri, and G. Srivastava, "Hybrid genetic algorithm and a fuzzy logic classifier for heart disease diagnosis," Evolutionary Intelligence, vol. 13, no. 2, pp. 185–196, 2020.
[3] H. Song, D. Rawat, S. Jeschke, and C. Brecher, Cyber-Physical Systems: Foundations, Principles and Applications. Boston, MA: Academic Press, 2016.
[4] G. Dartmann, H. Song, and A. Schmeink, Big Data Analytics for Cyber-Physical Systems: Machine Learning for the Internet of Things. Elsevier, 2019.
[5] G. T. Reddy, M. P. K. Reddy, K. Lakshmanna, R. Kaluri, D. S. Rajput, G. Srivastava, and T. Baker, "Analysis of dimensionality reduction techniques on big data," IEEE Access, vol. 8, pp. 54776–54788, 2020.
[6] Y. Sun, H. Song, A. J. Jara, and R. Bie, "Internet of things and big data analytics for smart and connected communities," IEEE Access, vol. 4, pp. 766–773, 2016.
[7] Z. Lv, H. Song, P. Basanta-Val, A. Steed, and M. Jo, "Next-generation big data analytics: State of the art, challenges, and future research topics," IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 1891–1899, 2017.
[8] A. Ghorani-Azam, B. Riahi-Zanjani, and M. Balali-Mood, "Effects of air pollution on human health and practical measures for prevention in Iran," Journal of Research in Medical Sciences, vol. 21, 2016.
[9] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[10] M. Yang and G. Thung, "Classification of trash for recyclability status," CS229 Project Report, 2016.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
[13] X. Yue, Y. Liu, J. Wang, H. Song, and H. Cao, "Software defined radio and wireless acoustic networking for amateur drone surveillance," IEEE Communications Magazine, vol. 56, no. 4, pp. 90–97, April 2018.
[14] R. A. Aral, S. R. Keskin, M. Kaya, and M. Haciomeroglu, "Classification of TrashNet dataset based on deep learning models," in 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 2058–2062.
[15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[27] M. Ghifary, W. B. Kleijn, and M. Zhang, "Domain adaptive neural networks for object recognition," in Pacific Rim International Conference on Artificial Intelligence. Springer, 2014, pp. 898–904.
[28] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur, "Optimal kernel choice for large-scale two-sample tests," in Advances in Neural Information Processing Systems, 2012, pp. 1205–1213.