AR Dynamic Image Recognition Technology Based on a Deep Learning Algorithm
This work was supported by the Natural Science Foundation of China under Grant U1904119 (Research on the perception of foreign object invasion in airport clearance areas based on one-class learning) and Grant 51705472 (Balancing Dynamics and Optimization of Mixed-model Assembly Line Networks for Complex Products based on Digital Twin).
ABSTRACT Augmented reality is a research hotspot that developed out of virtual reality, and its friendly human-computer interaction interface gives the technology broad application prospects. Convolutional neural networks, a core technique of deep learning, are widely used in computer vision and have become an important tool for dynamic image recognition tasks. Combining deep learning with traditional machine learning techniques, this paper uses a convolutional neural network to extract features from image data; a plain convolutional neural network takes only the last layer of features and recognizes them with a softmax recognizer. This paper instead combines a convolutional neural network, which learns good feature information, with ensemble learning, which has a good recognition effect. In recognition tasks on the MNIST database and the CIFAR-10 database, comparison experiments were performed by adjusting the hierarchical structure, activation function, descent algorithm, data enhancement, pooling choices, and number of feature maps of the improved convolutional neural network. The convolutional neural network uses a pooling size of 3×3, more kernels (64 and above), small receptive fields (2×2), and a deeper hierarchical structure, together with the ReLU activation function, gradient descent with momentum, and an enhanced data set. The results show that under these experimental conditions, the dynamic image recognition error rate drops very low on the MNIST database, and the error rate on the CIFAR-10 database is also satisfactory.
INDEX TERMS Dynamic image recognition, deep learning, CNN-XGBoost model, augmented reality, ensemble learning.
This improves the recognition speed and accuracy of the overall model [10,11]. Related scholars have proposed the Deep Belief Network (DBN), which stacks a series of Restricted Boltzmann Machines (RBMs), uses unsupervised layer-by-layer greedy training to extract features, obtains a multi-layer deep network structure, and then applies supervised fine-tuning [12-14]. This research solved the problem of vanishing gradients in deep network training and started the upsurge of deep learning research [15,16]. GoogLeNet uses a modular network structure and divides the entire Inception network into 9 modules [17]. By increasing the depth and width of the model, it not only realizes a sparse network structure but also exploits the high computing performance of dense matrices to achieve precise identification and detection [18-20]. The GoogLeNet network uses average pooling instead of a fully connected layer, and two auxiliary softmax branches are added to the network to propagate gradients forward, avoiding gradient disappearance during model training [21,22]. AlexNet uses the ReLU activation function, which not only fundamentally alleviates the vanishing-gradient problem of deep networks but also greatly speeds up convergence [23]. Since the ReLU function suppresses the vanishing-gradient problem well, AlexNet does not use the "pre-training + fine-tuning" method but adopts fully supervised training [24]. AlexNet also extends the LeNet-5 structure, adding a Dropout layer and an LRN layer to reduce overfitting and enhance generalization [25,26].

Image recognition technology recognizes targets in an image; that is, it uses computer technology to simulate the human senses and complete the process of image recognition and understanding [27]. Recognition of target objects in images is a key research direction in the field of image recognition and has been widely applied in security, transportation, and the Internet [28]. Target recognition tasks can be divided into object recognition and object detection [29]. Object recognition only needs to describe the characteristics of the target object in the image, while object detection must obtain not only the feature description of the object but also its specific location [30,31]. Therefore, in addition to characterizing the target, object detection also requires analysis of the object structure [32]. Object recognition mainly focuses on feature learning. Relevant scholars applied the Bag of Words (BoW) model from text recognition to image object recognition and proposed a visual bag-of-words model for image recognition [33-35]. Through low-level feature extraction and feature coding, the representation becomes more discriminative and robust; the feature expression of the whole image is then obtained through a feature aggregation operation [36], and finally a support vector machine performs the recognition. The image recognition algorithm based on the bag-of-words model uses the Harris-Laplace operator and the Laplacian-of-Gaussian operator to perform corner detection and edge detection on the edge and texture features of the image, respectively, and extracts the low-level features of the image [37,38]. In addition to the Scale Invariant Feature Transform (SIFT) algorithm for feature description, the local feature descriptor Spin Image is also used to summarize the two-dimensional coordinate distribution histogram around the feature points [39,40]. The algorithm model also adopts a dense feature extraction method based on a fixed grid to extract features at multiple scales [41].

This paper conducts experiments on the MNIST database and the CIFAR-10 database covering convolution kernel size and number, pooling size and method, parameter update algorithm, activation function, and data enhancement, and analyzes the results. Specifically, the technical contributions of this article can be summarized as follows:

First: We combine the multi-layer features of the convolutional neural network model in deep learning with the traditional machine learning technique, the XGBoost algorithm. The features extracted by the convolutional neural network are serially fused, and Principal Components Analysis (PCA) is used to reduce their dimension.

Second: In the dynamic image recognition experiments on these two databases, the number of selected kernels is 64, the receptive field is 3×3, stochastic pooling of size 3×3 is used, Stochastic Gradient Descent (SGD) with momentum is the optimization algorithm, ReLU is the activation function, the amount of data is increased, and a 5-layer deep convolutional neural network is used, achieving a good recognition effect.

The rest of this article is organized as follows. Section II analyzes the related theories and technologies of augmented reality dynamic image recognition and deep learning. Section III constructs the CNN-XGBoost augmented reality dynamic image recognition model. Section IV conducts simulation experiments and analyzes the results. Section V summarizes the paper.

II. RELEVANT THEORIES AND TECHNOLOGIES OF AUGMENTED REALITY DYNAMIC IMAGE RECOGNITION AND DEEP LEARNING

A. Key technologies of augmented reality
The structure of the augmented reality system is shown in Figure 1. It is implemented by a group of closely linked software and hardware components working in real time [42]. On the one hand, the image collected by the camera is displayed directly on the display device, presenting the real scene to the user; the image of the virtual object generated by the computer is also transferred to the display device [43,44]. In the integration of virtual and real, the precise alignment of virtual and real scenes depends on the support of the registration system [45]. Finally, the user is presented with a fused virtual-real scene.
FIGURE 1. Structure of the augmented reality system: an HD camera and image preprocessing feed the image recognition and multi-eye vision tracking/registration subsystems, a virtual scene generation system produces the virtual object model, coordinate calibration relates the real and virtual camera coordinates, and the virtual-real synthesis system drives the augmented reality display and human-computer interaction systems.
1) Tracking technology
In an augmented reality system, the picture the user sees changes as the viewing angle changes, so the system must accurately track the user's location, line of sight, and other information in real time. The performance of the tracking technology determines the performance of the augmented reality system.

Four sensor-based tracking technologies are commonly used in augmented reality systems: magnetic field tracking, optical tracking, acoustic tracking, and inertial tracking.

A magnetic field tracking system is usually composed of a control component, a signal transmitter, and a signal receiver. The transmitter and receiver are built from mutually orthogonal electromagnetic induction coils. The signal generator produces a magnetic field through its coil, and the receiver senses the field and generates a corresponding induced current. From the receiver's current signal, the algorithm in the control unit calculates the position and direction of the tracking target relative to the receiver.

Magnetic field tracking is not limited by line of sight or obstacles and, apart from conductive and magnetic objects, is free from interference by surrounding objects; its refresh rate is high and its real-time performance good. In addition, the sensing device in a magnetic field tracking system is small and light, which is convenient for users. Magnetic field tracking is therefore mainly applied in small-area augmented reality applications without conductive or magnetic objects.

In optical tracking technology, the light source and photosensitive device are diverse: the photosensitive device can be an ordinary camera or a photodiode. Because the transmission medium of an optical tracking system is an optical signal, signal reception is fast and the refresh rate high, which suits occasions with strict real-time requirements. However, optical tracking requires an unobstructed path between the sensor and the optical element, and optical tracking systems are relatively expensive.
Acoustic tracking technology uses ultrasound to track the target position. Compared with magnetic field tracking, acoustic tracking is immune to magnetic interference, and the system costs much less than other tracking systems. However, as with optical tracking, the reflection and diffraction of sound waves by obstacles can affect the accuracy of the system. In addition, because the propagation speed of ultrasonic waves is low, the data refresh rate of the system is low and its real-time performance poor.

Inertial tracking technology tracks with inertial sensors. A gyroscope measures the three-degree-of-freedom rotational motion of the tracking target to determine the orientation of the head, while an accelerometer measures the motion acceleration of the head to determine its position. The inertial tracking device is light and portable, suitable for dynamic and outdoor tracking; combined with GPS, it can achieve a good outdoor tracking effect. But inertial tracking errors accumulate, so the accuracy is not high, and the equipment is relatively expensive.

2) Display technology
A basic problem in the design of augmented reality systems is the fusion of virtual information with real scenes. The final effect of the augmented reality system is displayed to the user through various means, and the display effect determines the user's intuitive experience; display technology is therefore very important in augmented reality systems.

In video see-through display, the helmet blocks the user's direct line of sight, and one or two cameras shoot the real scene. The camera video and computer graphics are combined by a scene synthesizer, and the fused result is transmitted to the display in front of the user. In optical see-through display, the user sees through the synthesizer not only the real scene in front but also the virtual image reflected by the synthesizer.

Video see-through display offers many positioning methods, a wider field of view, and more flexible scene synthesis, allowing delay matching and brightness matching. Optical see-through display offers high resolution, good safety, and a simple structure, and requires no compensation for visual deviation.

The simplest and most widely used display method in augmented reality systems is the ordinary monitor; its schematic diagram is shown in Figure 2. On an ordinary display, the real scene captured by the camera and the virtual information generated by the computer are combined and shown together.
FIGURE 2. Schematic diagram of display-based augmented reality: the camera image passes through image recognition, image object tracking and management, and image event management modules; gaze, gesture, and image interaction events trigger registered response functions; the graphic system synthesizes virtual objects with the captured scene; and the enhanced scene images are shown on the monitor.
In this way, the error is continuously minimized until the set expectations are reached.

2) Deep learning
With the deepening of shallow learning, the SVM model developed fastest. Because shallow models have only one hidden layer or none at all, they are easy to train and are used for simple or constrained problems, but their ability to handle complex problems such as sound, natural language, and images is limited.

Deep learning is a newer branch of machine learning. Its essence is that, through a hierarchical structure with staged information processing, unsupervised feature extraction, pattern analysis, and classification can be explored; it uses a multi-layer neural network structure to realize the machine learning algorithm. When training a deep network, since a superposition of linear functions is still a linear function, the hidden-layer neurons must use nonlinear activation functions to increase the representational ability of the deep network. In the field of dynamic image recognition, deep network learning can be understood as follows: the first layer learns edge features from image pixels; the second layer learns contour, edge, and corner features of the target; and higher layers abstract more essential and complex features from these. Deep learning trains on data, and its final purpose is feature learning and classification recognition.

Compared with traditional machine learning methods, deep learning is very practical in the face of massive data. Deep learning methods can reduce model bias through more complex models and thereby improve the accuracy of statistical estimation. In addition, deep learning is an end-to-end paradigm that discards intermediate steps built on hand-crafted rules and can transfer learned prior knowledge to other models. These advantages make deep learning methods well suited to dynamic image recognition.

D. Deep learning model
1) Multi-layer perceptron
The Multi-Layer Perceptron (MLP) is also called an artificial neural network. Besides the input and output layers, it can have many hidden layers in the middle, and adjacent MLP layers are fully connected. The bottom layer is the input layer, the middle layers are hidden layers, and the last is the output layer. The input-layer neurons receive information; for example, if an n-dimensional vector is input, there are n input neurons. The hidden-layer neurons process the input information and are fully connected to the input layer.

2) Capsule Network
A capsule is a carrier containing multiple neurons, each representing an attribute of a specific entity appearing in the image. These attributes can include many types of instantiation parameters, such as pose (position, size, orientation), deformation, velocity, hue, and texture. One special attribute of a capsule is the existence of an instance of a certain category in the image: the magnitude of its output is the probability that the entity exists.

A vector in mathematics has a direction and a length, and a capsule likewise has a "length" and a "direction". Suppose a capsule represents the human eye, a "human-eye capsule": its length then represents the probability that an eye exists at a certain position in the image, and its direction represents parameters of the eye, such as position, rotation angle, and sharpness.

CapsNets are composed of n sub-networks (capsules). Each capsule is dedicated to an individual task, and a capsule itself is realized by a multi-layer network. The output vector includes the state information of the object and the probability of its type. The output parameters of a lower-layer capsule are converted into the higher-layer capsule's prediction of the entity state; if the predictions are consistent, the parameters of that layer are output.

The CapsNet model also uses a convolutional structure for feature extraction, but the PrimaryCaps (main capsule) layer divides the data into multiple units under multiple channels, generating a vector that retains spatial information for each unit. This structure replaces the pooling layer of a traditional convolutional network and effectively reduces information loss. The last layer is similar to a fully connected layer, but each neuron is transformed into a capsule structure for classification output; it is called the DigitCaps layer.
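The paper does not state the capsule nonlinearity explicitly; as a hedged illustration of how a capsule vector's length can be kept in [0, 1) so it reads as an existence probability, the following is a minimal sketch of the standard squash function from the CapsNet literature (the example vector is purely illustrative):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Standard CapsNet squash: shrink a capsule vector so its length lies
    in [0, 1) while preserving its direction."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    norm = np.sqrt(norm_sq + eps)
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

# A hypothetical "human-eye capsule": the output length is the probability
# that an eye is present; the direction encodes pose parameters.
s = np.array([2.0, -1.0, 0.5])
v = squash(s)
print(np.linalg.norm(v))  # < 1, interpretable as an existence probability
```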
3) Convolutional neural network
The Convolutional Neural Network (CNN) was originally inspired by the biological vision system and was designed as a multi-layer perceptron model for recognizing two-dimensional data. A CNN is essentially the combination of a feature extractor and a classifier: through continuous feature learning on the input image, it obtains a set of feature vectors closest to the meaning of the image, which are then fed to the tail classifier to classify and identify the data.

Figure 3 shows the overall structure of the convolutional neural network. The input layer is usually a matrix, such as an image. From the perspective of a feedforward network, the convolutional and pooling layers can be regarded as hidden layers with special functions, and the other layers besides the input layer are ordinary hidden layers. These hidden layers are computed according to different rules, and a learning (training) process is usually needed to tune most of the weight parameters.
FIGURE 3. CNN overall structure diagram: a 400×400 input passes through a convolutional layer (300×300 feature maps) and a pooling layer (80×80 feature maps) for feature learning before the output layer.
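As a concrete illustration of this convolution-pooling-classifier composition, the following is a minimal PyTorch sketch; the channel counts and layer sizes are illustrative assumptions, not the exact architecture evaluated later in the paper:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Feature extractor (conv + pool blocks) followed by a softmax classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # 32x32 -> 16x16
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, num_classes),     # logits; softmax at inference
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: one CIFAR-10-sized batch.
logits = SmallCNN()(torch.randn(4, 3, 32, 32))
probs = logits.softmax(dim=1)
```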
PCA maps the data to lower dimensions and identifies the most important features. Suppose the dimension of the feature vector is 2, so the original data has two features. The PCA method computes the covariance matrix of the data set and then the eigenvalues and eigenvectors of that matrix. The eigenvector corresponding to the small eigenvalue is the secondary linear component, and the feature dimension can be reduced to 1.

The main idea of PCA is to transform the data from the original coordinate system to a new coordinate system determined by the data itself; that is, the n-dimensional features are mapped to k-dimensional features. These k dimensions are brand-new orthogonal features, known as principal components. The first principal component is chosen along the direction with the largest data variation (i.e., the largest variance); the second principal component is the direction with the second-largest variation that is orthogonal to the first, and so on. Most of the variance is contained in the first k principal components, and the remaining components are almost zero. By selecting the matrix of eigenvectors corresponding to the k largest eigenvalues (that is, the largest variances), the original data can be transformed into the new space. The flow chart of principal component analysis for a multivariable series is shown in Figure 4.

FIGURE 4. Flow chart of principal component analysis for a multivariable series: standardize the selected image data series, compute the correlation matrix with its eigenvalues and eigenvectors, keep the eigenvectors whose cumulative contribution rate exceeds 0.9, check factor loads and eigenvalues against the thresholds (0.8 and 0.03), discard the remaining factor series, and output the principal components.
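The covariance-eigendecomposition procedure just described can be sketched directly in NumPy (a minimal sketch assuming rows are samples; in practice a library routine such as sklearn's PCA would be used):

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k largest-variance directions
    return Xc @ top                         # coordinates in the new space

# 2-D toy data reduced to 1 dimension, as in the example above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 1.0]]) + 0.1 * rng.normal(size=(100, 2))
Z = pca_reduce(X, k=1)
```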
C. XGBoost algorithm
The base learner of the XGBoost algorithm can be a linear recognizer. Whereas the GBDT algorithm uses only the first derivative during optimization, the XGBoost algorithm performs a second-order Taylor expansion of the loss function, introducing both the first and second derivatives. At the same time, the cost function introduces regularization to control the model complexity. Suppose data set $D$ has $n$ samples and $m$ features, expressed as

$$D = \{(X_i, y_i)\} \quad (|D| = n,\ X_i \in \mathbb{R}^m,\ y_i \in \mathbb{R}) \tag{2}$$

We use $K$ functions to predict the final output:

$$\hat{y}_i = \sum_{k=0}^{K-1} f_k(X_i), \quad f_k \in F \tag{3}$$

where $F$ represents the set of regression decision trees:

$$F = \{f(X) = w_{q(X)}\} \quad (q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T) \tag{4}$$

Here $q(X)$ represents the structure of the tree and maps a sample to its leaf node, $w$ is the vector of leaf weights, and $w_{q(X)}$ is the prediction of the regression decision tree for the sample. $T$ is the number of leaf nodes of the tree, and each $f_k$ corresponds to an independent tree structure with its leaf weights. Unlike a recognition tree, each leaf node of a regression tree holds a continuous value. For a given example, the tree structure (that is, a given $q$) routes it to the corresponding leaf node, and the final prediction is computed by summing the weights $w$ of the corresponding leaf nodes.

When building the model, we want the loss function to be as small as possible. It is impossible to enumerate all tree structures, so a greedy method is used: each time a leaf node is split, the reduction in the loss function before and after the split is computed, and the split that reduces it the most is chosen. Let $I_L$ and $I_R$ be the left and right sample sets after splitting a leaf node, with $I = I_L \cup I_R$, and let $g_i$ and $h_i$ be the first and second derivatives of the loss for sample $i$. The loss reduction after splitting is

$$L_{split} = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma \tag{5}$$

where $\lambda$ and $\gamma$ are regularization coefficients.
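As a hedged illustration of Eq. (5), the following is a minimal sketch of the split-gain computation (not XGBoost's actual implementation, which additionally uses the sorted block structure and approximate split finding described below):

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Loss reduction of Eq. (5) for splitting one leaf.
    g, h: first/second derivatives of the loss for the samples in the leaf.
    left_mask: boolean array marking samples routed to the left child."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + lam)
    return 0.5 * (score(g[left_mask], h[left_mask])
                  + score(g[~left_mask], h[~left_mask])
                  - score(g, h)) - gamma

# Toy example: squared-error loss gives g = prediction - target, h = 1.
y, pred = np.array([1.0, 1.2, 3.9, 4.1]), np.zeros(4)
g, h = pred - y, np.ones(4)
print(split_gain(g, h, left_mask=np.array([True, True, False, False])))
```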
D. Combination of CNN and XGBoost
The convolutional neural network extracts feature information from image data very well: by feeding the image data directly into the convolutional neural network, the features carrying the important information in the image can be extracted. In terms of recognition, however, the convolutional neural network is not optimal. It uses only the softmax recognizer, which assigns a high value to one neuron and low values to all the others, polarizing the result. In practical applications this reduces the ability to correct errors, especially for easily confused images.
The recognition performance of ensemble learning is better: its recognition accuracy is high, it does not overfit easily, and it generalizes well. However, it is difficult for ensemble learning alone to learn the complex, deformable features of image data. Therefore, this article combines the convolutional neural network, which learns good features, with ensemble learning, which recognizes them well, so that the two complement each other and improve the accuracy of the recognition task.

In this paper, the XGBoost algorithm is selected from ensemble learning. Its high operating efficiency and high precision make XGBoost an important tool in target recognition tasks; it is a very sophisticated algorithm.

1) Reasons for the high operating efficiency of the XGBoost algorithm
The XGBoost implementation uses parallel processing and is dramatically faster than the standard gradient boosting algorithm. Gradient boosting is inherently sequential: the next base learner depends on the previous one, so the training of the base learners themselves cannot be parallelized. The most time-consuming step of building a decision tree is selecting the features and feature values at which to split the nodes. Before iterating, XGBoost sorts the data by feature and stores it in a block structure, with each block stored in compressed column format and each column sorted by its feature value. Each iteration of model building reuses this block structure, which reduces the amount of computation and allows the gain of each feature to be computed in parallel.

A greedy algorithm that considers every possible split point of every feature value is too inefficient, so XGBoost uses an approximate algorithm to speed up splitting. The algorithm first proposes candidate split points based on percentiles of the feature distribution, then maps the continuous features into the regions delimited by these candidate points, aggregates the statistics, and selects the best split scheme from the aggregated statistics. An important step in the approximate algorithm is proposing the candidate split points; to distribute them evenly over the data, percentiles of the feature values are usually chosen as candidates.

2) Reasons for the high accuracy of the XGBoost algorithm
The XGBoost algorithm adds regularization to the loss function to control the complexity of the model and uses pruning to improve the model's generalization ability, while the second-order Taylor expansion of the loss function improves the efficiency and accuracy of the model solution. The XGBoost algorithm also has built-in cross-validation, which makes it easier to choose good hyperparameters so that the model achieves better results.

Using the trained AlexNet model, we feed 45,000 images randomly drawn from the CIFAR-10 training set into the model and save the features of the last three layers of the network. These last three layers of features are serially fused, and the PCA algorithm is used for feature dimensionality reduction.

The XGBoost algorithm is then trained on the reduced features, and the parameters of the XGBoost recognizer are obtained through cross-validation. As with the AlexNet model, 45,000 images of the CIFAR-10 data set are used for the training set, 5,000 for the validation set, and 10,000 for the test set.
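A hedged sketch of this fuse-reduce-recognize pipeline is given below; the feature dimensions, sample count, and XGBoost parameters are illustrative assumptions (the paper tunes the recognizer by cross-validation), and the random arrays stand in for features actually exported from a trained AlexNet:

```python
import numpy as np
from sklearn.decomposition import PCA
from xgboost import XGBClassifier

# Hypothetical stand-ins for the last three layers of CNN features
# (the paper uses 45,000 CIFAR-10 training images; reduced here for brevity).
rng = np.random.default_rng(0)
n = 1000
f1 = rng.normal(size=(n, 1024))   # e.g., flattened last conv layer
f2 = rng.normal(size=(n, 512))    # e.g., first fully connected layer
f3 = rng.normal(size=(n, 256))    # e.g., second fully connected layer
labels = rng.integers(0, 10, size=n)

# Serial fusion: concatenate the three feature matrices sample-wise.
fused = np.concatenate([f1, f2, f3], axis=1)

# PCA dimensionality reduction of the fused features.
reduced = PCA(n_components=128).fit_transform(fused)

# XGBoost recognizer on the reduced features (parameters would be tuned
# by cross-validation in practice).
clf = XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1)
clf.fit(reduced, labels)
print(clf.score(reduced, labels))
```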
IV. EXPERIMENT AND ANALYSIS

A. Selection of the number of cores and the size of the receptive field
In a convolutional layer, the number of kernels (filters) determines the number of feature maps: the more kernels, the more feature maps are extracted, the larger the feature space the network can represent, the stronger its representational ability, and the more accurate the final identification.
FIGURE 5. Train and test error rates of different core number designs (three-layer convolution structures from 8-8-32 to 64-64-128): (a) on the MNIST database; (b) on the CIFAR-10 database.
To study the influence of the number of kernels on the performance of the convolutional neural network, we designed several structural models based on CNN-XGBoost. Keeping the hierarchical structure and other factors unchanged, we set the three-layer convolution structure to 8-8-32, 16-16-32, 32-32-32, 32-32-64, 64-64-64, and 64-64-128 and ran experiments on the two databases. The experimental results on the MNIST database are shown in Figure 5(a), and those on the CIFAR-10 data set in Figure 5(b).

To study the effect of the size of the receptive field (convolution kernel) on the performance of the convolutional neural network, we adjusted the size of the three-layer receptive field of the CNN-XGBoost structural model to 10×10, 9×9, 8×8, 7×7, 5×5, 3×3, and 2×2, kept the other factors unchanged, and experimented on the two databases. The experimental results on the MNIST database are shown in Figure 6(a), and those on the CIFAR-10 data set in Figure 6(b).
FIGURE 6. Train and test error rates of different receptive field sizes (10×10 down to 2×2): (a) on the MNIST database; (b) on the CIFAR-10 database.
The best test results on the CIFAR-10 database are shown in Figure 8(a), and the test results for different pooling sizes are shown in Figure 8(b). For max pooling and mean pooling, the larger the pooling size used, the better the effect. For the stochastic pooling method, the optimal size is 5×5: a smaller pooling size causes overfitting, while an excessively large pooling size increases the error because downsampling admits too much noise.
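Stochastic pooling samples one activation per window with probability proportional to its value; the paper does not spell out the procedure, so the following is a minimal sketch of the standard formulation (the window size and input map are illustrative):

```python
import numpy as np

def stochastic_pool(x, size, rng):
    """Stochastic pooling of a 2-D map: in each size x size window, sample
    one activation with probability proportional to its (non-negative) value."""
    h, w = x.shape[0] // size, x.shape[1] // size
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            win = x[i*size:(i+1)*size, j*size:(j+1)*size].ravel()
            p = win / win.sum() if win.sum() > 0 else np.full(win.size, 1 / win.size)
            out[i, j] = rng.choice(win, p=p)
    return out

rng = np.random.default_rng(0)
fmap = np.abs(rng.normal(size=(10, 10)))         # post-ReLU activations are non-negative
pooled = stochastic_pool(fmap, size=5, rng=rng)  # 10x10 -> 2x2
```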
FIGURE 8. CIFAR-10 database recognition results: (a) error rates of the mean, max, and stochastic pooling methods; (b) error rates of different pooling sizes (2×2, 3×3, 4×4, 5×5).
FIGURE 9. Error rate versus training epoch for the SGD, SGD with momentum, ADAGRAD, and NAG optimization algorithms: (a) on the MNIST database; (b) on the CIFAR-10 database.
The experiments show that SGD without momentum gradually reduces the error, but its descent speed and final effect are mediocre. SGD with momentum and NAG are both better than plain SGD, although their gradient descent is slower in the early stage, and SGD with momentum outperforms NAG. The descent of ADAGRAD is very stable, so its curve is smoother and its effect better.
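The momentum update preferred here is simple to state; the sketch below uses the momentum 0.9 and initial learning rate 0.01 mentioned in the conclusion, with a toy quadratic loss standing in for the network loss:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: accumulate a velocity, then move along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy quadratic loss L(w) = 0.5 * ||w||^2, so grad = w.
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    w, v = sgd_momentum_step(w, grad=w, velocity=v)
print(w)  # approaches the minimum at the origin
```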
D. Optimization of layer selection
To study the impact of each layer on the performance of the convolutional neural network, we removed some layers. CNN-XGBoost(a) deletes the fully connected layer of CNN-XGBoost; CNN-XGBoost(b) deletes the third convolution-pooling layer; CNN-XGBoost(c) deletes the third and fourth convolution-pooling layers; and CNN-XGBoost(d) deletes the third and fourth convolution-pooling layers together with the fully connected layer.

Using the above five architectures and keeping the other parameters the same as CNN-XGBoost, we performed recognition experiments on the MNIST database; the results are shown in Figure 10(a). With the same five architectures we performed recognition experiments on the CIFAR-10 data set; the results are shown in Figure 10(b).
FIGURE 10. Train and test error rates of the different hierarchical structures CNN-XGBoost and CNN-XGBoost(a)-(d): (a) on the MNIST database; (b) on the CIFAR-10 database.
From the experimental results, deleting the top fully connected layer does not greatly affect the results, while deleting one convolution-pooling layer causes a marked drop, showing that the convolution-pooling layers do influence the results. When we delete two convolution-pooling layers, the error grows especially fast. The experiments thus verify that the depth of the network strongly influences the results: the deeper the network, the higher the accuracy, and the learned features also differ. However, the deeper the network, the more parameters it has and the greater its complexity, so the choice of additional layers must also be optimized for the data set.

E. Change of activation function
Activating neurons requires an activation function; commonly used ones are sigmoid, tanh, ReLU, LReLU, and PReLU. By changing the activation function of CNN-XGBoost, experiments were conducted on the two data sets; the results are shown in Figure 11(a) and Figure 11(b).
FIGURE 11. Recognition results of the sigmoid, tanh, ReLU, LReLU, and PReLU activation functions: (a) on the MNIST database; (b) on the CIFAR-10 database.
The sigmoid activation function was a popular choice in the past, especially in shallow neural networks, because it models the activation of neurons very well; however, sigmoid causes two problems: its saturation regions make gradients vanish, and its output is not zero-centered.
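For reference, the activation functions compared in this experiment are simple elementwise maps; a minimal sketch follows (the PReLU slope a is learned per channel during training and is fixed here only for illustration):

```python
import numpy as np

def sigmoid(x):          # saturates for large |x|; output in (0, 1), not zero-centered
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):             # zero-centered but still saturating
    return np.tanh(x)

def relu(x):             # no saturation for x > 0; cheap and fast to converge
    return np.maximum(0.0, x)

def lrelu(x, a=0.01):    # leaky ReLU: small fixed negative slope keeps units alive
    return np.where(x > 0, x, a * x)

def prelu(x, a):         # parametric ReLU: the slope a is learned (fixed here)
    return np.where(x > 0, x, a * x)

x = np.linspace(-3, 3, 7)
print(relu(x), lrelu(x), prelu(x, a=0.25))
```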
FIGURE 12. Variation of error with data enhancement as the number of training cases increases: (a) on the MNIST database; (b) on the CIFAR-10 database.
During the training process, it can be seen that as the amount of training on the MNIST database and the CIFAR-10 database increases, the test error changes accordingly; increasing the data can therefore effectively prevent overfitting.
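The paper does not list its specific enhancement operations, so the following is a hedged sketch of typical label-preserving augmentations used to enlarge image training sets (random shifts and horizontal flips; the shift range is an illustrative assumption):

```python
import numpy as np

def augment(img, rng, max_shift=2):
    """Return a randomly shifted (and possibly horizontally flipped) copy of
    img - a label-preserving way to enlarge the training set."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    if rng.random() < 0.5:          # flipping is unsuitable for digits like MNIST's
        out = out[:, ::-1]
    return out

rng = np.random.default_rng(0)
batch = rng.random((8, 32, 32))                   # toy image batch
enlarged = np.stack([augment(im, rng) for im in batch])
```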
V. CONCLUSION
Augmented reality technology fuses the virtual and the real: it aims to accurately register computer-generated virtual information onto scene images collected in real time, forming an enhanced image for display to users and thereby enriching the user's sensory experience. This paper combines the convolutional neural network with the XGBoost algorithm from ensemble learning to make up for the loss of feature information caused by the traditional approach of recognizing with only the last layer of features: it fuses multi-layer features to retain more feature information and then applies the well-performing XGBoost recognizer. When choosing a pooling method for databases like MNIST and CIFAR-10, stochastic pooling is preferable, with an optimal size of 5×5: a smaller pooling size causes overfitting, while an excessively large one increases the error because downsampling admits too much noise. For max pooling and mean pooling, however, the smaller the pooling size used, the better the effect. Among the optimizers, ADAGRAD and SGD with momentum converge faster and perform better; the parameter optimization algorithm for a CNN is generally batch stochastic gradient descent with momentum, because it consistently converges faster to a better final value, with the momentum generally set to 0.9 and the initial learning rate to 0.01. As for the choice of layers, more layers are better, but the depth should be selected according to the size of the image data: for the MNIST and CIFAR-10 databases, considering space and time complexity, three convolution-pooling layers, a fully connected layer, and a softmax layer are optimal. To prevent overfitting and improve the generalization ability of the structure, the data sets of MNIST and CIFAR-10 were enlarged, and a good recognition effect was achieved.

REFERENCES
[1] F. Cheng, H. Zhang, W. Fan, et al., "Image recognition technology based on deep learning," Wireless Personal Communications, vol.102, no.2, pp.1917-1933, Jan. 2018.
[2] X. Wang, W. Zhang, X. Wu, et al., "Real-time vehicle type classification with deep convolutional neural networks," Journal of Real-Time Image Processing, vol.16, no.1, pp.5-14, Aug. 2019.
[3] P. R. Jeyaraj, E. R. S. Nadar, "Computer-assisted medical image classification for early diagnosis of oral cancer employing deep learning algorithm," Journal of Cancer Research and Clinical Oncology, vol.145, no.4, pp.829-837, Jan. 2019.
[4] S. Law, C. I. Seresinhe, Y. Shen, et al., "Street-Frontage-Net: urban image classification using deep convolutional neural networks," International Journal of Geographical Information Science, vol.34, no.4, pp.681-707, Dec. 2020.
[5] R. Hang, Q. Liu, D. Hong, et al., "Cascaded recurrent neural networks for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol.57, no.8, pp.5384-5394, Mar. 2019.
[6] Y. Qin, L. Bruzzone, B. Li, et al., "Cross-domain collaborative learning via cluster canonical correlation analysis and random walker for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol.57, no.6, pp.3952-3966, Jan. 2019.
[7] Y. Ariji, Y. Yanashita, S. Kutsuna, et al., "Automatic detection and classification of radiolucent lesions in the mandible on panoramic radiographs using a deep learning object detection technique," Oral Surgery, Oral Medicine, Oral Pathology and Oral Radiology, vol.128, no.4, pp.424-430, Oct. 2019.
[8] J. R. Hagerty, R. J. Stanley, H. A. Almubarak, et al., "Deep learning and handcrafted method fusion: higher diagnostic accuracy for melanoma dermoscopy images," IEEE Journal of Biomedical and Health Informatics, vol.23, no.4, pp.1385-1391, Jan. 2019.
[9] X. Zhu, Z. Li, X. Y. Zhang, et al., "Deep convolutional representations and kernel extreme learning machines for image classification," Multimedia Tools and Applications, vol.78, no.20, pp.29271-29290, Nov. 2019.
[10] X. He, Y. Chen, "Optimized input for CNN-based hyperspectral image classification using spatial transformer network," IEEE Geoscience and Remote Sensing Letters, vol.16, no.12, pp.1884-1888, May 2019.
[11] J. W. Kim, P. K. Rhee, "Image recognition based on adaptive deep learning," The Journal of The Institute of Internet, Broadcasting and Communication, vol.18, no.1, pp.113-117, Feb. 2018.
[12] S. Mahdizadehaghdam, A. Panahi, H. Krim, et al., "Deep dictionary learning: A parametric network approach," IEEE Transactions on Image Processing, vol.28, no.10, pp.4790-4802, May 2019.
[13] S. Zhou, Z. Xue, P. Du, "Semisupervised stacked autoencoder with cotraining for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol.57, no.6, pp.3813-3826, Jan. 2019.
[14] S. Li, W. Song, L. Fang, et al., "Deep learning for hyperspectral image classification: An overview," IEEE Transactions on Geoscience and Remote Sensing, vol.57, no.9, pp.6690-6709, Apr. 2019.
[15] J. M. Haut, M. E. Paoletti, J. Plaza, et al., "Visual attention-driven hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol.57, no.10, pp.8065-8080, Jun. 2019.
[16] S. Antholzer, M. Haltmeier, J. Schwab, "Deep learning for photoacoustic tomography from sparse data," Inverse Problems in Science and Engineering, vol.27, no.7, pp.987-1005, Sep. 2019.
[17] Y. Niu, Z. Lu, J. R. Wen, et al., "Multi-modal multi-scale deep learning for large-scale image annotation," IEEE Transactions on Image Processing, vol.28, no.4, pp.1720-1731, Apr. 2018.
[18] K. Z. Haider, K. R. Malik, S. Khalid, et al., "Deepgender: real-time gender classification using deep learning for smartphones," Journal of Real-Time Image Processing, vol.16, no.1, pp.15-29, Sep. 2019.
[19] Y. J. Heo, S. J. Kim, D. Kim, et al., "Super-high-purity seed sorter using low-latency image-recognition based on deep learning," IEEE Robotics and Automation Letters, vol.3, no.4, pp.3035-3042, Oct. 2018.
[20] J. Zhou, L. Y. Luo, Q. Dou, et al., "Weakly supervised 3D deep learning for breast cancer classification and localization of the lesions in MR images," Journal of Magnetic Resonance Imaging, vol.50, no.4, pp.1144-1151, Mar. 2019.
[21] Z. Gong, P. Zhong, Y. Yu, et al., "A CNN with multiscale convolution and diversified metric for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol.57, no.6, pp.3599-3618, Jan. 2019.
[22] W. Wu, H. Li, X. Li, et al., "PolSAR image semantic segmentation based on deep transfer learning—Realizing smooth classification with small training sets," IEEE Geoscience and Remote Sensing Letters, vol.16, no.6, pp.977-981, Jan. 2019.
[23] Y. Ariji, M. Fukuda, Y. Kise, et al., "Contrast-enhanced computed tomography image assessment of cervical lymph node metastasis in patients with oral cancer by using a deep learning system of artificial intelligence," Oral Surgery, Oral Medicine, Oral Pathology and Oral Radiology, vol.127, no.5, pp.458-463, May 2019.
[24] C. F. Higham, D. J. Higham, "Deep learning: An introduction for applied mathematicians," SIAM Review, vol.61, no.4, pp.860-891, Nov. 2019.
[25] T. S. Borkar, L. J. Karam, "DeepCorrect: Correcting DNN models against image distortions," IEEE Transactions on Image Processing, vol.28, no.12, pp.6022-6034, Jun. 2019.
[26] D. Ribli, A. Horváth, Z. Unger, et al., "Detecting and classifying lesions in mammograms with deep learning," Scientific Reports, vol.8, no.1, pp.1-7, Mar. 2018.
[27] R. R. Saritha, V. Paul, P. G. Kumar, "Content based image retrieval using deep learning process," Cluster Computing, vol.22, no.2, pp.4187-4200, Feb. 2019.
[28] P. Helber, B. Bischke, A. Dengel, et al., "EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol.12, no.7, pp.2217-2226, Jun. 2019.
[29] N. Shibata, M. Tanito, K. Mitsuhashi, et al., "Development of a deep residual learning algorithm to screen for glaucoma from fundus photography," Scientific Reports, vol.8, no.1, pp.1-9, Oct. 2018.
[30] C. Ju, A. Bibaut, M. van der Laan, "The relative performance of ensemble methods with deep convolutional neural networks for image classification," Journal of Applied Statistics, vol.45, no.15, pp.2800-2818, Feb. 2018.
[31] X. Lv, D. Ming, Y. Y. Chen, et al., "Very high resolution remote sensing image classification with SEEDS-CNN and scale effect analysis for superpixel CNN classification," International Journal of Remote Sensing, vol.40, no.2, pp.506-531, Sep. 2019.
[32] X. Yuan, P. He, Q. Zhu, et al., "Adversarial examples: Attacks and defenses for deep learning," IEEE Transactions on Neural Networks and Learning Systems, vol.30, no.9, pp.2805-2824, Jan. 2019.
[33] I. M. Baltruschat, H. Nickisch, M. Grass, et al., "Comparison of deep learning approaches for multi-label chest X-ray classification," Scientific Reports, vol.9, no.1, pp.1-10, Apr. 2019.
[34] F. Özyurt, T. Tuncer, E. Avci, et al., "A novel liver image classification method using perceptual hash-based convolutional neural network," Arabian Journal for Science and Engineering, vol.44, no.4, pp.3173-3182, Jul. 2019.
[35] K. Yasaka, H. Akai, A. Kunimatsu, et al., "Deep learning with convolutional neural network in radiology," Japanese Journal of Radiology, vol.36, no.4, pp.257-272, Mar. 2018.
[36] S. Bianco, L. Celona, P. Napoletano, et al., "On the use of deep learning for blind image quality assessment," Signal, Image and Video Processing, vol.12, no.2, pp.355-362, Aug. 2018.
[37] D. Bychkov, N. Linder, R. Turkki, et al., "Deep learning based tissue analysis predicts outcome in colorectal cancer," Scientific Reports, vol.8, no.1, pp.1-11, Feb. 2018.
[38] J. Dean, D. Patterson, C. Young, "A new golden age in computer architecture: Empowering the machine-learning revolution," IEEE Micro, vol.38, no.2, pp.21-29, Jan. 2018.
[39] Q. Mao, F. Hu, Q. Hao, "Deep learning for intelligent wireless networks: A comprehensive survey," IEEE Communications Surveys & Tutorials, vol.20, no.4, pp.2595-2621, Jun. 2018.
[40] P. Wang, H. Liu, L. Wang, et al., "Deep learning-based human motion recognition for predictive context-aware human-robot collaboration," CIRP Annals, vol.67, no.1, pp.17-20, May 2018.
[41] A. K. Singh, B. Ganapathysubramanian, S. Sarkar, et al., "Deep learning for plant stress phenotyping: trends and future perspectives," Trends in Plant Science, vol.23, no.10, pp.883-898, Aug. 2018.
[42] S. Pang, A. Du, M. A. Orgun, et al., "A novel fused convolutional neural network for biomedical image classification," Medical & Biological Engineering & Computing, vol.57, no.1, pp.107-121, Jul. 2019.
[43] Y. Chen, K. Zhu, L. Zhu, et al., "Automatic design of convolutional neural network for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol.57, no.9, pp.7048-7066, Apr. 2019.
[44] X. Liu, R. Zhang, Z. Meng, et al., "On fusing the latent deep CNN feature for image classification," World Wide Web, vol.22, no.2, pp.423-436, Jun. 2019.
[45] M. Andrews, M. Paulini, S. Gleyzer, et al., "End-to-End Physics Event Classification with CMS Open Data: Applying Image-Based Deep Learning to Detector Data for the Direct Classification of Collision Events at the LHC," Computing and Software for Big Science, vol.4, no.1, pp.1-14, Mar. 2020.

Shukui Bo received his Ph.D. in Cartography and Geographic Information System from the Chinese Academy of Sciences in 2007. He is currently a Professor in the School of Intelligent Engineering, Zhengzhou University of Aeronautics, Zhengzhou, China. His current research interests cover information extraction from remote sensing imagery, pattern recognition, and image processing.