
2016 15th International Conference on Frontiers in Handwriting Recognition

Cascading Training for Relaxation CNN on Handwritten Character Recognition

Li Chen, Song Wang, Wei Fan, Jun Sun, Satoshi Naoi


Fujitsu Research & Development Center, Beijing, China
{chenli, song.wang, fanwei, sunjun, naoi.satoshi}@cn.fujitsu.com

      
Abstract—With the development of deep learning, many difficult recognition problems can now be solved by deep learning models. For handwritten character recognition, the CNN is the most widely used model. In order to improve the performance of the CNN, many new models have been proposed, among which the relaxation CNN [35] is widely used. The relaxation CNN has a more complicated structure than the CNN, while its recognition time is the same. However, training a relaxation CNN takes much more time than training a CNN. In this paper, we propose cascading training for the relaxation CNN. Our method can train a relaxation CNN with better performance while using almost the same training time as a normal CNN. The experimental results show that the relaxation CNN trained by cascading training achieves state-of-the-art performance on handwritten Chinese character recognition.

Keywords-cascading training; character recognition; CNN;

I. INTRODUCTION

Since the groundbreaking paper published by Hinton in 2006 [1], deep learning has become the most popular topic in both industry and academia. Deep learning has great advantages in various tasks in the artificial intelligence field and is seen as an efficient way to simulate the processing of the human brain. Deep learning is also a very flexible technology, and many deep learning models have been proposed for different machine learning tasks, for example the Deep Belief Network (DBN) [2], PCA Network (PCANet) [3], Stacked Auto-Encoder (SAE) [4], [5], Deep Sparse Coding (DeepSC) [6] and Convolutional Neural Network (CNN) [7]. Moreover, we can easily improve the performance of deep learning models by extending the network scale and using more layers. Consequently, state-of-the-art performance on various tasks has been achieved by deep learning. Nowadays, many mature commercial applications are built on deep learning, such as Microsoft's speech recognition.

For image-based tasks, the CNN is the most popular model compared with other deep learning models. It has been successfully applied to different image detection and recognition tasks, such as image recognition [8], [9], human face recognition [10], human pose estimation [11] and character recognition [12], [13], [14], [15], [16], [17], [18].

The main reason for the success of the CNN is that the CNN can "understand" the image characteristics better than other models. In other words, the CNN is very suitable for modeling the structure of image data. An image can be seen as structured data in which each pixel is strongly related to its neighborhood and has little connection with pixels far away. The CNN reflects this locality principle very well: as shown in Fig. 1, a neuron of the CNN is only connected to the neurons in a small local area of the prior layer, which is called the local receptive field.

Figure 1. The convolutional layers of conventional CNN and R-CNN.

There is a long history of research on handwritten character recognition (digits, Chinese, Arabic, etc.) [19], [20], [21]. However, due to the varied appearance of characters (different writers, writing behavior and noise), handwritten character recognition is a very difficult task. As shown in Fig. 2, a conventional method often contains two main modules: feature extraction and a classifier. In previous research, people tried to design better features to improve the performance, such as the gradient feature [22], the contour feature [23] and image keypoints [24] (SIFT [25], SURF [26], etc.). The feature design determines the performance of the method to a great extent. For the classification, most mature classifiers can be used, for example the SVM [27], MQDF [28] and MLP [29].

However, although many researchers made great efforts on conventional methods in the past decades, the progress was slow and the performance was very hard to improve.

Figure 2. The comparison between conventional method and CNN.


Figure 3. The framework of cascading training for R-CNN.

With the rise of deep learning, some researchers turned their attention to the application of the CNN to handwritten character recognition, such as [13], [14], [15], [18]. These attempts prove that the CNN is a very powerful tool for the handwritten character recognition problem.

As shown in Fig. 2, the process of the CNN is obviously different from the conventional methods. The CNN is just like a "black box", and both the feature extraction and the classification are done automatically within the CNN model. In other words, we just need to prepare the training data for the CNN, and then it learns a very good way to recognize characters. We do not need to design features for different kinds of characters; instead, the only thing to do is to collect enough training samples for the CNN.

There are also many approaches which aim to improve the general performance of the CNN, introduced as follows.

• Stochastic pooling [30]: a stochastic selection of the output neuron is applied instead of the normal pooling process of the CNN. The probability of selecting a certain neuron is proportional to its value.

• Network In Network [31]: an extra convolutional operation (kernel size 1 × 1) is applied to each feature map.

• Maxout [32]: a different activation function is used. Extra weights are used to calculate different linear combinations of the same input neurons; the maximum values are then selected as the output neurons.

By using the above techniques, the CNN can achieve better performance on handwritten character recognition.

Besides the approaches above, the relaxation CNN (R-CNN) has recently been applied to different tasks with very promising improvement [10], [33], [34], [35]. As shown in Fig. 1, compared with the conventional CNN, the kernels of the R-CNN are not shared across the prior feature map; in other words, a different kernel is used for the convolutional calculation at each part of the feature map. This means that the R-CNN has more weights than the CNN at the convolutional layer. Clearly, with more weights, the R-CNN is able to represent more complicated samples. Please note that although the number of weights is increased in the R-CNN, the computational cost is the same as that of the conventional CNN. Since the R-CNN has more weights to describe the input image, it is very suitable for handwritten character recognition. However, because of the increase of the weights, the training of the R-CNN needs much more time than that of the conventional CNN.

In this paper, we propose a cascading training method for the R-CNN, which contains a pre-training step and a fine-tuning step. The proposed method is able to accelerate the training of the R-CNN; at the same time, the performance of the R-CNN is also improved. The experimental results on the CASIA-HWDB database [36] prove the advantage of the proposed method. In the following sections, the proposed method is introduced in detail.

II. RELATED WORK

Pre-training is common in the training of deep learning models. In [2] the layer-wise pre-training method is proposed and used for the training of the DBN. In [37] layer-wise pre-training was applied to the auto-encoder, and in [4] it was applied to the CNN.

Fine-tuning has been widely used for transfer learning based on the CNN. For example, in [38] the CNN was trained on the ImageNet database and then fine-tuned on the Caltech-101 database. Similarly, in [39] the CNN was trained on ILSVRC 2012 and then fine-tuned on VOC 2007. In [40] transfer learning is discussed in detail and the importance of fine-tuning is emphasized.

III. METHODOLOGY

The framework of the cascading training is shown in Fig. 3. First of all, there is a sample generation process, which increases the diversity of the training samples. Moreover, as mentioned above, the framework includes two training steps: the pre-training and the fine-tuning. Besides, before the fine-tuning, there is a relaxation weight initialization process. The R-CNN structure and the modules of the framework are described as follows.

A. The Structure of CNN and R-CNN

The R-CNN used in this paper is generated from a CNN structure. This CNN structure contains four convolutional layers and one fully-connected layer. As shown in Fig. 4, the last layer of the CNN is relaxed to obtain better representation ability. Besides, in order to improve the performance, the fully-connected layer is trained with dropout. At present, it is very common for deep learning models to use dropout on the fully-connected layer.
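To make the unshared-kernel idea concrete, the sketch below implements a "relaxation" (locally connected) convolutional layer in PyTorch. This is our own illustrative code, not the authors' implementation: the class name RelaxationConv2d, the use of F.unfold, and the layer sizes in the usage line are assumptions; only the idea of one kernel per output position comes from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelaxationConv2d(nn.Module):
    """Convolution-like layer whose kernel is NOT shared across positions.

    A conventional Conv2d stores one (out_ch, in_ch, k, k) kernel and slides
    it over the input; here every output location owns its own kernel, so the
    weight tensor gains an extra (out_h * out_w) dimension.  The number of
    multiply-adds per forward pass is the same as for a normal convolution,
    only the parameter count grows.
    """

    def __init__(self, in_ch, out_ch, kernel_size, in_h, in_w):
        super().__init__()
        self.kernel_size = kernel_size
        self.out_h = in_h - kernel_size + 1          # 'valid' convolution
        self.out_w = in_w - kernel_size + 1
        # one kernel per output position: (positions, out_ch, in_ch*k*k)
        self.weight = nn.Parameter(
            torch.randn(self.out_h * self.out_w, out_ch,
                        in_ch * kernel_size * kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):                            # x: (N, in_ch, H, W)
        # unfold extracts every k*k patch: (N, in_ch*k*k, positions)
        patches = F.unfold(x, self.kernel_size)
        patches = patches.permute(2, 1, 0)           # (positions, in_ch*k*k, N)
        # batched matmul applies a different kernel at every position
        out = torch.bmm(self.weight, patches)        # (positions, out_ch, N)
        out = out.permute(2, 1, 0) + self.bias[:, None]   # (N, out_ch, positions)
        return out.reshape(x.size(0), -1, self.out_h, self.out_w)
```

For example, RelaxationConv2d(32, 64, 3, in_h=8, in_w=8) applied to a (4, 32, 8, 8) batch produces a (4, 64, 6, 6) output, exactly the shape an ordinary 3 × 3 valid convolution would give, but with 36 independent kernels instead of one shared kernel.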

 

 

 
  
     

Figure 5. The process of sample generation.



Clearly, the CNN structure used here is quite large for training. Moreover, if we relax the CNN to an R-CNN, the training becomes even more time-consuming. Consequently, it is necessary to use a different training strategy to accelerate the training process.

 
B. Sample Generation

Since our model is used for handwritten character recognition, sample generation is conducted by random distortion. As shown in Fig. 5, we apply two different random transformations: the local distortion and the global distortion. The local distortion is implemented by adding a transformation field to the image, which moves each pixel by a small distance. The global distortion is simply an affine transformation. For each input training sample, we use a new set of random parameters to conduct the random distortion. Therefore, no two training samples are exactly the same during the whole training process. By doing so, the training is more efficient and the model becomes more robust.
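A minimal sketch of such a sample generation step is given below, using NumPy and SciPy. The paper only specifies a small random pixel displacement (local distortion) and a random affine transform (global distortion); the smoothing of the displacement field, the parameter ranges, and the function names are our assumptions.

```python
import numpy as np
from scipy.ndimage import affine_transform, gaussian_filter, map_coordinates

def local_distortion(img, alpha=8.0, sigma=4.0, rng=np.random):
    """Move each pixel by a small, smoothly varying random displacement."""
    h, w = img.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(img, [ys + dy, xs + dx], order=1, mode="nearest")

def global_distortion(img, max_rot=0.15, max_scale=0.1, rng=np.random):
    """Random affine transform: small rotation plus anisotropic scaling."""
    h, w = img.shape
    theta = rng.uniform(-max_rot, max_rot)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    scale = np.diag(1 + rng.uniform(-max_scale, max_scale, 2))
    m = rot @ scale
    center = np.array([h, w]) / 2.0
    offset = center - m @ center            # keep the character centred
    return affine_transform(img, m, offset=offset, order=1, mode="nearest")

def generate_sample(img, rng=np.random):
    """Fresh random parameters for every call, so no two samples repeat."""
    return global_distortion(local_distortion(img, rng=rng), rng=rng)
```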

C. Pre-training

In the pre-training step, the R-CNN is first degraded to a normal CNN, which is then trained on the training samples. In other words, as the pre-training we first train a CNN which has the same structure as the R-CNN, and then this CNN is relaxed to an R-CNN. The pre-training is similar to the training process of a conventional CNN; the difference is that the pre-training stops a little earlier. By doing so, the whole network does not get too close to some local minimum, and there is some room for the fine-tuning to reach a better state.
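A minimal sketch of this pre-training stage is shown below, assuming a PyTorch model and data loader. The optimizer, learning rate, and the 80-epoch budget (the split used in the experiments described later) are assumptions rather than prescribed values.

```python
import torch.nn as nn
import torch.optim as optim

def pretrain_cnn(cnn, train_loader, epochs=80, lr=0.01, device="cpu"):
    """Step 1: train the ordinary (weight-shared) CNN, stopping a little
    early so the network does not settle too deeply into a local minimum
    before it is relaxed to an R-CNN."""
    cnn.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(cnn.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(cnn(images), labels)
            loss.backward()
            optimizer.step()
    return cnn
```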
D. Relaxation Weight Initialization

After the pre-training, a trained CNN is obtained which already has a certain level of recognition performance. The next step is to relax the convolutional layer to create an R-CNN. In the relaxation, except for the relaxation convolutional layer, the rest of the weights of the model are kept the same as in the trained CNN. For the weights of the relaxation convolutional layer, there are two ways to initialize them, which are described as follows.

Figure 6. The duplicate initialization.

• Stochastic initialization: all the weights of the new relaxation convolutional layers are initialized with stochastic values. This is the same as the weight initialization of the pre-training.

• Duplicate initialization: as shown in Fig. 6, we copy the kernels of the pre-trained CNN to the relaxed kernels in the relaxation convolutional layers. As a result, if the input is the same, the output of the R-CNN is exactly the same as that of the pre-trained CNN. Therefore, the following fine-tuning can be seen as an improvement process starting from the performance of the pre-trained CNN.

E. Fine-Tuning

The fine-tuning is the second training process, in which the initialized R-CNN is finally trained. Usually, the fine-tuning needs much less time than the pre-training since the network is already trained to a certain level. There are also two different strategies for the fine-tuning, introduced as follows (a sketch of the duplicate initialization and of both fine-tuning modes is given after the list).

• Local fine-tuning: only the weights of the relaxation convolutional layer are adjusted, while the rest of the weights are kept unchanged.

• Global fine-tuning: the whole R-CNN is trained and all the weights are adjusted.
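The sketch below illustrates the duplicate initialization and the two fine-tuning modes. It assumes the RelaxationConv2d weight layout from the earlier sketch (one kernel per output position) and a pre-trained nn.Conv2d of matching shape; the helper names are hypothetical and not from the paper.

```python
import torch

def duplicate_initialize(relax_layer, pretrained_conv):
    """Copy the single shared kernel of the pre-trained Conv2d into every
    position-specific kernel of the relaxation layer, so that the relaxed
    network initially computes exactly what the pre-trained CNN computed."""
    with torch.no_grad():
        shared = pretrained_conv.weight.reshape(pretrained_conv.out_channels, -1)
        relax_layer.weight.copy_(shared.unsqueeze(0).expand_as(relax_layer.weight))
        relax_layer.bias.copy_(pretrained_conv.bias)

def set_finetune_mode(rcnn, relax_layer, local=False):
    """Local fine-tuning updates only the relaxation layer; global
    fine-tuning updates every weight of the R-CNN."""
    for p in rcnn.parameters():
        p.requires_grad = not local
    for p in relax_layer.parameters():
        p.requires_grad = True
```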

Figure 4. The network structure of R-CNN.

IV. EXPERIMENTAL ANALYSIS

The experiments contain two parts: the framework analysis and the performance evaluation. In the experiments, the CASIA-HWDB database was used for training, while the CASIA-HWDB competition database [41] was used for testing. In the framework analysis, since there are different strategies in the proposed framework, we tested different compositions of the strategies to find the optimal composition, as follows.

1) Stochastic-local: use stochastic initialization and local fine-tuning.
2) Stochastic-global: use stochastic initialization and global fine-tuning.
3) Duplicate-local: use duplicate initialization and local fine-tuning.
4) Duplicate-global: use duplicate initialization and global fine-tuning.

For comparison, we also tested the conventional CNN of the same structure and the same R-CNN trained without cascading training. The CASIA-HWDB contains different subsets, from CASIA-HWDB 1.0 to 1.2. In order to save time, in the framework analysis we only used CASIA-HWDB 1.1 for training. In the performance evaluation, the best strategies were used and the whole CASIA-HWDB (1.0 to 1.2) was used for training. At the same time, we also compared different deep learning methods on the same task.

A. Framework Analysis

In the experiments of the framework analysis, we fixed the number of epochs to 100. By doing so, the training time and performance can be compared fairly. For the cascading training, the first 80 epochs are used for the pre-training and the last 20 epochs for the fine-tuning. The final error rates are shown in Table I. First, one interesting observation is that the R-CNN trained normally performed even worse than the CNN, although it has more weights than the CNN. This proves the importance of the training method. Second, in the cascading training, the stochastic initialization was not as good as the duplicate initialization. This means that we should keep the pre-training results and then improve them by relaxation and fine-tuning. Moreover, the best result was obtained by duplicate-global, which proves that in the fine-tuning it is better to adjust the whole network to find a good local optimum of the R-CNN.

Table I
ERROR RATES (%) OF CNN AND R-CNN WITH DIFFERENT TRAINING METHODS.

Model/method         Error rate (%)
CNN                  5.09
R-CNN                5.26
Stochastic-local     6.57
Stochastic-global    6.12
Duplicate-local      5.06
Duplicate-global     4.82

The training time for the CNN, the R-CNN and the cascading training is shown in Table II. We can see that the training time of the normal R-CNN was about 50% longer than that of the CNN. In contrast, the time for the cascading training is not much different from the training of the CNN.

Table II
TIME CONSUMPTION OF CNN AND R-CNN WITH DIFFERENT TRAINING METHODS.

Model/method    CNN     R-CNN    Cascading training
Time (hours)    10.9    15.8     11.8

From the experimental results above, we can conclude that the cascading training method is able to train an R-CNN with better performance in almost the same time as the conventional CNN. Meanwhile, the computational cost of the recognition process of the R-CNN is exactly the same as that of the CNN. Consequently, compared with the conventional CNN, the R-CNN trained by cascading training has a great advantage.

B. Performance Evaluation

As mentioned above, the best strategy composition for the cascading training is duplicate-global. Therefore, in order to compare with GoogleNet [17], MCDNN [18] and ATRCNN [35], we re-trained the R-CNN by cascading training with the best strategies on CASIA-HWDB (1.0 to 1.2). The experimental results are shown in Table III. Clearly, the proposed method achieved the best performance among all the methods. In particular, ATRCNN [35] also used the R-CNN, but only with the normal training method; consequently, our R-CNN performed better than theirs.

Table III
ERROR RATES (%) OF DIFFERENT METHODS.

Method               Error rate (%)
GoogleNet [17]       3.65
MCDNN [18]           4.22
ATRCNN [35]          4.96
Proposed method      3.23

V. CONCLUSION AND FUTURE WORK

In this paper, we proposed an advanced training method for the R-CNN: the cascading training. By using two different training steps, the cascading training is able to train an R-CNN with better performance compared with the R-CNN trained normally. At the same time, the cascading training requires much less time than the normal training. Through the experimental results we can see that the R-CNN trained by cascading training achieved very promising results on handwritten Chinese character recognition.

In the future, we will extend the cascading training to different deep learning models, for example the normal CNN, the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and so on. Besides, we will also test different image databases to demonstrate the advantage of the cascading training.

REFERENCES

[1] G. E. Hinton, "Reducing the Dimensionality of Data with Neural Networks," Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.

[2] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[3] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, "PCANet: A simple deep learning baseline for image classification?" arXiv preprint arXiv:1404.3606, 2014.

[4] J. Masci, U. Meier, D. Ciresan, and J. Schmidhuber, "Stacked convolutional auto-encoders for hierarchical feature extraction," in Artificial Neural Networks and Machine Learning—ICANN 2011. Springer, 2011, pp. 52–59.

[5] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[6] Y. He, K. Kavukcuoglu, Y. Wang, A. Szlam, and Y. Qi, "Unsupervised feature learning by deep sparse coding," arXiv preprint arXiv:1312.5783, 2013.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," arXiv preprint arXiv:1409.4842, 2014.

[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[10] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.

[11] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation," in Advances in Neural Information Processing Systems, 2014, pp. 1799–1807.

[12] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, "Multi-digit number recognition from street view imagery using deep convolutional neural networks," arXiv preprint arXiv:1312.6082, 2013.

[13] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), vol. 2. IEEE Computer Society, 2003, pp. 958–963.

[14] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Convolutional neural network committees for handwritten character classification," in Proceedings of the International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2011, pp. 1135–1139.

[15] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3642–3649.

[16] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, "End-to-end text recognition with convolutional neural networks," in International Conference on Pattern Recognition. IEEE, 2012, pp. 3304–3308.

[17] Z. Zhong, L. Jin, and Z. Xie, "High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps," in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 846–850.

[18] D. Ciresan and U. Meier, "Multi-column deep neural networks for offline handwritten Chinese character classification," in Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015, pp. 1–6.

[19] C.-L. Liu, K. Nakashima, H. Sako, and H. Fujisawa, "Handwritten digit recognition: benchmarking of state-of-the-art techniques," Pattern Recognition, vol. 36, no. 10, pp. 2271–2285, Oct. 2003.

[20] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "Online and offline handwritten Chinese character recognition: Benchmarking on new databases," Pattern Recognition, vol. 46, no. 1, pp. 155–162, Jan. 2013.

[21] L. Lorigo and V. Govindaraju, "Offline Arabic handwriting recognition: a survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 712–724, May 2006.

[22] C.-L. Liu, "Normalization-cooperated gradient feature extraction for handwritten character recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 8, pp. 1465–1469, 2007.

[23] C.-L. Liu, K. Nakashima, H. Sako, and H. Fujisawa, "Handwritten digit recognition: investigation of normalization and feature extraction techniques," Pattern Recognition, vol. 37, no. 2, pp. 265–279, Feb. 2004.

[24] S. Uchida and M. Liwicki, "Part-based recognition of handwritten characters," in 2010 International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, 2010, pp. 545–550.

[25] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[26] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Computer Vision—ECCV 2006. Springer, 2006, pp. 404–417.

[27] C. J. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[28] F. Kimura, K. Takashina, S. Tsuruoka, and Y. Miyake, "Modified quadratic discriminant functions and the application to Chinese character recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 1, pp. 149–153, 1987.

[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, 1988.

[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[31] M. Lin, Q. Chen, and S. Yan, "Network In Network," arXiv:1312.4400 [cs], Dec. 2013.

[32] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout Networks," arXiv:1302.4389 [cs, stat], Feb. 2013.

[33] D. Hebb, The Organization of Behavior. New York: Wiley, 1949.

[34] Y. Sun, D. Liang, X. Wang, and X. Tang, "DeepID3: Face recognition with very deep neural networks," arXiv preprint arXiv:1502.00873, 2015.

[35] C. Wu, W. Fan, Y. He, J. Sun, and S. Naoi, "Handwritten character recognition by alternately trained relaxation convolutional neural network," in 2014 International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014.

[36] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "CASIA online and offline Chinese handwriting databases," in Proceedings of the International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2011, pp. 37–41.

[37] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1096–1103.

[38] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," arXiv preprint arXiv:1310.1531, 2013.

[39] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.

[40] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.

[41] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, "ICDAR 2013 Chinese handwriting recognition competition," in 2013 12th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2013, pp. 1464–1470.

