
Proceedings of the Fifth IIEEJ International Workshop on Image Electronics and Visual Computing 2017
Da Nang, Vietnam, February 28 - March 3, 2017

RECOGNITION OF PANEL STRUCTURE IN COMIC IMAGES USING FASTER R-CNN

Hideaki Yanagisawa†, Hiroshi Watanabe†



†Graduate School of Fundamental Science and Engineering, Waseda University

ABSTRACT

For efficient e-comic creation, an automatic extraction technique for comic components such as the panel layout, speech balloons, and characters is necessary. In conventional methods, comic components are extracted using geometric characteristics such as line drawings or connected pixels. However, it is difficult to extract all comic components by focusing on a particular geometric feature, because the components are drawn with a wide variety of expressions. In this paper, we extract comic components with Faster R-CNN regardless of the variety of comic expressions, and recognize the panel structure. Experimental results show that the proposed method succeeds in recognizing 67.5% of panel structures on average.

1. INTRODUCTION

The publishing industry has been shifting from traditional paper publications to e-books. In the Japanese e-book market, e-comics account for 80% of sales [1]. To improve the convenience of e-comics, services using e-comic metadata have been proposed, e.g. comic search systems that retrieve a particular scene or dialogue, or automatic digest generation systems. However, most e-comics are converted from scanned paper comics. Therefore, it is necessary to manually extract comic structure components such as the panel layout, speech balloons, characters (in this paper, we use the word 'character' to mean an actor in a comic), and so on. To reduce the cost of metadata extraction, a technique that extracts comic components automatically is important. In this paper, we evaluate a system that automatically obtains the number of speech balloons and characters in each panel of a comic using Faster R-CNN.

2. RELATED WORK

For panel layout detection, Ishii et al. [2] proposed identifying panels by detecting dividing lines using gradient concentration. Nonaka et al. [3] introduced a panel layout recognition method that detects lines and rectangles, based on the characteristic that panels are often drawn as rectangles. For speech balloon extraction, Tanaka et al. [4] proposed a method that identifies text areas using AdaBoost and detects the white areas of speech balloons. Moreover, in a study on the structure recognition of comics, Arai et al. [5] proposed a detection method for panels, speech balloons, and text areas based on image blob detection and an extraction function using a modified connected component labeling (CCL) method. For character detection, Ishii et al. [6] proposed an approach that uses machine learning with HOG features to detect character face areas. We applied Fast R-CNN to character face detection [7]; in that study, Fast R-CNN showed a higher detection rate than HOG features.

These conventional methods extract comic components according to geometric characteristics, e.g. line detection or connected-pixel extraction. However, in some comic images, panels and speech balloons are drawn with special expressions. Therefore, it is difficult for such methods to detect components that are not drawn in the expected shapes or that overlap other objects.

3. FASTER R-CNN

Girshick et al. [8] proposed Regions with Convolutional Neural Network features (R-CNN) as a general object detection method using a convolutional neural network (CNN). R-CNN detects objects in the following process. First, region proposals are extracted from the input image by selective search [9]. Second, each region proposal is fed to a CNN and image feature values are calculated. Then, the output feature values are classified by a support vector machine (SVM). Finally, the deviation of each region proposal is corrected by bounding box regression. However, R-CNN is slow because it computes the convolutional network features separately for every object proposal. To address this problem, Fast R-CNN was introduced [10]. Fast R-CNN enables end-to-end detector training on shared convolutional features and therefore shows compelling accuracy and speed.

Ren et al. [11] proposed Faster R-CNN as a further improved object detection technique. Faster R-CNN is a single network that connects Fast R-CNN with a Region Proposal Network (RPN) sharing full-image convolutional features with the detection network. The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. In addition, the RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. Therefore, Faster R-CNN detects objects more quickly and shows higher detection accuracy than previous state-of-the-art methods.

4. PROPOSED METHOD

Fig.1 Flow diagram of panel structure recognition (comic images © Atsushi Sasaki)

We propose a method for panel structure recognition from comic images based on the detection of panels, speech balloons, and character faces. We create annotations of comic images by specifying the peripheral region of each component as a rectangle, and three types of detectors are generated by training Faster R-CNN. The flow diagram of panel structure recognition is shown in Fig.1. First, panels are detected from the input image and sorted. The sorting order is based on the height (vertical position) of the detected areas; if the heights are the same, panels are sorted from the right side. Figure 2 shows example images of panel locations and sorting orders. Because there is a slight shift in the position of each panel detected by Faster R-CNN, the y-coordinates are normalized in steps of 50 pixels. Next, speech balloons and character faces are detected. Each of them is assigned to the panel that overlaps more than 50% of its detected area. If a component overlaps 50% or more with multiple panels, as seen in Fig.3, the component is assigned to the panel that comes later in the sorting order. Finally, the numbers of speech balloons and character faces belonging to each panel are obtained.

Fig.2 Examples of panel sorting: (a), (b) (comic images © Hishika Minamisawa)
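The sorting and assignment rules above can be summarized in a short sketch. The following Python code is illustrative only and is not the authors' implementation; boxes are assumed to be axis-aligned rectangles (x1, y1, x2, y2), and all function names are ours.

```python
# Illustrative sketch of the panel sorting and component assignment rules
# described above (not the authors' code). Boxes are (x1, y1, x2, y2).

def sort_panels(panels, y_step=50):
    """Sort panels top-to-bottom, then right-to-left within a row.
    y-coordinates are quantized to y_step pixels to absorb small detection shifts."""
    return sorted(panels, key=lambda b: (round(b[1] / y_step), -b[2]))

def overlap_ratio(component, panel):
    """Fraction of the component box that is covered by the panel box."""
    x1 = max(component[0], panel[0]); y1 = max(component[1], panel[1])
    x2 = min(component[2], panel[2]); y2 = min(component[3], panel[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (component[2] - component[0]) * (component[3] - component[1])
    return inter / area if area > 0 else 0.0

def assign_components(components, sorted_panels, thresh=0.5):
    """Assign each balloon/face to a panel it overlaps by more than 50%.
    If several panels qualify, the one later in the sorting order wins."""
    counts = [0] * len(sorted_panels)
    for comp in components:
        hits = [i for i, p in enumerate(sorted_panels) if overlap_ratio(comp, p) > thresh]
        if hits:
            counts[max(hits)] += 1  # panel that comes later in reading order
    return counts
```

The per-panel counts returned by assign_components, computed once for balloons and once for faces, correspond to the output of the flow in Fig.1.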

5. EXPERIMENT

In this section, we evaluate the detection accuracy of comic components using Faster R-CNN, as well as the recognition accuracy of panel structures. In this experiment, we use the implementation published at https://github.com/rbgirshick/py-faster-rcnn [11] for training and evaluation of Faster R-CNN, and use vgg_cnn_m_1024 [12] as the CNN architecture for training. Datasets for training and evaluation are made of comic images provided in the Manga109 database (http://www.manga109.org/) [13]. The training dataset consists of 100 images from each of 20 comic titles drawn by different authors. The test dataset consists of 30 images from each of 5 comic titles, referred to as Comic A to Comic E, drawn by authors different from those of the training images.
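As a concrete illustration of how such a three-class detector can be set up, the sketch below uses the PyTorch/torchvision Faster R-CNN API as a stand-in for the Caffe-based py-faster-rcnn and VGG_CNN_M_1024 configuration used in the paper; the class list follows Section 4, while the library, backbone, and all identifiers are our assumptions.

```python
# Illustrative only: a torchvision stand-in for the Caffe-based setup in the paper.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

CLASSES = ["__background__", "panel", "balloon", "face"]

def build_detector(num_classes=len(CLASSES)):
    # Start from a pretrained Faster R-CNN and replace the box predictor head
    # so that it outputs the three comic-component classes plus background.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# During training, each annotated page is paired with a target dict of the form
# {"boxes": FloatTensor[N, 4] in (x1, y1, x2, y2), "labels": Int64Tensor[N]},
# which is the format the torchvision detection models expect.
```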

Fig.3 Example of panel structure recognition: Panel 1 has 2 characters and 3 balloons; Panel 2 has 1 character and 2 balloons (comic image © Hishika Minamisawa)

In this experiment, we define a true positive as a detected area that overlaps the correct area by more than 50%.

5.1. Iteration number

We verified the relationship between the number of training iterations of Faster R-CNN and the average precision (AP) for each comic component. AP is the average of the precision values at each level of recall. In this experiment, AP is calculated for the 2,000 images in the training dataset and the 150 images in the test dataset. Experimental results are shown in Fig.4, where the x-axis indicates the iteration number and the y-axis indicates AP. The results show that the detection rates increase with the number of iterations. In addition, when the iteration number exceeds 70,000, the AP for the training images converges.

Fig.4 Relationship between average precision and iteration number: (a) Panel detection, (b) Speech balloon detection, (c) Character face detection (curves for training and test sets)

5.2. Threshold of confidence

We evaluate the detailed results of comic component detection for the 150 images in the test dataset using the detectors trained with 70,000 iterations. Faster R-CNN calculates a confidence score for the object in each region proposal and outputs a region when its confidence is larger than a threshold. In this experiment, the threshold of confidence is set to 0.6 for panel detection and to 0.8 for speech balloon and character face detection. These thresholds are heuristic values. Experimental results are shown in Table 1. In this table, "Total" is the total number of comic components in the test images, "TP" is true positives, "FN" is false negatives, and "FP" is false positives. We also report recall (R) and precision (P). Table 2 shows the detection results for panels and speech balloons by the method of [5] on the same test set.

Experimental results show that the precision of Faster R-CNN is more than 90%, and the method exceeds the conventional method in panel and speech balloon detection. Examples of detection results are shown in Fig.5. The figure shows that blob extraction has difficulty separating panels when a panel overlaps another panel. In contrast, Faster R-CNN can detect panels independently of such layouts.
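A minimal sketch of this evaluation rule follows, interpreting "overlapping more than 50%" as the intersection area divided by the ground-truth area (intersection-over-union would be a common alternative reading); detections are assumed to have already been filtered by the confidence thresholds above, and all helper names are ours.

```python
# Sketch of the TP/FN/FP counting behind Tables 1 and 2 (our interpretation).

def matches(det_box, gt_box, thresh=0.5):
    """True if the detected box covers more than `thresh` of the ground-truth box."""
    x1 = max(det_box[0], gt_box[0]); y1 = max(det_box[1], gt_box[1])
    x2 = min(det_box[2], gt_box[2]); y2 = min(det_box[3], gt_box[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return gt_area > 0 and inter / gt_area > thresh

def score_detections(detections, ground_truths):
    """Greedy matching of detections to ground truths for one component class."""
    unmatched_gt = list(ground_truths)
    tp = fp = 0
    for det in detections:
        hit = next((g for g in unmatched_gt if matches(det, g)), None)
        if hit is not None:
            tp += 1
            unmatched_gt.remove(hit)  # each ground truth can be matched only once
        else:
            fp += 1
    fn = len(unmatched_gt)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return tp, fn, fp, recall, precision
```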

5.3. Recognition rate of panel structure

We evaluate the recognition accuracy of panel structures for 30 pages from each of the 5 comics. The recognition accuracy is defined as follows: "B" is the percentage of panels for which the number of speech balloons is correctly extracted, "C" is the percentage of panels for which the number of character faces is correctly extracted, and "B + C" is the percentage of panels for which both numbers are correctly extracted. The experimental results are shown in Table 3. The highest value of B + C is 84.9% for Comic B and the lowest is 52.8% for Comic E.

A typical case of failure in panel structure recognition is a detection failure caused by deformed faces, as shown in Fig.6. In addition, the reason for the low recognition rate of Comic E is that it contains fuzzy panel layouts, as shown in Fig.7. In Fig.6 and Fig.7, the red rectangles show the areas detected as comic components.

Fig.5 Examples of panel detection for flat panels and connected panels: (a) by the method of [5], (b) by Faster R-CNN (comic images © Atsushi Sasaki)

Fig.6 Example of failure to detect character faces
Fig.7 Example of failure to detect panels in Comic E
(comic images in Fig.6 and Fig.7: © Satoshi Arai, © Saya Miyauchi)

Table 1 Results of comic component extraction for 5 comic sources by Faster R-CNN

            Total    TP    FN    FP   R (%)   P (%)
Panel         859   770    90    40    89.5    95.1
Balloon      1190  1161    29    42    97.6    96.5
Character     937   803   134    50    85.7    94.1

Table 2 Results of comic component extraction for 5 comic sources by [5]

            Total    TP    FN    FP   R (%)   P (%)
Panel         859   481   378   183    56.0    72.4
Balloon      1190   790   400   650    66.4    54.9

Table 3 Results of panel structure recognition for 5 comic sources

            B (%)   C (%)   B + C (%)
Comic A      83.0    74.5        68.1
Comic B      91.4    89.8        84.9
Comic C      81.7    72.8        66.3
Comic D      94.6    69.0        65.2
Comic E      62.3    62.9        52.8
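The B, C, and B + C rates in Table 3 can be computed per panel as sketched below; the dictionary keys are illustrative and not taken from the authors' code.

```python
# Sketch of the panel structure scores in Section 5.3 (names are ours).

def structure_rates(panels):
    """panels: one dict per panel with predicted and ground-truth counts,
    e.g. {"balloons": 2, "faces": 1, "gt_balloons": 2, "gt_faces": 2}."""
    if not panels:
        return 0.0, 0.0, 0.0
    b = c = bc = 0
    for p in panels:
        balloon_ok = p["balloons"] == p["gt_balloons"]  # counts toward B
        face_ok = p["faces"] == p["gt_faces"]           # counts toward C
        b += balloon_ok
        c += face_ok
        bc += balloon_ok and face_ok                    # counts toward B + C
    n = len(panels)
    return 100.0 * b / n, 100.0 * c / n, 100.0 * bc / n  # percentages as in Table 3
```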
6. CONCLUSION & FUTURE WORK

In this paper, we evaluated panel structure recognition using Faster R-CNN. Experimental results show that the proposed method succeeds in recognizing 67.5% of panel structures on average.

As future work, there is room for improvement in the detection of panels and character faces that are hard to detect with the present method. One specific technique would be to combine image processing, such as highlighting the division lines of panels, with Faster R-CNN detection. In addition, to obtain metadata for the automatic generation of comic summaries, we need to consider a technique for distinguishing the main characters among the detected character faces.

7. REFERENCES

[1] Internet Media Research Institute: "eComic Marketing Report 2012", Impress R&D, pp.14 (2012).

[2] D. Ishii, K. Kawamura, H. Watanabe: "A Study on Frame Decomposition of Comic Images", IEICE Transactions, Vol.J90-D, No.7, pp.1667-1670 (2007).

[3] S. Nonaka, T. Sawano, N. Haneda: "Development of "GT-Scan", the Technology for Automatic Detection of Frames in Scanned Comic", FUJIFILM RESEARCH & DEVELOPMENT, No.57, pp.46-49 (2012).

[4] T. Tanaka, F. Toyama, J. Miyamichi, K. Shoji: "Detection and Classification of Speech Balloons in Comic Images", Journal of the Institute of Image Information and Television Engineers, Vol.64, No.12, pp.1933-1939 (2010).

[5] K. Arai, H. Tolle: "Method for Real Time Text Extraction from Digital Manga Comic", International Journal of Image Processing, Vol.4, No.6, pp.669-676 (2011).

[6] D. Ishii, H. Watanabe: "A Study on Automatic Character Detection and Recognition from Comics", The Journal of the Institute of Image Electronics Engineers of Japan, Vol.42, No.4 (2013).

[7] H. Yanagisawa, H. Watanabe: "A Study of Multi-view Face Detection for Characters in Comic Images", Proceedings of the 2016 IEICE General Conference, D-12-12 (2016).

[8] R. Girshick, J. Donahue, T. Darrell, J. Malik: "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation", IEEE Conference on Computer Vision and Pattern Recognition (2014).

[9] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders: "Selective Search for Object Recognition", International Journal of Computer Vision, Vol.102, No.2, pp.154-171 (2013).

[10] R. Girshick: "Fast R-CNN", arXiv:1504.08083 (2015).

[11] S. Ren, K. He, R. Girshick, J. Sun: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", Advances in Neural Information Processing Systems (NIPS) (2015).

[12] S. Farfade, M. Saberian: "Multi-view Face Detection Using Deep Convolutional Neural Networks", arXiv:1502.02766 (2015).

[13] Y. Matsui, K. Ito, Y. Aramaki, T. Yamasaki, K. Aizawa: "Sketch-based Manga Retrieval using Manga109 Dataset", arXiv:1510.04389 (2015).
