
Delft University of Technology, Bachelor Seminar of Computer Science and Engineering

Automated Text-Image Comic Dataset Construction


Maciej Styczeń
under supervision of
Lydia Chen, Zilong Zhao
Delft University of Technology

Abstract—Comic illustrations and transcriptions form an attractive dataset for several problems, including computer vision tasks, such as recognizing characters' faces or generating new comics, and natural language processing tasks, like automated comic translation or detecting emotion in dialogues. However, despite the large number of comic strips published online, very few datasets of annotated comic illustrations are available. This forms a bottleneck for further advancements in the field. The source of the data scarcity is the manual labor required for annotation — one has to download the comic strips, separate each strip into panels (individual illustrations), and transcribe the text. Automating the process is needed, but it poses several challenges. Panel detection in comic strips is non-trivial, due to the varying layouts and styles of comics. Automated transcription is also challenging, as out-of-the-box optical character recognition (OCR) models struggle with diverse fonts, handwriting styles, and backgrounds. We design an automatic comic text-image dataset construction pipeline, termed DCP, consisting of three components: (i) web scraping, (ii) panel extraction, and (iii) text extraction. A multi-threaded comic scraper is created to download all the comics. A panel extraction algorithm based on panel frame detection is developed to divide the comic strips into individual illustrations. Lastly, to effectively extract the text using OCR, we propose additional pre-processing and post-processing steps, namely up-scaling and binarizing images, clustering-based text ordering, and dictionary-based autocorrect. We extensively evaluate the prototype of DCP on three comic series: PHD Comics, Dilbert, and Garfield. Web scraping is used to download over 25000 comic strips at an average pace of 149ms per image. Panel extraction results on 1118 panels show success rates of 100%, 97%, and 71% for Dilbert, PHD Comics, and Garfield respectively, outperforming the baseline in terms of accuracy and speed. The text extraction algorithm, tested on 1100 comics, achieves a 7x error reduction compared to out-of-the-box OCR.

Fig. 1: Input (left) and output (right) of the dataset creation process. Comic from "Piled Higher and Deeper" by Jorge Cham.

I. INTRODUCTION

Comic series like Garfield [1], PHD Comics [2], or Dilbert [3] contain thousands of comic strips, forming an attractive dataset for several experiments. For example, current face detection and recognition systems consistently perform well on human faces [4]. The same models, if appropriately trained, could prove equally successful when applied to comic characters' faces. The introduction of Generative Adversarial Networks (GANs) [5] sparked rapid advancements in the field of image synthesis, including generating images based on a descriptive piece of text [6, 7]. Such GAN models can be used to research synthesizing comic illustrations based on their transcriptions. Natural language processing tasks, such as humor detection [8] and automated translation, could also be applied to comic dialogues. However, to be able to perform those innovative experiments, a large, high-quality dataset is required.

Data collection and preparation is a significant bottleneck in machine learning [9]. Obtaining enough high-quality training samples is necessary for the success of machine learning systems [10]. However, manual preparation can be tedious for large datasets that require complex pre-processing. That is precisely the case for the comics data — to construct a text-image comic dataset, a multi-phase preparation procedure is required. The comic strips need to be downloaded from the web, then each comic strip must be separated into individual illustrations (panels), and the text transcription has to be extracted (see Fig. 1). Performing that preparation procedure manually for tens of thousands of comic strips is unfeasible, therefore automating it would be valuable.

The purpose of this paper is to design, implement, and evaluate an automatic comic text-image dataset construction pipeline, termed DCP, consisting of three components: (i) web scraping, (ii) panel extraction, and (iii) text transcription, see Fig. 2. To achieve that, current state-of-the-art web scraping, image segmentation, and optical character recognition (OCR) techniques are evaluated, adapted, and combined. More specifically, the goal is to answer the following research questions:

1) How to efficiently scrape the image data from comic websites?
2) How to create a panel extraction algorithm that can deal with varying comic layouts?
3) How accurately can the state-of-the-art OCR tools extract the text from comic strips?
4) What pre-processing and post-processing techniques can be applied to the images to improve the performance of the OCR models?
Fig. 2: An overview of the automated dataset creation pipeline.

To automatically perform the dataset creation without sacrificing data quality, one needs to overcome several challenges. Comic panel extraction is a difficult task, as the layouts and drawing styles of comics vary significantly. Evaluation of the existing segmentation methods, for example using the intersection-over-union score, is needed to find one that performs consistently on comic illustrations. Transcribing comics using OCR is also challenging, as current OCR models struggle with varying handwriting styles and backgrounds [11]. This research attempts to improve the OCR accuracy by pre-processing the input using image binarization with adaptive thresholding [12] and up-scaling. Another difficulty is the correct ordering of the words output by the OCR. By default, the output is ordered in a line-by-line manner, but for comics the panel-by-panel and bubble-by-bubble orderings also have to be considered. Text extraction is performed at the panel level and optionally concatenated to obtain the full comic transcription. Hierarchical clustering [13] is applied to group all bounding boxes belonging to the same speech bubble and determine the right ordering.

Section II covers the related work and presents the current state-of-the-art in the domain. Section III describes the methodology behind this paper's contributions: it introduces the design of the pipeline, describes in detail the created panel extraction algorithm, and presents the solutions developed to improve the performance of the OCR. Section IV gives insight into the conducted experiments and their results, aiming to give a detailed evaluation of each of the steps of the DCP. Section V discusses the results and gives ideas for future work on the topic, and Section VI reflects on the ethical implications and reproducibility of the findings.

II. RELATED WORK

We structured the related work into the following categories: work regarding automated dataset creation, web scraping techniques, proposed panel extraction methods, and text recognition in comic strips.

Existing literature shows success in automated dataset building, including web-scraped image datasets such as FaceScrub [14]. However, automated pre-processing of comic strip images remains unexplored territory. Furthermore, no publications on automated dataset creation involve image segmentation and text extraction steps in the pipeline. This paper aims to explore this gap.

Web scraping has become common and achievable — with the support of modern programming languages and libraries, it is possible to retrieve and process the contents of an arbitrary static website, such as the comic strip websites [15].

Panel extraction has been studied widely, mostly in the context of applying it to mobile comic viewing apps [16]. Some solutions use convolutional neural networks (CNNs) to train an object detection model that can find panel positions in a comic [17]. However, this approach would most likely require additional training to perform well on unseen comic series, therefore it is not easily applicable to this paper's problem. Other methods utilize image processing techniques, such as mathematical morphology and region growing, for finding the background and extracting the panels [18, 19, 20]. Those solutions are better suited for our use case, and they form the basis for the developed panel extraction algorithm.

Current text recognition solutions have a very high accuracy at identifying and extracting text, but still have some limitations, including dealing with colored backgrounds, small fonts, and handwritten text [11]. Unfortunately, those are all present in comics — comic strips are characterized by a high variability of fonts and styles, often along with complex backgrounds, noise, and low resolution. To overcome those difficulties, researchers propose domain-specific training of OCR models [21] and output post-processing, such as text validation and correction [22]. The reading order of a comic is not straightforward either: comics are read panel after panel, bubble after bubble, unlike most documents, where top-to-bottom, left-to-right ordering is sufficient. Overall, comics form a challenge for OCR, and current engines cannot deal with them out of the box.

III. DCP: DATASET CONSTRUCTION PIPELINE

This section describes in detail the techniques used to build the automated dataset construction pipeline. First, the system design and its components are presented. Then, for each step of the DCP pipeline, the proposed method is explained and motivated.

A. System design

The system is designed as a pipeline consisting of three components: (i) web scraping, (ii) panel extraction, and (iii) text extraction; see the overview in Figure 2.

The input to the system is a list of URLs to raw comic images. The URLs are passed to the first stage of the pipeline: the web scraper. The scraper downloads all the comics and stores them locally. Then, the full comic strips need to be segmented into individual illustrations: the images are fed to the panel extraction stage, where each full comic is divided into panels, and the panels are saved to disk. Finally, to perform automated text transcription, the illustrations are fed into the OCR stage. First, pre-processing is applied to the images to make the text easier to detect, then the illustrations are sent to the OCR engine, and lastly, the OCR output is post-processed to reduce the error. The output of the system consists of the individual panels along with their transcriptions, or optionally, the original, full comics with their transcriptions obtained via concatenating the individual panels' transcriptions.
Fig. 3: A visualization of the steps of the panel extraction procedure, for simplicity presented on a single-panel example. The same steps apply to a strip with multiple panels. From the left: (a) the original image in grayscale, (b) the binarized image, (c) all the contours identified in the image, (d) the outermost contours, (e) the outermost contours filled with white color, (f) noise removal using morphological opening, (g) the proposed panel bounding box. (Comic strip from "Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com [2])

B. Downloading the comics

Downloading the comic images can be achieved using the following scraping procedure:

1) For each comic website, list the URLs of all the comic images.
2) For each URL, send an HTTP GET request.
3) For each response, extract the image and save it to disk.

Performing these operations on a single thread does not scale well, as, for each URL, the process would have to wait for the server's response before it can send the next request. Therefore, multi-threading is utilized to improve efficiency — several threads send requests and deal with the responses in parallel, achieving a significant speedup.

There exists extensive library support for web scraping in most popular programming languages. For instance, the process can be implemented easily using Python's requests and BeautifulSoup libraries. The solution for this task can be re-used for an arbitrary comic website — the only additional work when introducing a new comic source is getting the list of all the image URLs.
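As an illustration, the following is a minimal sketch of such a multi-threaded scraper, built on Python's requests library and a thread pool. The file-naming scheme and the default thread count are illustrative assumptions, not the exact choices made in DCP.

import os
from concurrent.futures import ThreadPoolExecutor

import requests

def download_image(url: str, out_dir: str) -> None:
    """Send an HTTP GET request and save the response body to disk."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Illustrative naming: use the last URL segment as the file name.
    filename = os.path.join(out_dir, url.rsplit("/", 1)[-1])
    with open(filename, "wb") as f:
        f.write(response.content)

def scrape(urls: list, out_dir: str, n_threads: int = 10) -> None:
    """Download all comic images in parallel, one task per URL."""
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map blocks until all downloads have completed
        list(pool.map(lambda url: download_image(url, out_dir), urls))

Introducing a new comic source then reduces to producing the list of image URLs, for example by parsing the site's archive pages with BeautifulSoup.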
C. Panel extraction

The panel extraction process leverages the presence of frames around comic panels to perform the segmentation. An overview of the process is presented in Fig. 3.

In the first two steps, the image is converted to grayscale and binarized using adaptive Gaussian thresholding, as visualized in Fig. 3a-b. Adaptive thresholding establishes the threshold value separately for each pixel, based on its neighborhood, resulting in less noise than global thresholding and preserving the contours and features in the image. Usually, this results in an image that is easier to analyze [23], see Fig. 4.

Fig. 4: Original image vs. global threshold vs. adaptive threshold [23]

After binarization, contours are identified in the image. Only the outermost, top-level contours are interesting for this task, as those form the candidates for the panel bounding boxes (see Fig. 3c-d).

After identifying the outermost contours, they are filled in with white color, but the resulting image still has some noise, see for example the "WWW.PHDCOMICS.COM" text at the bottom of Fig. 3e. A morphological opening operation is applied to remove the noise. It is a combination of erosion and dilation operations, allowing for the removal of small elements of the image while preserving the shape and size of the larger ones.

The resulting image, as presented in Fig. 3f, has no more noise. The contours present in that image determine the position of the final bounding box, see Fig. 3g. Once the algorithm identifies the positions of all the panel frames, the original image is divided into separate illustrations by extracting the areas surrounded by the bounding boxes and saving them as new images.
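A condensed sketch of this procedure, assuming OpenCV as the image-processing library, is given below. The threshold block size, the opening kernel size, and the minimum panel area are illustrative values, not the tuned parameters of DCP.

import cv2
import numpy as np

def extract_panels(image_path: str) -> list:
    """Return the panel sub-images of one comic strip (steps a-g of Fig. 3)."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)                      # (a)
    binary = cv2.adaptiveThreshold(gray, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 51, 10)     # (b)
    # Keep only the outermost contours: candidate panel frames (c-d).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(binary)
    cv2.drawContours(mask, contours, -1, 255, thickness=cv2.FILLED)   # (e)
    # Morphological opening removes small noise elements (f).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # The remaining contours yield the final panel bounding boxes (g).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    panels = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h >= 10000:  # assumed minimum area, to skip stray specks
            panels.append(img[y:y + h, x:x + w])
    return panels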
Fig. 5: Example of bounding-box clustering: (a) all bounding boxes, (b) identified clusters. Illustration taken from https://dilbert.com.

Fig. 6: Example of text removal from an illustration: (a) initial illustration, (b) before denoising, (c) final image. Illustration taken from https://dilbert.com.

D. Text extraction

After the comic is segmented into individual illustrations, text extraction has to be performed on each illustration. That can be achieved using optical character recognition — a conversion of an image representation containing text into plain text strings.

1) Engines: There is a wide selection of OCR software on the market, therefore it is not feasible to test all of it, but the two most popular engines are selected:

• Tesseract [24] - the leading open-source OCR engine.
• Google's Vision API OCR [25] - the state-of-the-art commercial OCR API.

2) Pre-processing: Pre-processing techniques, such as up-scaling and binarization, can be applied to images to improve the performance of OCR [26].

a) Upscaling: Character height is considered a key factor for OCR output quality, and for optimal performance it should be between 20 and 40 pixels. Unfortunately, the character heights in the comic datasets are often much smaller — an up-scaling step is needed. To know the re-scaling factor, one has to find the character height in the original image. For this purpose, an initial OCR pass is performed on the unprocessed image, and the character height is defined as the median of the heights of the bounding boxes returned by the engine. Then, images are re-scaled using cubic interpolation by a factor of (desired letter height) / (initial letter height), where a common value for the desired letter height is 30 pixels.

b) Binarization: The images are then binarized using adaptive thresholding [12] and fed into the final OCR pass, potentially resulting in better accuracy than before pre-processing.
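A sketch of the up-scaling step is shown below. The ocr_boxes argument is a hypothetical callback around the chosen OCR engine that returns word bounding boxes as (x, y, w, h) tuples; the target height of 30 pixels follows the text above.

import statistics

import cv2

DESIRED_LETTER_HEIGHT = 30  # common target value, in pixels

def upscale_for_ocr(img, ocr_boxes):
    """Re-scale img so the median character height reaches the target."""
    boxes = ocr_boxes(img)  # initial OCR pass on the unprocessed image
    if not boxes:
        return img
    current = statistics.median(h for (x, y, w, h) in boxes)
    factor = DESIRED_LETTER_HEIGHT / current
    if factor <= 1.0:
        return img  # characters are already large enough
    return cv2.resize(img, None, fx=factor, fy=factor,
                      interpolation=cv2.INTER_CUBIC)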
3) Post-processing:

a) Clustering: The OCR output consists of the detected words along with their bounding boxes, see Fig. 5a. The boxes are initially ordered top-to-bottom, left-to-right, as on a standard printed text page. However, this does not work for comics; for example, for Fig. 5a, this would result in the following output:

"climate change is caused by that's gravity. right!"

The source of this issue is the lack of information about the composition of the comic bubbles, which is crucial for determining the correct order of words in comic dialogues. Comics are read bubble by bubble, rather than simply line by line. To correct the output, the bounding boxes need to be grouped into clusters corresponding to bubbles, as presented in Fig. 5b. We calculate the bubbles' centers and use them to sort the bubbles by their centers' x and y positions. We can then read the text individually for each bubble and concatenate the results to obtain the corrected transcription:

Bubble 1: "climate change is caused by gravity"
Bubble 2: "that's right!"
Concatenated: "climate change is caused by gravity. that's right!"

The bubble grouping can be performed using agglomerative hierarchical clustering [13, 27]. Initially, each bounding box starts alone, in a singleton cluster. Then clusters are merged until all the pairwise distances between clusters are higher than a certain threshold. To determine the distance between two clusters, a single-linkage approach is used - the distance between clusters is the minimal distance between a pair consisting of elements of those two clusters (one from each). The distance between two bounding boxes is defined as the sum of the minimum spacings between their edges in the x and y directions, see Fig. 7. Additionally, if there is overlap in an axis, the spacing in that axis is set to 0. The distance threshold for clusters can be determined based on the calculated letter height — the spacing between two lines of text within one bubble would rarely be larger than the height of one letter.

Fig. 7: x and y spacings between two bounding boxes.
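A sketch of the bubble grouping under these definitions, using SciPy's single-linkage clustering, could look as follows; the (x, y, w, h) box representation and the function names are assumptions.

from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def box_distance(a, b):
    """Sum of the axis-wise gaps between two boxes; 0 in an axis with overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    gap_x = max(0, max(ax, bx) - min(ax + aw, bx + bw))
    gap_y = max(0, max(ay, by) - min(ay + ah, by + bh))
    return gap_x + gap_y

def group_into_bubbles(boxes, letter_height):
    """Assign each bounding box a bubble label via single-linkage clustering."""
    if len(boxes) < 2:
        return [1] * len(boxes)
    condensed = pdist(boxes, box_distance)
    tree = linkage(condensed, method="single")
    # Merging stops once all inter-cluster distances exceed the threshold.
    return fcluster(tree, t=letter_height, criterion="distance")

Using the estimated letter height as the threshold t follows the heuristic above: lines within one bubble are rarely separated by more than one letter height.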
TABLE I: Web scraping and panel extraction evaluation.

(a) Web scraping: average time to scrape one image (in seconds) and the speed-up factor achieved by the use of parallelization. Time estimates based on scraping 1000 Dilbert, 1000 PHD Comics, and 1000 Garfield images from the web.

             Dilbert   PHD Comics   Garfield   Average
1 thread     2.72      1.46         0.43       1.54
10 threads   0.18      0.22         0.046      0.149
speed-up     15.1      6.6          9.3        10.33

(b) Comparison of panel extraction performance between our method (DCP) and Kumiko [20]: single-panel and full-strip success rates, intersection-over-union scores, and time efficiency. Tested on 300 comic strips with a total of 1118 panels.

                   Dilbert           PHD Comics        Garfield
                   Kumiko    DCP     Kumiko    DCP     Kumiko    DCP
Panel succ. rate   97%       100%    92%       99%     91%       91%
Strip succ. rate   95%       100%    78%       96%     73%       72%
Average IoU        0.99      0.99    0.96      0.98    0.97      0.95
Time per comic     680ms     2.3ms   400ms     1.1ms   890ms     1.2ms

b) Autocorrect: Single-character mistakes are very common in the OCR output — often the majority of the characters are detected correctly, but some letters are classified as a different character than they actually are. In such cases, it can be beneficial to make use of spelling correction algorithms. The TextBlob [28] library provides an autocorrect implementation based on Peter Norvig's correction algorithm [29]. That implementation is used to correct the OCR output, aiming to reduce the error.
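For illustration, the correction step then reduces to a single call; the sample strings are hypothetical, and the actual corrections depend on TextBlob's dictionary.

from textblob import TextBlob

def autocorrect(ocr_text: str) -> str:
    """Apply TextBlob's Norvig-style spelling correction to OCR output."""
    return str(TextBlob(ocr_text).correct())

# Hypothetical example:
# autocorrect("climate chanqe is caused by gravity")
# -> "climate change is caused by gravity"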
4) Text removal from image: Removing text from comic illustrations is an important feature for several use cases. For example, with the help of automated translation tools, one can remove the original text and print new, translated text on the image. Moreover, the text is a prominent feature of the image that will cause severe noise problems when trying to train models for illustration generation - removing it simplifies the data and allows the models to focus on the primary elements of the illustration, such as characters or objects.

The text removal method proposed in this paper is based on the bounding boxes found by the OCR model. For each bounding box, a binary mask is established based on binary thresholding: 1s correspond to dark spots (letters), and 0s correspond to the background. Then, the background color is calculated by taking the average color value of the non-text pixels. The letter pixels are then colored with the background color, as presented in Fig. 6b. To give a smoother look to the image, denoising and blurring are applied, see Fig. 6c.
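A simplified sketch of this procedure, assuming OpenCV and NumPy, is shown below; the darkness threshold and the final denoising call are illustrative choices, not necessarily the ones used in DCP.

import cv2
import numpy as np

def remove_text(img, boxes, dark_threshold=128):
    """Fill OCR-detected letter pixels with the average background color."""
    out = img.copy()
    for (x, y, w, h) in boxes:
        region = out[y:y + h, x:x + w]       # view into out, edited in place
        gray = cv2.cvtColor(region, cv2.COLOR_BGR2GRAY)
        letters = gray < dark_threshold      # mask: 1s = dark letter pixels
        if letters.all():
            continue                         # no background pixels to sample
        background = region[~letters].mean(axis=0)  # average non-text color
        region[letters] = background
    # Denoising smooths the filled areas for a cleaner look.
    return cv2.fastNlMeansDenoisingColored(out)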
IV. EXPERIMENTS AND RESULTS

A. Datasets

The evaluation is performed by attempting to automatically construct illustration-transcription datasets for PHD Comics, Garfield, and Dilbert — three popular comic series. All of them have been consistently updated with new comics for decades, providing thousands of comic strips to work with.

B. Comic scraping

The web scraper is evaluated by downloading comic strips from dilbert.com, phdcomics.com, and pt.jikos.cz/garfield. Over 14000 Garfield, 12000 Dilbert, and 2100 PHD Comics strips are downloaded. Multi-threaded scraping is significantly faster than single-threaded, with speed-up factors of 15.1 for Dilbert, 6.6 for PHD Comics, and roughly 10 for Garfield, see Table Ia. To give a better idea of the scale, scraping all 12000 Dilbert comics would take approximately 9 hours with a single thread, but only 35 minutes with 10 threads.

C. Panel extraction

When evaluating panel extraction, one needs to compare the detected segments' locations with the manually marked, ground-truth locations. Intersection over Union (IoU), also known as the Jaccard index, is a commonly used metric for measuring region overlap for all kinds of segmentation or detection tasks. The IoU is defined as the size of the intersection divided by the size of the union of two sets [30], see Equation 1.

J(A, B) = |A ∩ B| / |A ∪ B|    (1)

In the case of panel extraction, the IoU is the area of the overlap between the detected bounding box and the ground-truth one, divided by the area of the union of those two bounding boxes. The IoU score will approach 1 if the bounding boxes are the same and 0 if there is no overlap between them, see Fig. 8.

Fig. 8: Example IoU scores for bounding box evaluation. Taken from Wikipedia [30].

In this experiment, a panel is marked as correctly detected if the IoU for the detected and ground-truth bounding boxes is at least 0.9. The comic strip is marked as correctly segmented if all panels are correctly detected.
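For two axis-aligned boxes given as (x, y, w, h) tuples, the metric can be computed as in the following helper; this is a sketch, not necessarily the evaluation code used in the experiments.

def iou(a, b):
    """Intersection over Union (Eq. 1) of two axis-aligned bounding boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    intersection = ix * iy
    union = aw * ah + bw * bh - intersection
    return intersection / union if union else 0.0

assert iou((0, 0, 10, 10), (0, 0, 10, 10)) == 1.0  # identical boxes
assert iou((0, 0, 10, 10), (20, 20, 5, 5)) == 0.0  # disjoint boxes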

Testing the panel extraction is done using comic strips from three different sources: Dilbert, PHD Comics, and Garfield¹. The test dataset consisted of 300 comics, containing 1118 panels. The results of the evaluation are presented in Table Ib.

Overall, the proposed panel extraction algorithm achieves almost perfect results on Dilbert and PHD Comics — leveraging the presence of the frames enables outperforming Kumiko. The performance on Garfield is noticeably worse, as no frames are present for some of the panels, making it harder to find the panel boundaries — see Fig. 9b for an example.

¹Panel extraction for Garfield is performed using global threshold binarization rather than adaptive binarization, as it performed better when no frames are present.
Fig. 9: Examples of panel extraction results — the detected panels are represented by the green areas. (Comic strips from Garfield [1])

(a) Correct segmentation example: all three panels are detected correctly.

(b) Incorrect segmentation example: the middle panel is not detected correctly, as there is no clear border around it.

When it comes to efficiency, our algorithm is significantly faster than Kumiko, making it more suited for processing large datasets. It also has a significant advantage over deep-learning-based panel detection techniques - no dataset-specific training is needed, so the method can be directly applied to any other comic.

D. Text extraction

The evaluation of text extraction is performed by comparing ground-truth, manual transcriptions of comic strips with the output of the automated transcription using OCR. The evaluation is conducted on the Garfield, Dilbert, and PHD Comics datasets, containing 500, 500, and 100 annotated comic strips respectively. The Garfield and Dilbert transcriptions are available online in Alfred Arnold's transcription archive [31], and the PHD Comics annotations are obtained via manual transcription.

The Levenshtein distance, also known as the edit distance, is used as the primary metric for text extraction evaluation. Given two strings, the Levenshtein distance is defined as the minimal number of single-character edits (insertions, deletions, substitutions) needed to change one of the strings into the other:

L_dist(s_t, s_d) = C_ins(s_t, s_d) + C_del(s_t, s_d) + C_sub(s_t, s_d)    (2)

where s_t is the ground-truth string and s_d is the detected string. The distance can be normalized by dividing by the length of the ground-truth string:

L_dist_norm(s_t, s_d) = min(1, L_dist(s_t, s_d) / |s_t|)    (3)

Given that the comics text is almost always capitalized, the evaluation of the transcriptions is performed in a case-insensitive manner — the strings are converted to lowercase before comparison. Therefore, no distinction is made between lowercase and uppercase letters; for example, "CAT" and "cat" are treated as the same string, with distance 0.
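For reference, a straightforward implementation of this metric is sketched below; the experiments may well have used a library implementation instead.

def levenshtein(s: str, t: str) -> int:
    """Minimal number of single-character edits turning s into t (Eq. 2)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def normalized_distance(truth: str, detected: str) -> float:
    """Case-insensitive, length-normalized distance capped at 1 (Eq. 3)."""
    truth, detected = truth.lower(), detected.lower()
    if not truth:
        return float(bool(detected))
    return min(1.0, levenshtein(truth, detected) / len(truth))

assert normalized_distance("CAT", "cat") == 0.0  # case-insensitive match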
To evaluate the impact of this paper's contribution to comic dialogue transcription, a baseline scenario is established: feeding the entire comic strip into the OCR engines, without pre-processing or dividing it into panels, denoted as Exp. #1 in Table IIa. As presented in Table IIb, the results achieved using the baseline approach are extremely inaccurate, making them completely unusable. Two primary reasons for the failure are:

1) The OCR picks up a lot of text from outside the actual illustrations. That text is not part of the dialogues; it mostly contains other information, such as publication dates, the comic artist's name, or website URLs.
2) As there is no information about panel division in this experiment, the OCR engines struggle with determining the correct order of the output — e.g. some text from the second panel can appear before some parts of the text from the first panel.

The first major improvement to this scenario is experiment #2 from Table IIa, where instead of scanning the whole image at once, the OCR is performed separately on each panel and the results are then concatenated. We can observe a significant decrease in error rates. All the later experiments are conducted on separate panels, rather than on the full comic strip.

In the next two experiments — #3 and #4 from Table IIa — the impact of the pre-processing techniques is evaluated. As presented in Table IIb, adding an up-scaling step (#3) has a minor but positive impact on the performance — especially in the case of PHD Comics, where the initial image resolution is low for some of the older strips. The results of experiment #4 indicate that adding a binarization step has a slightly negative impact on the outcome, contrary to general OCR pre-processing recommendations from the literature. Based on these evaluation results, in the later experiments the binarization step is skipped and only the up-scaling step is applied.
TABLE II: Text extraction evaluation - experiments on 500 Dilbert, 500 Garfield, and 100 PHD Comics strips with ground-truth strings obtained via manual transcription.

(a) Experiment setup: six experiments are conducted to evaluate text extraction. The experiments test OCR on full and segmented strips, using the proposed pre-processing (re-sizing, binarization) and post-processing (clustering, autocorrect) techniques.

Exp. no.   Segmentation   Re-sizing   Binarization   Clustering   Autocorrect
Exp. #1    no             no          no             no           no
Exp. #2    yes            no          no             no           no
Exp. #3    yes            yes         no             no           no
Exp. #4    yes            yes         yes            no           no
Exp. #5    yes            yes         no             yes          no
Exp. #6    yes            yes         no             yes          yes

(b) Experiment results: normalized Levenshtein distance between detected and ground-truth transcriptions. Comparison of Vision API and Tesseract OCR on the Dilbert, Garfield, and PHD Comics datasets.

           Dilbert           PHD Comics         Garfield
           Tess.    V. API   Tess.     V. API   Tess.    V. API
Exp. #1    0.68     0.61     0.731     0.538    0.650    0.381
Exp. #2    0.233    0.044    0.786     0.109    0.532    0.163
Exp. #3    0.222    0.044    0.698     0.104    0.501    0.159
Exp. #4    0.242    0.048    0.727     0.112    0.534    0.150
Exp. #5    0.188    0.032    0.699     0.075    0.468    0.120
Exp. #6    0.276    0.097    0.694     0.0781   0.485    0.121

Experiment #5 evaluates the impact of adding a bounding-box clustering step on the OCR performance. Table IIb shows a significant positive impact on the accuracy — clustering reduces the error rates by up to 30%. This shows that a significant fraction of the errors is caused by the wrong ordering of the output words, due to a lack of information about the comic speech bubbles. Clustering fixes that issue for most data points.

Finally, experiment #6 from Table IIa evaluates the impact of auto-correcting the OCR output on the extraction error. Intuitively, one could expect some improvement from dictionary-based correction, but the results in Table IIb show an opposite effect. One explanation for this could be that comics contain a lot of names, onomatopoeias, and exclamations — such as "Dilbert", "Woo", or "Pow" — which are not present in the dictionary and get mistakenly corrected into other words.

Overall, the final performance of the OCR is satisfactory, but not perfect. Vision API performs better than Tesseract in all cases. Tesseract completely fails on PHD Comics and Garfield; in a large part of those comics it does not detect any text at all. Panel separation and clustering have a significant, positive impact on the performance, but the other elements of the proposed method do not bring improvement. The best error rates of 0.03 on Dilbert, 0.07 on PHD Comics, and 0.12 on Garfield give a solid base for automatic transcription, but in the current state the transcription would most likely still have to be corrected by a human.

V. DISCUSSION, CONCLUSIONS AND FUTURE WORK

The purpose of this paper was to design, implement, and evaluate an automated dataset construction pipeline for building an illustration-transcription comics dataset. To do so, web scraping was applied to automatically download the comic strips, image processing techniques were utilized to divide the comics into individual illustrations, and OCR was used to automatically generate transcriptions.

The web scraping technique proved successful in the experiments. We were able to download thousands of Garfield, Dilbert, and PHD Comics strips, and thanks to the multi-threaded implementation, we achieved an average rate of between 5 and 20 comics per second.

The proposed panel extraction method performed very well on Dilbert and PHD Comics, achieving 100% and 97% success rates respectively and outperforming the baseline algorithm in terms of both accuracy and efficiency. Unfortunately, it failed to detect a significant fraction of the Garfield panels, achieving only a 71% success rate. The source of the errors was the lack of clear panel boundaries in some of the strips. The proposed algorithm used contour detection and thresholding to detect the panel boundary, therefore it dealt flawlessly with panels that had frames, but struggled when no frame was present.

The automatic text transcription was the most challenging stage of the process, as the existing OCR solutions performed poorly in their out-of-the-box state. Moreover, the proposed OCR pre-processing methods, such as binarization and up-scaling, brought no significant improvement to the performance. However, performing OCR on individual illustrations and correcting the order of the output by grouping the text bounding boxes into speech bubbles reduced the error rates by a factor of 7. It was thought that performing dictionary autocorrect on the OCR output would bring further improvement, but the effect was the opposite, possibly due to the presence of non-dictionary words, like onomatopoeias and exclamations, in the comic dialogues. Overall, the final OCR output is fairly close to the true transcriptions, with normalized Levenshtein distances of 0.03, 0.07, and 0.12 for Dilbert, PHD Comics, and Garfield respectively.

In general, the proposed pipeline can successfully construct a dataset of comic illustrations and transcriptions for most data points, but the output still contains an observable amount of errors. Part of the errors can be attributed to mistakes in segmenting the comic strips where no clear panel frames are present. It could be beneficial to conduct further research to develop a solution that can deal with such cases. The majority of the errors occur at the text extraction stage - there might be a need to construct human-in-the-loop software, including a tool for manual correction of the output transcriptions. Another possibility for improvement is to experiment with dataset-specific training of the OCR models on a small, manually annotated dataset. Finally, text-region detection algorithms, such as the EAST text detector [32], could be used to detect candidate text areas and feed those into the OCR pipeline instead of the whole comic strips, potentially resulting in better OCR accuracy.
VI. RESPONSIBLE RESEARCH

The following three subsections reflect on the ethical and legal implications of this research project, and on the reproducibility of the results achieved in the experiments.

A. Copyright issues

The data used for the experiments was obtained via web scraping, which is a topic of debate concerning legal issues such as copyright and privacy violations [33]. There is no risk of privacy violations in the case of this research project — all of the data points are publicly available artworks, created to be shared with a wide audience. However, copyright violation is a real threat: the comic strips from PHD Comics and Dilbert are the intellectual property of their creators and cannot be freely distributed by a third party. Therefore, to avoid any concerns regarding copyright violation, the datasets used for the experiments will not be published. Some of the PHD Comics strips are still used in the paper to illustrate the methods and experiments, but this kind of usage is explicitly listed as permitted on the PHD Comics website [34].

B. Potential software misuse

Another possible ramification of this project is the potential misuse of the published software. For instance, the proposed scraping mechanism could be used to mass-download copyrighted comics, which could then be republished illegally on a third-party website. Moreover, the text removal method implemented in the project could be used to clear the existing dialogues from a comic and add new ones, creating an alternative story. This usually goes against the comic publisher's regulations, as it involves producing derivative work from copyrighted content. In general, software misuse cannot be fully prevented, but an explicit warning regarding this topic is included in the software documentation.

C. Reproducibility

Reproducibility of results is a crucial aspect of research, but, unfortunately, it is often overlooked by scientists [35], also in the computer science field [36]. To ensure the reproducibility of the results achieved in this research, the source code will be published as open-source software on github.com², along with a usage guide. This way, the experiments mentioned in the paper can be easily repeated by interested parties and compared with new research. Moreover, the code can also be forked and modified to be used in a different context, or improved with new ideas.

²Project repository: https://github.com/mstyczen/comic-dcp

REFERENCES

[1] Garfield. URL: https://garfield.com/.
[2] PHD Comics. URL: http://phdcomics.com/.
[3] Dilbert. URL: https://dilbert.com/.
[4] Faizan Ahmad, Aaima Najam, and Zeeshan Ahmed. "Image-based face detection and recognition: 'state of the art'". In: arXiv preprint arXiv:1302.6379 (2013).
[5] Ian J. Goodfellow et al. "Generative adversarial networks". In: arXiv preprint arXiv:1406.2661 (2014).
[6] Han Zhang et al. "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5907-5915.
[7] Scott Reed et al. "Generative Adversarial Text to Image Synthesis". In: Proceedings of The 33rd International Conference on Machine Learning. Ed. by Maria Florina Balcan and Kilian Q. Weinberger. Vol. 48. Proceedings of Machine Learning Research. New York, New York, USA: PMLR, 2016, pp. 1060-1069. URL: http://proceedings.mlr.press/v48/reed16.html.
[8] Rada Mihalcea, Carlo Strapparava, and Stephen Pulman. "Computational models for incongruity detection in humour". In: International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 2010, pp. 364-374.
[9] Yuji Roh, Geon Heo, and Steven Euijong Whang. "A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective". In: IEEE Transactions on Knowledge and Data Engineering 33.4 (2021), pp. 1328-1347. DOI: 10.1109/TKDE.2019.2946162.
[10] Hillary Sanders and Joshua Saxe. "Garbage in, garbage out: How purportedly great ML models can be screwed up by bad data". In: Proceedings of Blackhat 2017 (2017).
[11] Cem Dilmegani. Current State of OCR: Is it a solved problem in 2021? 2021. URL: https://research.aimultiple.com/ocr-technology/.
[12] Jamileh Yousefi. "Image binarization using Otsu thresholding algorithm". In: Ontario, Canada: University of Guelph (2011).
[13] William H. E. Day and Herbert Edelsbrunner. "Efficient algorithms for agglomerative hierarchical clustering methods". In: Journal of Classification 1.1 (1984), pp. 7-24.
[14] H. Ng and S. Winkler. "A data-driven approach to cleaning large face datasets". In: 2014 IEEE International Conference on Image Processing (ICIP). 2014, pp. 343-347. DOI: 10.1109/ICIP.2014.7025068.
[15] Ryan Mitchell. Web Scraping with Python: Collecting More Data from the Modern Web. O'Reilly Media, Inc., 2018.
[16] Van Nguyen Nhu, Christophe Rigaud, and Jean-Christophe Burie. "What do We Expect from Comic Panel Extraction?" In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). Vol. 1. 2019, pp. 44-49. DOI: 10.1109/ICDARW.2019.00013.
[17] Toru Ogawa et al. "Object detection for comics using Manga109 annotations". In: arXiv preprint arXiv:1803.08670 (2018).
[18] Anh Khoi Ngo Ho, Jean-Christophe Burie, and Jean-Marc Ogier. "Panel and speech balloon extraction from comic books". In: 2012 10th IAPR International Workshop on Document Analysis Systems. IEEE, 2012, pp. 424-428.
[19] Xufang Pang et al. "A Robust Panel Extraction Method for Manga". In: Proceedings of the 22nd ACM International Conference on Multimedia. MM '14. Orlando, Florida, USA: Association for Computing Machinery, 2014, pp. 1125-1128. ISBN: 9781450330633. DOI: 10.1145/2647868.2654990. URL: https://doi.org/10.1145/2647868.2654990.
[20] Kumiko. URL: https://github.com/njean42/kumiko/.
[21] Christophe Rigaud et al. "Toward speech text recognition for comic books". In: Proceedings of the 1st International Workshop on coMics ANalysis, Processing and Understanding. 2016, pp. 1-6.
[22] Christophe Ponsard, Ravi Ramdoyal, and Daniel Dziamski. "An OCR-Enabled Digital Comic Books Viewer". In: Computers Helping People with Special Needs. Ed. by Klaus Miesenberger et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 471-478. ISBN: 978-3-642-31522-0.
[23] Image Thresholding. URL: https://docs.opencv.org/master/d7/d4d/tutorial_py_thresholding.html.
[24] Tesseract. URL: https://github.com/tesseract-ocr/tesseract.
[25] Google Vision API. URL: https://cloud.google.com/vision/docs/ocr.
[26] Wojciech Bieniecki, Szymon Grabowski, and Wojciech Rozenberg. "Image Preprocessing for Improving OCR Accuracy". In: 2007 International Conference on Perspective Technologies and Methods in MEMS Design. 2007, pp. 75-80. DOI: 10.1109/MEMSTECH.2007.4283429.
[27] Agglomerative clustering - scikit-learn documentation. URL: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html.
[28] TextBlob: Simplified Text Processing. URL: https://textblob.readthedocs.io/en/dev/index.html.
[29] Peter Norvig. How to Write a Spelling Corrector? URL: http://norvig.com/spell-correct.html.
[30] Jaccard index. May 2021. URL: https://en.wikipedia.org/wiki/Jaccard_index.
[31] Alfred Arnold's FTP server with comic transcriptions. URL: http://john.ccac.rwth-aachen.de:8000/ftp/dilbert/.
[32] Xinyu Zhou et al. "EAST: an efficient and accurate scene text detector". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5551-5560.
[33] Vlad Krotov and Leiser Silva. "Legality and ethics of web scraping". In: (2018).
[34] PHD Comics: About. URL: https://phdcomics.com/about.php.
[35] Monya Baker. "Reproducibility crisis". In: Nature 533.26 (2016), pp. 353-66.
[36] Matthew Hutson. Artificial intelligence faces reproducibility crisis. 2018.
