
Journal of Ambient Intelligence and Humanized Computing (2022) 13:1853–1865

https://doi.org/10.1007/s12652-021-02951-1

ORIGINAL RESEARCH

An end to end system for subtitle text extraction from movie videos
Hossam Elshahaby · Mohsen Rashwan

Received: 30 April 2020 / Accepted: 4 February 2021 / Published online: 22 February 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH, DE part of Springer Nature 2021

Abstract
We present a new technique for detecting text inside a complex graphical background, extracting it, and enhancing it so that it can be easily recognized by an optical character recognition (OCR) engine. The technique uses a deep neural network to extract features and classify each frame as containing text or not. An error handling and correction (EHC) technique resolves classification errors, and a multiple frame integration (MFI) algorithm extracts the graphical text from its background. Text enhancement is performed by adjusting contrast, reducing noise, and increasing pixel resolution. A standalone Component-Off-The-Shelf (COTS) software package is used to recognize the text characters and to qualify the system performance. The proposed solution generalizes to multilingual text, and a newly created dataset containing videos in different languages is collected for this purpose to serve as a benchmark. A new HMVGG16 convolutional neural network (CNN) used to classify frames as text-containing or non-text-containing reaches an accuracy of 98%. The introduced system achieves a weighted average caption extraction accuracy of 96.15%, and the correctly detected characters (CDC) average recognition accuracy using the Abbyy SDK OCR engine is 97.75%.

Keywords Graphical text · Films application · Multimedia · Text detection · Text extraction · Subtitles frame

1 Introduction

Computer vision plays an important role in facilitating our daily-life problems. Machine learning has become one of the main players in computer vision; it is used to replace human beings and to solve different problems such as autonomous driving, object detection, object recognition, banking transactions, smart cities, and property-intrusion detection, using image processing, artificial intelligence, pattern recognition, the Internet of Things (IoT), blockchain, physics, mathematics, and signal processing. Digitalization is an essential step towards achieving computer vision: analog inputs have to be converted into digital form so that computing units can process them. For instance, a book has to be scanned before OCR can be performed, and a video has to be available in a digital format such as Moving Picture Experts Group (MPEG) for the various image recognition and speech recognition applications. There are many applications where we can apply computer vision, for instance bank trading, autonomous driving, factory robots, and OCR, as shown in Fig. 1. We focus our work on extracting and enhancing the subtitle text in film multimedia applications, which helps in better and faster film analysis.

2 Motivation

We focus on multimedia applications and specifically on foreign film subtitles, which can help viewers build an idea about the film's main story, goal, and content. When converted to plain text, this extracted text can be used to classify the film automatically as tragedy, comedy, science fiction, action, etc. We can also build an idea about how a film influences the audience, for example whether its impact is social or psychological. This can be performed using pattern-based networks (like deconvolutional neural networks) or feature-based networks (like generative adversarial networks). Our contribution is, first, text extraction with significantly high performance; second, a generalization tested with four different languages; third, placing the plain text from the OCR in a text file for further processing by film critics and the audience.

* Hossam Elshahaby, [email protected]
Mohsen Rashwan, [email protected]
Cairo University, Giza, Egypt
Fig. 1  Computer vision applications: a bank trading, b autonomous driving, c factory robots, d OCR

3 Related published work

Separation of text from its complex graphical background is one of the challenging topics for researchers. It contains several sub-problems, including feature extraction, feature matching, text extraction, and character recognition. In some cases, like the film application, the background is moving, which pushes researchers to search for algorithms that can help resolve this problem. In this section, we review several papers that addressed this topic in the past.

Lu and Wang (2019) position the multimedia video text for subtitle frames automatically. They improve the detection precision and the efficiency of multimedia video information. Their method uses the independent video captioning (ICA) feature, which covers the high-order correlation of multimedia video data. An adaptive iterative localization algorithm is used for text localization. Detection results using this method are a row recall of 90.1%, row precision of 95.2%, block recall of 89.5%, and block precision of 93%, while detection results based on multi-modal fusion are a row recall of 76.9%, row precision of 86.9%, block recall of 74.5%, and block precision of 84.9%.

Haq et al. (2019) use object detection to perform movie scene segmentation using a convolutional neural network (CNN). The first step segments the input movie into shots, the second step detects objects in the segmented shots, and the third step performs object-based shot matching for detecting the scene boundaries. They gather texture and shape features for shot segmentation and apply set theory with a sliding-window approach to integrate the same shots in order to decide scene boundaries. This is useful for movie trailer generation. The system showed a correct detection of 86.95%.

Yang et al. (2018) detect Chinese text captions from web videos using fully convolutional neural networks (FCN). The FCN makes predictions for text-line existence without a segmentation algorithm for words or characters, and the predictions are input to a character classifier. They fixed the drawbacks of the proposed method and removed intermediate steps using a pixel-wise classification approach. For text detection, they first sort all the word and character boxes by their y coordinates and calculate whether there is a gap distance between two adjacent boxes. Second, they apply a threshold to separate the text-line boxes into groups according to the gap distances. Third, they group all word-box groups using a boundary box. They achieved 21 frames per second on a GTX1080. The proposed system does not cover multi-oriented text detection. On dataset 1 they obtained a precision of 86.13%, recall of 92.85%, and F1-score of 89.36%; on dataset 2, a precision of 87.31%, recall of 92.62%, and F1-score of 89.88%; and on dataset 3, a precision of 85.11%, recall of 91.7%, and F1-score of 88.28%.

Zhang et al. (2016) proposed a method for scene text detection. First, they generate the text prediction maps and geometric approaches for inclined proposals using a fully convolutional network (FCN). Second, they combine the salient map and character components to estimate the text-line hypotheses. Last, they predict the centroid of each character using another FCN classifier to remove the false hypotheses. The method handles text in multiple languages and fonts, consistently achieves good performance on three text detection benchmarks, and detects multi-oriented scene text.

Hoang and Tabbone (2010) focus on text extraction from graphical images with a complex background. They advance sparse representation with two chosen discriminative complete dictionaries based on the Undecimated Wavelet Transform and the Curvelet Transform, and they use morphological component analysis (MCA). The method gives better text extraction from graphics and does not depend on text style, size, or orientation. Text extraction accuracy is 93.75%.

Audithan and Chandrasekaran (2009) proposed a quick method for text extraction using the Haar Discrete Wavelet Transform for edge detection with the Canny detector. The Canny detector extracts stroke information, non-edges are removed using a thresholding technique, and text edges are connected using a morphological operator. A line feature vector was created based on an edge map and filtered. Text extraction accuracy is 94.76%.

Grover et al. (2009) detect colored text from a complex background. The system is independent of contrast. Using the Sobel edge detector, they perform convolution with its mask and the image after converting the RGB image to grayscale. They eliminate non-maxima and apply thresholding for weak edges. The edge image is divided into non-overlapping blocks depending on the image resolution. Block classification using the defined threshold then separates text from non-text. Text extraction accuracy is 94.80%.

Jung and Kim (2004) built a hybrid learning mechanism with an artificial neural network-based approach (ANNs) and a non-negative matrix factorization-based filtering (NNMF) approach to extract text from complex images. A multilayer perceptron neural network classifier increased the recall and precision rates using NMF-filtering-based analysis to obtain connected components (CC). Text detection was performed using neural networks without a feature extraction stage. MLPs automatically generated a texture classifier that discriminates text regions from non-text regions based on three color bands, and they learn a precise boundary between text and non-text classes using the bootstrap method. They used CC-based filtering with the NMF technique in order to overcome the locality property of the texture-based method. Processing time was reduced using CAMShift for the video images and an X-Y recursive cut algorithm for the document images. Text extraction accuracy is 91.88%.

Fig. 2  Stepwise methodology for text detection/recognition

4 Problem definition

Films contain subtitle text that, when correctly recognized, is important for various applications. The main challenge, as shown in Fig. 2, is to extract the text, enhance it as in Sect. 6.4, and increase its resolution so that the OCR engine can easily segment and recognize the characters.

In the foreign films application, the challenge is to detect and extract the text from its complex graphical background, as shown in Fig. 3; however, the resolution quality of the text is high. Many powerful OCRs are available to researchers, such as the Abbyy Software Development Kit (SDK), Tesseract, etc. They can help as standalone tools that support most languages, common themes, fonts, and sizes.

Fig. 3  Imagery text: a graphical text input frame from a movie, b output result for a movie frame

5 What is imagery text?

To be able to detect text correctly, we need to define it. Imagery text is a group of edges, strokes, connected components (CCs), and texture found in image format (Ye and Doermann 2015). Normally, text related to a certain language is composed of several characters which have well-known patterns (groups of features). Imagery text has several types, as shown in Fig. 4: caption text (overlay text or cut-line text), graphical text, point-and-shoot text taken by the camera, and incidental scene text.

6 Text extraction methodologies

In order to extract text from the graphical background, we first need to process the frame, an array of unsigned integers of dimensions (h, w, 3) where h is the height and w is the width, then locate the text, verify that location, and extract the text.
Fig. 4  Video text types

6.1 Text image localization

The video frames can be read one by one. Assuming that the subtitled text is usually placed in the lower third of the frame, we restrict the frame processing to this part to increase performance and accuracy while decreasing the system's overall processed data. Cropping the frame is done automatically, without user intervention, when this option is selected from our graphical user interface (GUI). There are other options to select the middle, top, or full frame in case the text exists there.
full-frame just in case, the text is existing there. The standard exponential is applied to each y­ i element
of the input vector ‘y’, then the values are normalized by
6.2 Text image verification dividing by the sum of these exponentials so that the sum
of the components of the output vector 𝜎(y) is equal to 1.
Features extraction mainly consists of two stages which The cross-entropy loss function E (θℓ) can be defined as:
are feature selection then features dimensionality reduc- E(𝜃𝓁 ) = −(z log(p) + (1 − z) log(1 − p)), (3)
tion. In literature features selection can be done using state
vector machine (SVM), decision tree (DT), autoencoder, where ‘ℓ’ is the iteration number, ‘θ’ is the parameter vector,
etc., while features dimensionality reduction can be done ‘z’ is a binary indicator (0 or 1) if class label c is the correct
using principal component analysis (PCA), isomap, mul- classification of the observation o then z equals 1, and ‘p’ is
tilinear subspace learning, etc. We employed two different the predicted probability observation o of class c.
methods on our application that could show better results
than others and compared them. The first one is using a
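As a quick check of Eqs. (2) and (3), the following NumPy sketch evaluates the Softmax and the binary cross-entropy on toy logits; the max-shift for numerical stability and the example values are our own additions, not part of the paper.

```python
import numpy as np

def softmax(y):
    """Eq. (2): normalized exponentials of the k logits in y."""
    e = np.exp(y - np.max(y))        # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, z):
    """Eq. (3): binary cross-entropy for predicted probability p and label z."""
    return -(z * np.log(p) + (1 - z) * np.log(1 - p))

# toy example: two-class logits for one frame ("text" vs "non-text")
logits = np.array([2.0, 0.5])
probs = softmax(logits)              # components sum to 1
loss = cross_entropy(probs[0], z=1)  # assume class 0 ("text") is correct
print(probs, loss)
```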

Table 1  Text or non-text classification experiments comparison using various neural nets

Network         Accuracy of test data   Average precision   Average recall   Average f-score   Training time (h)   Classification time/frame (ms)
HMLe8Net        0.9688                  0.9695              0.9668           0.9681            0.2                 1.65
HMAlex12Net     0.9685                  0.9665              0.9689           0.9676            1.8                 6.31
HMZF13Net       0.9794                  0.9769              0.9808           0.9787            1.1                 3.75
HMVGG16Net      0.9804                  0.9773              0.9827           0.9797            17.6                20.41
HMGoogLe24Net   0.9695                  0.9705              0.9673           0.9688            2.8                 9.21
HMRes21Net      0.9758                  0.9727              0.9776           0.9750            0.5                 8.97

The HMVGG16Net row is highlighted (bold in the original) because this designed neural network has the highest performance regarding accuracy, precision, recall, and f-score.

Both methods can be combined in a hybrid way to provide a robust solution for the problem.

6.3 Text image extraction

Features such as a pattern or a distinctive structure can be found in an image, for example a point, line, edge, or patch. Useful features are repeatably detectable, distinctive, and localizable. Using the geometrical method:

1. Compute the stroke width metric (Ye and Doermann 2015).
2. Threshold the stroke width variation metric.
3. Process/remove regions based on this metric.
4. Determine the good candidate boundary boxes.
5. Compute their overlap ratio.
6. Find the connected text and merge it.

By performing this method on the Avengers infinity war video, we got an accuracy of 54.55%, as in Table 6. Using various new deep learning networks for feature extraction, to identify whether a frame has text or not, namely HMLe8, HMAlex12, HMZF13, HMVGG16, HMGoogLe24, and HMRes21, we got better results than with the geometrical method. We performed a statistical significance test for all models, as illustrated in Table 1, for the investigated samples. We chose HMVGG16 from all the DNNs as it has the best results concerning classification accuracy, recall, precision, and f-score, although it consumes more pre-training time (which is done offline) and more processing time than the other models.

The HMVGG16 network model is illustrated in Fig. 5, where all its convolutional layers have a filter size of 3*3, a stride of 1, and padding of 1. It contains Convolutional_1 and Convolutional_2 layers with a depth of 128; Max. Pooling_3 with a filter size of 2*2 and a stride of 2; Convolutional_4, Convolutional_5, and Convolutional_6 layers with a depth of 256; Max. Pooling_7 with a filter size of 2*2 and a stride of 2; Convolutional_8, Convolutional_9, and Convolutional_10 layers with a depth of 512; Max. Pooling_11 with a filter size of 2*2 and a stride of 2; a fully connected_12 layer with 1024 neurons; a fully connected_13 layer with 1024 neurons; a fully connected_14 layer with 2 neurons; a Softmax_15 layer; and a classification_16 layer, as in Fig. 5. The HMVGG16 network model thus consists of 16 layers. We used the stochastic gradient descent with momentum (SGDM) optimizer with an initial learning rate (α) of 0.0001 and a mini-batch size of 64. SGDM with a momentum of 0.9 is used to update the network parameters (weights and biases) and minimize the loss function by taking small steps at each iteration in the direction of the negative gradient of the loss; i.e. the algorithm can oscillate along the path of steepest descent towards the optimum, and using the momentum this oscillation is reduced, as in Eq. (4):

θ_{ℓ+1} = θ_ℓ − α∇E(θ_ℓ) + γ(θ_ℓ − θ_{ℓ−1}),  (4)

where α > 0 is the learning rate and γ determines the contribution of the previous gradient step to the current step.

Fig. 5  HMVGG16 neural net for text or non-text classification
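The layer listing above can be sketched, for illustration only, as the following PyTorch model. The paper trains in Matlab and does not state the input crop size or the hidden activations, so the 64 x 64 input, the ReLU nonlinearities, and the flattened size of 512*8*8 are assumptions of ours; the learning rate of 0.0001, momentum of 0.9, mini-batch of 64, and the two-class Softmax/cross-entropy output follow the text.

```python
import torch
import torch.nn as nn

def conv(in_c, out_c):
    # every convolution in the described model: 3x3 kernel, stride 1, padding 1
    return nn.Sequential(nn.Conv2d(in_c, out_c, 3, stride=1, padding=1),
                         nn.ReLU(inplace=True))

# Layer numbering follows the description in the text; the Softmax_15 /
# classification_16 pair is folded into CrossEntropyLoss, as usual in PyTorch.
model = nn.Sequential(
    conv(3, 128), conv(128, 128),                       # Convolutional_1, _2
    nn.MaxPool2d(2, 2),                                 # Max. Pooling_3
    conv(128, 256), conv(256, 256), conv(256, 256),     # Convolutional_4-6
    nn.MaxPool2d(2, 2),                                 # Max. Pooling_7
    conv(256, 512), conv(512, 512), conv(512, 512),     # Convolutional_8-10
    nn.MaxPool2d(2, 2),                                 # Max. Pooling_11
    nn.Flatten(),
    nn.Linear(512 * 8 * 8, 1024), nn.ReLU(inplace=True),  # fully connected_12 (assumes 64x64 input)
    nn.Linear(1024, 1024), nn.ReLU(inplace=True),         # fully connected_13
    nn.Linear(1024, 2),                                    # fully connected_14: text / non-text
)

criterion = nn.CrossEntropyLoss()                          # Softmax + cross-entropy of Eq. (3)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)  # SGDM of Eq. (4)

x = torch.randn(64, 3, 64, 64)          # one mini-batch of 64 cropped frames (assumed size)
labels = torch.randint(0, 2, (64,))
optimizer.zero_grad()
loss = criterion(model(x), labels)
loss.backward()
optimizer.step()
```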

Using the flow chart of the proposed system in Fig. 6, the text extraction of text captions can be performed using their first, middle, and last appearances. The first-appearance frame results from subtracting the first frame with text from the one preceding it without text. The last-appearance frame results from subtracting the last text frame from the one just after it. The middle frame is extracted by multiplying the successive binary middle frames to get a resultant middle frame. For the binary frame data, '0' represents black pixels while '1' represents white pixels. The algorithm processes the data coming from these three frames to get the best resultant frame, which is clear and clean text extracted from the complex movie graphical background.

Fig. 6  Films proposed system flow chart
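The first/middle/last integration just described can be sketched in NumPy as below; the binarization threshold, the use of absolute differences, and the final majority vote are illustrative assumptions, since the paper does not spell out how the three resultant frames are combined.

```python
import numpy as np

def binarize(gray, thresh=40):
    """'1' = white pixels, '0' = black pixels (threshold is an assumption)."""
    return (gray > thresh).astype(np.uint8)

def multi_frame_integration(before, first, middles, last, after):
    """Illustrative sketch of the MFI idea on grayscale subtitle-band crops.

    - first appearance: difference between the first text frame and the
      preceding frame without text
    - last appearance:  difference between the last text frame and the
      frame just after it
    - middle: product of the successive binarized middle frames, which keeps
      only pixels that stay white (the static subtitle)
    """
    first_app = binarize(np.abs(first.astype(int) - before.astype(int)))
    last_app = binarize(np.abs(last.astype(int) - after.astype(int)))
    middle = np.ones_like(binarize(middles[0]))
    for m in middles:
        middle &= binarize(m, thresh=128)
    # combining the three estimates by majority vote is our own choice here
    return ((first_app + last_app + middle) >= 2).astype(np.uint8)
```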

Tracking the points in films which represent the static subtitled text is done using the Kanade-Lucas-Tomasi (KLT) algorithm (Ye and Doermann 2015) as a feature-tracking algorithm. It works particularly well for tracking objects that do not change shape and for those that exhibit visual texture, which is why it satisfies our need here. Initial points for tracking the graphical subtitled text caption are obtained using the minimum eigen features (MEF) algorithm (Shi and Tomasi 1994).

Using the forward-backward error threshold as an automatic algorithm for the detection of tracking failures, the tracker tracks each point from the previous frame to the current frame. Then, it tracks the same points back to the previous frame, as seen in Fig. 7. The object calculates the bidirectional error: this value is the distance in pixels from the original location of the points to the final location after the backward tracking. The corresponding points are considered invalid when the error is greater than the value set for this property (Ye and Doermann 2015).

Fig. 7  Bidirectional error method

This method extends the range of the estimate of the disparity h by smoothing the images. In order to combine the various values of h, we simply average them:

h ≈ [ Σ_x ( G(x) − F(x) ) / F(x) ] / [ Σ_x 1 ],  (5)

h_{k+1} = h_k + [ Σ_x w(x) ( G(x) − F(x + h_k) ) / F(x + h_k) ] / [ Σ_x w(x) ],  (6)

where F(x) denotes the pixel values at each location x in the image, G(x) the pixel values at each location x in the next image, w(x) a weighting function, and F(x + h_k) the pixel values at each location (x + h_k) in the image. The sequence of estimates of h (Ye and Doermann 2015) will converge to the best h.

Using the bidirectional error through successive frames is an effective method to eliminate points that could not be tracked in a reliable way: we keep only the points with the highest value for the system. However, the bidirectional error requires additional computation.

By applying this KLT algorithm to track static pixels through the successive frames, calculating the bidirectional error with the predefined forward-backward error threshold, we can get an output result as shown in Fig. 8. It is obvious that the algorithm successfully tracked the static pixels representing the text; it accurately creates points showing the location of the subtitled text. The algorithm has a stabilizing effect on the points while tracking them through the different frames where they exist. This performs text stabilization, which accordingly enhances the quality of the text extracted from the films.

Fig. 8  Imagery text: a original text input frame from a movie, b output result for pixels tracked over successive movie frames
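A possible OpenCV realization of the point selection and forward-backward check described above is sketched below; cv2.goodFeaturesToTrack uses the Shi-Tomasi minimum-eigenvalue measure and cv2.calcOpticalFlowPyrLK is a pyramidal KLT tracker, while the corner parameters and the 1-pixel error threshold are assumptions of ours.

```python
import cv2
import numpy as np

def track_with_fb_error(prev_gray, curr_gray, fb_thresh=1.0):
    """Shi-Tomasi corners + pyramidal Lucas-Kanade tracking with a
    forward-backward (bidirectional) error check.

    fb_thresh (in pixels) and the feature parameters are assumptions; the
    paper only states that points whose bidirectional error exceeds a preset
    threshold are discarded.
    """
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=5)
    # forward pass: previous frame -> current frame
    fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    # backward pass: current frame -> previous frame
    bwd, st_b, _ = cv2.calcOpticalFlowPyrLK(curr_gray, prev_gray, fwd, None)
    # bidirectional error: distance between each original point and the
    # point re-tracked back into the previous frame
    fb_err = np.linalg.norm(pts - bwd, axis=2).ravel()
    valid = (st_f.ravel() == 1) & (st_b.ravel() == 1) & (fb_err < fb_thresh)
    return fwd[valid], pts[valid]
```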
6.4 Text image enhancement

Image enhancement techniques are applied before text image localization by adjusting contrast, color intensity, noise filtering, illumination equalization, etc. They are also applied after text image extraction by using the multi-frame integration algorithm and by increasing the image resolution from 70 dots per inch (dpi) to 300 dpi.
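The enhancement stage can be illustrated with the following OpenCV sketch; the paper only names the operations (contrast adjustment, noise filtering, and upsampling from roughly 70 to 300 dpi), so the specific operators (CLAHE, non-local-means denoising, bicubic resizing) and their parameters are our own choices.

```python
import cv2

def enhance_for_ocr(text_crop, scale=300 / 70):
    """Illustrative enhancement pass applied to an extracted text crop
    before it is sent to the OCR engine."""
    gray = cv2.cvtColor(text_crop, cv2.COLOR_BGR2GRAY)
    gray = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(gray)  # adaptive contrast
    gray = cv2.fastNlMeansDenoising(gray, None, h=10)                       # noise reduction
    return cv2.resize(gray, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC)                        # ~70 dpi -> ~300 dpi
```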
7 Error handling and correction

This stage handles the neural network classification errors, which may misclassify a text frame as a non-text frame or vice versa, in addition to the human translation errors made by unprofessional translators, namely adding successive different texts without text-free frames in between. One of the popular techniques to be used is the correlation-based technique. It can be used to decide whether a frame is truly a non-text frame or whether it is an error, resolving classification errors. It can also be used to decide whether a frame truly shows the same text or a totally new one, resolving translation issues, as shown in Fig. 9.

Fig. 9  Classification error corner case handling
13
8 Dataset

A newly self-created dataset called "FiViD" (http://tc11.cvc.uab.es/datasets/FiViD_1) is collected with different fonts, styles, sizes, etc. to evaluate the full system, as in Table 4, and for training and testing purposes, as in Tables 2 and 3. To access all our research artifacts, please visit https://github.com/helshahaby/Television-Movies-Artifacts.

Table 2  Testing dataset for text classification

Film                    # of frames   Dimensions (width × height)
24 h to live            3254          1280 × 720
Avengers infinity war   3361          1280 × 532
Here after              3250          480 × 220
Inferno off teaser      1382          1280 × 720
Interstellar            2610          1280 × 720
Total frames            13,857

Table 3  Training dataset for text classification

Film             # of frames   Dimensions (width × height)
Wonder woman     4104          1280 × 720
Black Panther    2449          1280 × 530
Passengers       4238          1280 × 720
Assassin creed   2909          1280 × 720
Atomic blonde    4177          1280 × 720
Divergent        1706          1280 × 534
Mother           2909          640 × 350
Spider man       3787          1280 × 720
Stronger         3412          640 × 350
Suicide squad    3932          1280 × 720
Dunkirk          3922          1280 × 720
Total frames     37,545

Table 4  System evaluation dataset for text extraction

Film                    # of frames   Language
Assassin creed 2016     2726          Arabic
MorrisAusAmerika        2835          Arabic
Kong Skull Island       3286          Arabic
Avengers infinity war   3361          Arabic
Venom                   5824          Arabic
Hebrew film             3581          Hebrew
Legend of Naga Pearls   2854          Arabic Chinese
Sabrina                 2805          English
Total frames            27,272

9 Evaluation criteria

We use Matlab® for the training of the deep neural nets and their evaluation, after tuning the hyperparameters such as the learning rate, number of epochs, optimizer used, minimum batch size, etc., as in Fig. 10. Our environment has a CUDA®-enabled NVIDIA® GeForce GTX 960M GPU with 4 GB, an Intel® Core i7-6700HQ CPU at 2.6 GHz, and 16 GB RAM.

9.1 Execution performance

As in Table 5, we compared the geometrical method and two neural network methods, the HMLe8 network and the HMVGG16 network, using the Avengers infinity war video to check their processing time. HMLe8 net proved to have the lowest execution time, as it is the simplest neural network.

9.2 Text extraction performance

Using the same video as in Sect. 9.1, we compared the geometrical method and the same two deep learning methods to determine their text extraction accuracy, as in Table 6. HMVGG16 net has the highest text extraction accuracy.

The following parameters can be used to measure the system's performance. These parameters are NG, NC, NI, and NR, where:
NG: number of ground-truth graphical captions in the film.
NC: number of graphical captions correctly detected.
NI: number of graphical captions wrongly inserted.
NR: number of graphical captions wrongly missed.

Fig. 10  Deep learning neural net training in the Matlab tool

Table 5  Processing time/frame comparison

Geometrical   HMLe8Net (ms)   HMVGG16 (ms)
366 ms        59.13           115.36

Table 6  Percentage of text extraction accuracy comparison

Geometrical   HMLe8Net (%)   HMVGG16 (%)
54.55%        85             90.9

Table 7  System performance using the evaluation dataset

Film                    NG    NC   NI   NR
Assassin Creed 2016     10    10   2    0
Kong Skull Island       19    19   0    0
Avengers infinity war   27    27   1    0
Hebrew                  12    12   1    0
Legend of Naga Pearls   13    11   0    2
Sabrina                 19    19   0    0
Total                   100   98   4    2

Because the films' captions contain four different languages, we can compute the weighted mean, using Eq. (7), for all the system metrics indicated above (Table 7):
Weighted mean = ( Σ_k x_k · w_k ) / ( Σ_k w_k ).  (7)

The system metrics are calculated as:

Accuracy = Σ_k NC / Σ_k NG = 96.15%,  (8)

Precision = Σ_k NC / ( Σ_k NC + Σ_k NI ) = 93.97%,  (9)

Recall = Σ_k NC / ( Σ_k NC + Σ_k NR ) = 94.79%,  (10)

F-score = 2 / ( 1/Precision + 1/Recall ) = 94.38%.  (11)
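The following Python sketch mirrors Eqs. (7)-(11); note that the quoted percentages are language-weighted via Eq. (7), so feeding raw caption totals directly into Eqs. (8)-(11) will not reproduce them exactly. The function and argument names are ours.

```python
def caption_metrics(NC, NG, NI, NR):
    """Eqs. (8)-(11) from summed caption counts."""
    accuracy = NC / NG
    precision = NC / (NC + NI)
    recall = NC / (NC + NR)
    f_score = 2 / (1 / precision + 1 / recall)
    return accuracy, precision, recall, f_score

def weighted_mean(values, weights):
    """Eq. (7): weighted mean of per-language metric values."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)
```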
A new TextExtractor GUI is created to facilitate the analysis and evaluation of the proposed system, as shown in Fig. 11. It shows all important video information once a video is loaded by the user. The extracted text images are saved in image format and are also gathered in a Word-format file so that they can be easily accessed in one document.

Fig. 11  Text extractor GUI, a new tool for better usability

9.3 Text recognition performance

Using the Abbyy Cloud SDK as a COTS software tool with our dataset, we recognize characters from four different languages: Arabic, Latin, Hebrew, and Chinese. The overall average accuracy is equal to 97.75%.

10 Discussion

We propose two techniques to solve the problem of extracting subtitles from movie videos. The first one is geometrical, based on style, font size, and language type, while the
second one is based on a deep neural network (DNN) that classifies each frame inside the video as containing text or not. We proposed several deep neural networks: HMLe8, HMAlex12, HMZF13, HMVGG16, HMGoogLe24, and HMRes21. Using statistical analysis, we found that the most powerful network is the HMVGG16 model, which is why we chose it over all the other models. HMVGG16 better solves our problem in terms of accuracy, recall, precision, and f-score, but it has longer pre-training and processing times, while HMLe8 has lower accuracy but better timing. For HMVGG16, the learning rate (α) is adjusted to 0.0001 and the minimum batch size is 64, which is high enough to have larger gradient steps and larger variance in distance.

The system has the advantage of working with multiple languages, and it is evaluated using four different languages. In addition, some of the evaluated films have two different language translations at the same time. One challenging problem occurs when the translated text appears gradually and is removed in a fading way, which makes it difficult for this algorithm to subtract the frame with text from the same frame containing the background without the text. When it occurs, this limitation decreases the system accuracy and makes it difficult for the OCR to recognize faded, unclear text; this may cause a system failure in some cases. The system also relies on a pre-trained model (HMVGG16), which may underperform on untrained text styles and fonts. However, our proposed method works particularly well in extracting text from a complex background, with an accuracy of 96.15%, which shows better performance than the other methods proposed in the open literature.

The computational complexity of the proposed system using asymptotic measurements is O(N + p·nl_1 + nl_1·nl_2 + ... + nl_15·nl_16), where 'p' is the number of features and 'nl_i' is the number of neurons in layer 'i' of the neural network.

Using our evaluation dataset in Table 4 above, where the film captions contain four different languages, we can perform the weighted mean for all the system metrics. We find a weighted average insertion error for additionally added captions equal to 6.68% and a weighted average deletion error for missed captions equal to 5.36%. We calculate an average weighted accuracy of 96.15%, an average weighted precision of 93.97%, an average weighted recall of 94.79%, and an average weighted f-score of 94.38%. We finally use the Abbyy SDK as a black-box OCR engine to qualify our system from the point of view of its intended outcome, which is to provide clear, clean plain text to the users of the system. The overall average accuracy is equal to 97.75%.

Lu and Wang (2019) improved automatic positioning accuracy using the ICA feature, which is a stroke segment that constitutes the image caption base. The adaptive iterative localization algorithm has strong adaptability to complex changes of video frames and faster, more accurate detection and positioning of multimedia video lines and blocks. However, the method lacks completeness of feature extraction, and the accuracy of selection and positioning of the candidate regions needs to be improved.

Haq et al. (2019) segment scene boundaries in movies using a convolutional neural network. The method can also be used for keyframe extraction for indexing and retrieval, video abstraction, skim selection, and trailer generation.

Yang et al. (2018) proposed a deep CNN model that is optimized by SGD. They solved the problem of error accumulation during denoising and binarization. However, the method is sensitive to text style, size, etc.

Zhang et al. (2016) detect scene text with multiple orientations using an FCN. The method handles different text orientations, languages, and styles. Failure occurs with extremely low contrast, curvature, strongly reflected light, too-close text lines, or a tremendous gap between characters. It cannot be used in a real-time environment.

Hoang and Tabbone (2010) employ the MCA method using undecimated wavelet and curvelet transforms and promote sparse representation. The system's advantage is that it is invariant to different font styles, sizes, and orientations. Text extraction accuracy is 93.75%.

Audithan and Chandrasekaran (2009) used the Haar Discrete Wavelet Transform (DWT) for edge detection using the Canny detector. Haar is the fastest among all other wavelets because its coefficients are either 1 or −1. The method suppresses false alarms. However, it has a limitation when the gradient colors of text and background are quite close. Text extraction accuracy is 94.76%.

Grover et al. (2009) also obtained good results, with high sensitivity and a low false alarm rate. The method has a limitation when the gradients of intensities of text and image are quite similar. They used a Sobel edge detector and got a text extraction accuracy of 94.80%.

Jung and Kim (2004) combined neural net-based detection with NMF-based filtering. The main drawback is its locality property, i.e. it does not consider the text outside the window. However, they adopt CAMShift to enhance time performance. Text extraction accuracy is 91.88%.

11 Conclusion and future work

In this research, the common problem of imagery text detection and enhancement from videos is discussed. Proposed solutions for processing text videos to detect text automatically and extract it from images are implemented. Different machine learning techniques, like the HMVGG16 and HMLe8 networks, are applied to identify graphical text in the films application using deep convolutional neural networks. The point-tracking technique is adopted for text extraction from its
complex background. A new self-created dataset, "FiViD", is created to be used as a benchmark. The HMVGG16 deep CNN network, which is used to classify frames as text containing or non-text containing, has an accuracy of 98%. Using the film videos dataset to evaluate the graphical caption extraction, the weighted average caption extraction accuracy is equal to 96.15%, the insertion error equals 6.68%, the deletion error equals 5.36%, the precision equals 93.97%, the recall equals 95.27%, and the CDC recognition average accuracy equals 97.75%. The future work in our film multimedia application is to build our own OCR and to decrease the execution time of the frame classifier so that it can run in a real-time environment. Also, we can train our text classifier model to support more languages, like Russian, Indian, etc., and we can enable the user to translate the existing subtitle plain text into any other language of the user's choice.

Acknowledgements I would like to thank God for his help. Special thanks to the RDI team, Dr. Sven Dickinson, and my faculty department members for supporting me with their experience and the data set used in my research.

References

Alves W, Hashimoto R (2010) Text regions extracted from scene images by ultimate attribute opening and decision tree classification. In: Proceedings of the 23rd Sibgrapi conference on graphics, patterns, and images
Audithan S, Chandrasekaran RM (2009) Document text extraction from document images using Haar discrete wavelet transform. Eur J Sci Res 36(04):502–512
Cho H, Sung M, Jun B (2016) Canny text detector: fast and robust scene text localization algorithm. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3566–3573
Dai J, Li Y, He K, Sun J (2016) R-FCN: object detection via region-based fully convolutional networks. In: Advances in neural information processing systems, pp 379–387
Gidaris S, Komodakis N (2015) Object detection via a multi-region and semantic segmentation-aware CNN model. In: Proceedings of the IEEE international conference on computer vision, pp 1134–1142
Gomez L, Karatzas D (2017) Text proposals: a text specific selective search algorithm for word spotting in the wild. Pattern Recogn 70:60–74
Gorinski P, Lapata M (2018) What's this movie about? A joint neural network architecture for movie content analysis. In: University of Edinburgh, Proceedings of NAACL-HLT, pp 1770–1781
Grover S, Arora K, Mitra S (2009) Text extraction from document images using edge information. In: IEEE India Council Conference
Gupta A, Vedaldi A, Zisserman A (2016) Synthetic data for text localization in natural images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2315–2324
Haq I, Muhammad K, Hussain T, Kwon S, Sodanil M, Baik S, Lee M (2019) Movie scene segmentation using object detection and set theory. Int J Distrib Sens Netw 15(6)
He K, Zhang X, Ren S, Sun J (2016a) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He T, Huang W, Qiao Y, Yao J (2016b) Text attentional convolutional neural network for scene text detection. IEEE Trans Image Process 25(6):2529–2541
He P, Huang W, He T, Zhu Q, Qiao Y, Li X (2017) Single shot text detector with regional attention. In: Computer vision and pattern recognition, Cornell University, arXiv:1709.00138
Hesham M, Hani B, Fouad N, Amer E (2018) Smart trailer: automatic generation of movie trailer using only subtitles. In: First international workshop on deep and representation learning (IWDRL), IEEE, pp 26–30
Hoang T, Tabbone S (2010) Text extraction from graphical document images using sparse representation. In: Proceedings of the 9th IAPR international workshop on document analysis systems, pp 143–150
https://pixabay.com/vectors/bitcoin-money-cryptocurrency-4851383/. Accessed 28 Sept 2020
https://www.dreamstime.com/photos-images/autonomous-car.html. Accessed 28 Sept 2020
https://www.freepik.com/premium-photo/engineer-check-control-welding-robotics-automatic-arms-machine_5284742.htm. Accessed 28 Sept 2020
https://www.robots.ox.ac.uk/~vgg/software/textspot/. Accessed 10 June 2020
Huang W, Qiao Y, Tang X (2014) Robust scene text detection with convolution neural network induced MSER trees. In: European conference on computer vision, Springer, Zurich, pp 497–511
Indermühle E, Liwicki M, Bunke H (2010) IAMonDo-database: an online handwritten document database with non-uniform contents. In: Proceedings of the 9th IAPR international workshop on document analysis systems (DAS '10), pp 97–104
Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2016) Reading text in the wild with convolutional neural networks. Int J Comput Vis 116(1):1–20
Jung K, Kim E (2004) Automatic text extraction for content-based image indexing. In: Proceedings of PAKDD, pp 497–507
Kong T, Yao A, Chen Y, Sun F (2016) Hypernet: towards accurate region proposal generation and joint object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 845–853
Liao M, Shi B, Bai X, Wang X, Liu W (2017) Textboxes: a fast text detector with a single deep neural network. In: AAAI, pp 4161–4167
Liu X, Samarabandu J (2006) Multiscale edge-based text extraction from complex images. In: Proceedings of the international conference of multimedia and Expo, pp 1721–1724
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Lu Q, Wang Y (2019) Automatic text location of multimedia video for subtitle frame. J Ambient Intell Humaniz Comput
Moradi M, Mozaffari S, Orouji A (2010) Farsi/Arabic text extraction from video images by corner detection. In: 2010 6th Iranian conference on machine vision and image processing, pp 1–6
Nagabhushan P, Nirmala S (2009) Text extraction in complex color document images for enhanced readability. Intell Inf Manag 2:120–133
Neumann L, Matas J (2012) Real-time scene text localization and recognition. In: Computer vision and pattern recognition (CVPR) IEEE conference, pp 3538–3545
Noh H, Hong S, Han B (2015) Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE international conference on computer vision, Santiago: IEEE Computer Society, pp 1520–1528
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
Shelhamer E, Long J, Darrell T (2017) Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39(4):640–651
Shi J, Tomasi C (1994) Good features to track. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 593–600
Shivakumara P, Dutta A, Pal U, Tan C (2010) A new method for handwritten scene text detection in video. In: International conference on frontiers in handwriting recognition, pp 16–18
Shrivastava A, Gupta A, Girshick R (2016) Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas: IEEE Computer Society, arXiv:1604.03540
Sun L, Huo Q, Jia W, Chen K (2015) A robust approach for text detection from natural scene images. Pattern Recogn 48(9):2906–2920
Tian S, Pan Y, Huang C, Lu S, Yu K, Tan C (2015) Text flow: a unified text detection system in natural scene images. In: Proceedings of the IEEE international conference on computer vision, pp 4651–4659
Tian Z, Huang W, He T, He P, Qiao Y (2016) Detecting text in natural image with connectionist text proposal network. In: European conference on computer vision, pp 56–72
Vijayakumar V, Nedunchezhianm R (2011) A novel method for super imposed text extraction in a sports video. Int J Comput Appl 15(1):1
Xiang D, Yan H, Chen X, Cheng Y (2010) Offline Arabic handwriting recognition system based on HMM. In: 2010 3rd International conference on computer science and information technology
Yang C, Pei W, Wu L, Yin X (2018) Chinese text-line detection from web videos with fully convolutional networks. Big Data Anal 3(2):1
Ye Q, Doermann D (2015) Text detection and recognition in imagery: a survey. IEEE Trans Pattern Anal Mach Intell 37(7):1480–1500
Yin XC, Pei WY, Zhang J, Hao H (2015) Multi-orientation scene text detection with adaptive clustering. IEEE Trans Pattern Anal Mach Intell 37(9):1930–1937
Zamberletti A, Noce L, Gallo I (2014) Text localization based on fast feature pyramids and multi-resolution maximally stable extremal regions. In: Asian conference on computer vision, pp 91–105
Zhang Z, Shen W, Yao C, Bai X (2015) Symmetry based text line detection in natural scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2558–2567
Zhang Z, Zhang C, Shen W, Yao C, Liu W, Bai X (2016) Multi-oriented text detection with fully convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas: IEEE Computer Society, pp 4159–4167
Zhang S, Liu Y, Jin L, Luo C (2018) Feature enhancement network: a refined scene text detector. In: Thirty-second AAAI conference on artificial intelligence (AAAI-18), pp 2612–2619
Zhong Z, Jin L, Zhang S, Feng Z (2016) DeepText: a unified framework for text proposal generation and text detection in natural images. In: Computer vision and pattern recognition, Cornell University, arXiv:1605.07314
Zhou X, Yao C, Wen H, Wang Y, Zhou S, He W, Liang J (2017) EAST: an efficient and accurate scene text detector. In: Computer vision and pattern recognition, Cornell University, arXiv:1704.03155
Zhu Y, Yao C, Bai X (2016) Scene text detection and recognition: recent advances and future trends. Front Comput Sci 10(1):19–36

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
