https://doi.org/10.1007/s12652-021-02951-1
ORIGINAL RESEARCH
An end to end system for subtitle text extraction from movie videos
Hossam Elshahaby · Mohsen Rashwan
Received: 30 April 2020 / Accepted: 4 February 2021 / Published online: 22 February 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH, DE part of Springer Nature 2021
Abstract
We present a new technique for detecting text inside a complex graphical background, extracting it, and enhancing it so that it can be easily recognized using optical character recognition (OCR). The technique uses a deep neural network for feature extraction and for classifying frames as containing text or not. An error handling and correction (EHC) technique is used to resolve classification errors. A multiple frame integration (MFI) algorithm is introduced to extract the graphical text from its background. Text enhancement is done by adjusting the contrast, minimizing noise, and increasing the pixel resolution. A standalone Component-Off-The-Shelf (COTS) software tool is used to recognize the text characters and qualify the system performance. The proposed solution generalizes to multilingual text; a newly created dataset containing videos in different languages is collected for this purpose to be used as a benchmark. A new HMVGG16 convolutional neural network (CNN), used to classify frames as text-containing or non-text-containing, achieves an accuracy of 98%. The introduced system's weighted average caption extraction accuracy is 96.15%. The correctly detected characters (CDC) average recognition accuracy using the Abbyy SDK OCR engine is 97.75%.
Keywords Graphical text · Films application · Multimedia · Text detection · Text extraction · Subtitles frame
4 Problem definition
Table 1 Text or non-text classification experiments comparison using various neural nets

Network | Accuracy of test data | Average precision | Average recall | Average f-score | Training time (h) | Classification time/frame (ms)

Bold highlights that this designed neural network has the highest performance regarding accuracy, precision, recall, etc.
Both methods can be combined in a hybrid way to provide a robust solution for the problem.

6.3 Text image extraction

Features like a pattern or a distinct structure can be found in an image, such as a point, a line, an edge, or a patch. Useful features are repeatably detectable, distinctive, and localizable. Using the geometrical method:

1. Compute the stroke width (Ye and Doermann 2015) metric.
2. Threshold the stroke width variation metric.
3. Process/remove regions based on this metric.
4. Determine the good candidate boundary boxes.
5. Compute their overlap ratio.
6. Find the connected text regions and merge them.

By performing this method on the Avengers: Infinity War video, we obtained an accuracy of 54.55%, as in Table 6.
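As an illustration of the stroke-width filtering in steps 1-3, here is a minimal sketch that assumes binary region masks are already available (e.g. from MSER) and uses the distance transform as a cheap stroke-width estimate; the variation threshold is an illustrative value, not one from the paper:

```python
import cv2
import numpy as np

def stroke_width_variation(region_mask):
    """Rough stroke-width variation of a binary region (uint8, text=255).

    The distance transform approximates half the stroke width; sampling
    every interior pixel is a coarse proxy (a skeleton would be tighter).
    """
    dist = cv2.distanceTransform(region_mask, cv2.DIST_L2, 5)
    widths = 2.0 * dist[dist > 0]
    if widths.size == 0:
        return np.inf
    return widths.std() / widths.mean()

def filter_text_regions(masks, max_variation=0.5):
    """Step 3: keep regions whose stroke width is roughly uniform (text-like)."""
    return [m for m in masks if stroke_width_variation(m) < max_variation]
```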
While using various new deep learning networks for feature extraction to identify whether a frame has text or not (HMLe8, HMAlex12, HMZF13, HMVGG16, HMGoogLe24, and HMRes21), we obtained better results than with the geometrical method. We performed a statistical significance test for all models, as illustrated in Table 1 for the investigated samples. We chose HMVGG16 from all the DNNs because it has the best results concerning classification accuracy, recall, precision, and f-score, although it consumes more pre-training time (done offline) and more processing time than the other models. The HMVGG16 network model, which consists of 16 layers, is illustrated in Fig. 5. All of its convolutional layers have a filter size of 3×3, a stride of 1, and padding of 1. It contains the Convolutional_1 and Convolutional_2 layers with a depth of 128; Max. Pooling_3 with a filter size of 2×2 and a stride of 2; the Convolutional_4, Convolutional_5, and Convolutional_6 layers with a depth of 256; Max. Pooling_7 with a filter size of 2×2 and a stride of 2; the Convolutional_8, Convolutional_9, and Convolutional_10 layers with a depth of 512; Max. Pooling_11 with a filter size of 2×2 and a stride of 2; the fully connected_12 layer with 1024 neurons; the fully connected_13 layer with 1024 neurons; the fully connected_14 layer with 2 neurons; the Softmax_15 layer; and the classification_16 layer, as in Fig. 5.
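A minimal Keras sketch of the described 16-layer topology follows; the input resolution and the ReLU activations are assumptions, since the text specifies only filter sizes, strides, depths, and neuron counts:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_hmvgg16(input_shape=(224, 224, 3)):   # input resolution assumed
    conv = lambda depth: layers.Conv2D(depth, 3, strides=1, padding="same",
                                       activation="relu")  # 3x3, stride 1, pad 1
    pool = lambda: layers.MaxPooling2D(pool_size=2, strides=2)  # 2x2, stride 2
    return models.Sequential([
        tf.keras.Input(shape=input_shape),
        conv(128), conv(128),                # Convolutional_1, _2 (depth 128)
        pool(),                              # Max. Pooling_3
        conv(256), conv(256), conv(256),     # Convolutional_4..6 (depth 256)
        pool(),                              # Max. Pooling_7
        conv(512), conv(512), conv(512),     # Convolutional_8..10 (depth 512)
        pool(),                              # Max. Pooling_11
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),  # fully connected_12
        layers.Dense(1024, activation="relu"),  # fully connected_13
        layers.Dense(2),                        # fully connected_14 (text / non-text)
        layers.Softmax(),                       # Softmax_15; classification_16 = argmax
    ])

model = build_hmvgg16()
# mini-batch size of 64 would be passed to model.fit(batch_size=64)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```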
We used the stochastic gradient descent with momentum (SGDM) optimizer with an initial learning rate (α) of 0.0001 and a mini-batch size of 64. SGDM with a momentum of 0.9 is used to update the network parameters (weights and biases) and minimize the loss function by taking small steps at each iteration in the direction of the negative gradient of the loss; i.e., the algorithm can oscillate along the path of steepest descent towards the optimum, and using the momentum term this oscillation is reduced, as in Eq. (4):

θ_{ℓ+1} = θ_ℓ − α∇E(θ_ℓ) + γ(θ_ℓ − θ_{ℓ−1}), (4)

where α > 0 is the learning rate and γ determines the contribution of the previous gradient step to the current step.
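As a concrete illustration, a minimal NumPy sketch of the update in Eq. (4) (not the framework code actually used for training) is:

```python
import numpy as np

def sgdm_step(theta, theta_prev, grad, alpha=1e-4, gamma=0.9):
    """One update of Eq. (4): theta - alpha*grad + gamma*(theta - theta_prev).

    alpha is the learning rate; the gamma*(theta - theta_prev) momentum
    term damps oscillation along the steepest-descent path.
    """
    theta_new = theta - alpha * grad + gamma * (theta - theta_prev)
    return theta_new, theta  # the current theta becomes the new "previous"
```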
Using the workflow chart in Fig. 6, the extraction of text captions can be performed using each caption's first, middle, and last appearance. The first appearance frame results from subtracting the first frame with text from the one preceding it without text. The last appearance frame is the one coming from subtracting the last text frame from the one just after it.
The middle frame is extracted by multiplying the successive binary middle frames to get a resultant middle frame. For the binary frame data, '0' represents black pixels while '1' represents white pixels. The algorithm processes the data coming from these three frames to get the best resultant frame, which is clear, clean text extracted from the complex movie graphical background.
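A minimal NumPy sketch of this first/last subtraction and middle-frame multiplication is shown below, assuming the frames are already binarized with text pixels equal to 1; the final majority-vote combination is an assumption, since the text only says the three frames are processed together:

```python
import numpy as np

def extract_caption(frames):
    """Multiple frame integration (MFI) over one caption's lifetime.

    frames[0]   : binary frame just before the caption appears
    frames[1:-1]: binary frames while the caption is visible
    frames[-1]  : binary frame just after the caption disappears
    (text pixels = 1, background = 0)
    """
    first = np.clip(frames[1].astype(int) - frames[0], 0, 1)
    last = np.clip(frames[-2].astype(int) - frames[-1], 0, 1)

    middle = np.ones_like(first)
    for f in frames[2:-2]:          # AND of the middle frames keeps only
        middle &= f.astype(int)     # pixels that stay white (static text)

    # combining the three estimates by majority vote is an assumption;
    # the paper only states that the three frames are processed together
    return ((first + middle + last) >= 2).astype(np.uint8)
```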
Tracking the points in films that represent the static subtitled text is done using the Kanade–Lucas–Tomasi (KLT) algorithm (Ye and Doermann 2015) as a feature-tracking algorithm. It works particularly well for tracking objects that do not change shape and that exhibit visual texture, which is why it satisfies our need here. Initial points for tracking the graphical subtitled text caption are obtained using the minimum eigen features (MEF) algorithm (Shi and Tomasi 1994).
Using the forward–backward error threshold as an automatic algorithm for the detection of tracking failures, the tracker tracks each point from the previous frame to the current frame. Then, it tracks the same points back to the previous frame, as seen in Fig. 7. The object calculates the bidirectional error: this value is the distance in pixels from the original location of the points to the final location.
h_{k+1} = h_k + [Σ_x w(x) (G(x) − F(x + h_k)) / F′(x + h_k)] / Σ_x w(x), (6)

where:
F(x): pixel values as a function of each location x in the image.
G(x): pixel values as a function of each location x in the next image.
w(x): weighting function.
F(x + h_k): pixel values as a function of each location (x + h_k) in the image.

The sequence of estimates of h (Ye and Doermann 2015) will converge to the best h.

Fig. 8 Imagery text: a original text input frame from a movie; b output result for pixels tracked over successive movie frames
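As a concrete illustration of Eq. (6), a minimal 1-D NumPy sketch of the iteration follows (an illustration of the registration idea, not the production tracker; the sample grid and the finite-difference derivative are assumptions):

```python
import numpy as np

def lk_register_1d(F, G, w, h0=0.0, iters=10):
    """Estimate the shift h with G(x) ~ F(x + h), per Eq. (6).

    F, G, w are vectorized callables over pixel locations.
    """
    x = np.arange(0.0, 100.0)                        # sample grid (assumed)
    h = h0
    for _ in range(iters):
        dF = F(x + h + 0.5) - F(x + h - 0.5) + 1e-8  # ~F'(x + h); eps avoids /0
        num = np.sum(w(x) * (G(x) - F(x + h)) / dF)
        h += num / np.sum(w(x))                      # Eq. (6) update
    return h
```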
Using the bidirectional error through successive frames is an effective method to eliminate points that cannot be tracked in a reliable way; we keep only the points that have the highest value for the system. However, the bidirectional error requires additional computation.
By applying this KLT algorithm to track static pixels through the successive frames, while calculating the bidirectional error with the predefined forward–backward error threshold, we get an output result as shown in Fig. 8. The algorithm successfully tracked the static pixels representing the text, accurately creating points that show the location of the subtitled text. The algorithm has a stabilizing effect on the points while tracking them through the different frames where they exist. This text stabilization accordingly enhances the quality of the text extracted from the films.
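A minimal OpenCV sketch of this Shi–Tomasi initialization plus forward–backward consistency check is shown below; the parameter values are illustrative assumptions:

```python
import cv2
import numpy as np

def track_static_text_points(prev_gray, curr_gray, fb_threshold=1.0):
    """Forward-backward KLT check: keep points whose bidirectional
    error (in pixels) stays below the threshold."""
    # Shi-Tomasi "minimum eigen features" initialization
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:
        return np.empty((0, 1, 2), np.float32)
    # forward: previous frame -> current frame
    fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    # backward: current frame -> previous frame
    back, st_b, _ = cv2.calcOpticalFlowPyrLK(curr_gray, prev_gray, fwd, None)

    fb_error = np.linalg.norm(pts - back, axis=2).ravel()
    keep = (st_f.ravel() == 1) & (st_b.ravel() == 1) & (fb_error < fb_threshold)
    return fwd[keep]  # reliably tracked (static subtitle) points
```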
6.4 Text image enhancement

Image enhancement techniques are applied before text image localization by adjusting contrast and color intensity, filtering noise, equalizing illumination, etc. They are also applied after text image extraction by using the multi-frame integration algorithm and by increasing the image resolution from 70 dots per inch (dpi) to 300 dpi.
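A minimal OpenCV sketch of such an enhancement pass is shown below; the specific operators (CLAHE, non-local-means denoising, bicubic upscaling) and their settings are assumptions, since the text names only the categories of operations and the 70 to 300 dpi upscaling:

```python
import cv2

def enhance_text_image(gray):
    """Contrast adjustment, denoising, and ~70 dpi -> 300 dpi upscaling."""
    # adaptive contrast / illumination equalization (CLAHE is an assumption)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    out = clahe.apply(gray)
    # noise filtering
    out = cv2.fastNlMeansDenoising(out, None, 10)
    # resolution increase: 300/70 is roughly 4.3x per dimension
    scale = 300.0 / 70.0
    return cv2.resize(out, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC)
```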
7 Error handling and correction

This stage handles the neural network classification errors, which may misclassify a text frame as a non-text frame or vice versa, as well as the human translation errors made by unprofessional translators, such as adding successive different texts without text-free break frames in between. One of the popular techniques for this is the correlation-based technique. It can be used to decide whether a frame is truly a non-text frame or a misclassification, to resolve classification errors. It can also be used to decide whether a frame truly contains the same text or a totally new one, to resolve translation issues, as shown in Fig. 9.
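A minimal sketch of one plausible form of this correlation check follows; the correlation measure (Pearson over pixels) and the threshold are assumptions, as the text does not specify them:

```python
import numpy as np

def frame_correlation(a, b):
    """Pearson correlation between two equal-size grayscale frames."""
    a = a.astype(np.float64).ravel() - a.mean()
    b = b.astype(np.float64).ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def is_same_caption(prev_text_frame, suspect_frame, thresh=0.9):
    """High correlation with the previous text frame suggests the
    'non-text' suspect was a misclassification (or the same caption);
    low correlation suggests a genuinely new or text-free frame."""
    return frame_correlation(prev_text_frame, suspect_frame) >= thresh
```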
Table 5 Processing time/frame comparison

Geometrical (ms) | HMLe8Net (ms) | HMVGG16 (ms)

Table 7 System performance using evaluation dataset

Film | NG | NC | NI | NR

NG: number of ground-truth graphical captions in the film. NC: number of graphical captions correctly detected. NI: number of graphical captions wrongly inserted. NR: number of graphical captions wrongly missed.

Because the films' captions contain four different languages, we can perform the weighted mean using Eq. (7) for all the system metrics indicated above (Table 7).
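Eq. (7) is not reproduced here, but a weighted mean over per-language or per-film metric values is standard; the following minimal sketch assumes NG-style caption counts as the weights, which is an assumption rather than something the text states:

```python
import numpy as np

def weighted_mean(values, weights):
    """Weighted mean of per-film metric values, e.g. accuracy per film
    weighted by that film's ground-truth caption count NG (assumed)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float((values * weights).sum() / weights.sum())
```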
The system metrics equations are calculated as:

Accuracy = (Σ_k NC) / (Σ_k NG) = 96.15% (8)

Precision = (Σ_k NC) / (Σ_k NC + Σ_k NI) = 93.97% (9)

Recall = (Σ_k NC) / (Σ_k NC + Σ_k NR) = 94.79% (10)

f-Score = 2 / (1/Precision + 1/Recall) = 94.38% (11)
+
Precision Recall We propose two techniques to solve the problem of extract-
ing subtitles from movie videos. The first one is geometri-
cal based on style, font size, and language type. While the
second one is based on a deep neural network (DNN) that classifies each frame inside the video as containing text or not. We proposed several deep neural networks: HMLe8, HMAlex12, HMZF13, HMVGG16, HMGoogLe24, and HMRes21. Using statistical analysis, we found that the most powerful network is the HMVGG16 model, which is why we chose it over all the other models. HMVGG16 better solves our problem in terms of accuracy, recall, precision, and f-score, but it has longer pre-training and processing times, while HMLe8 has lower accuracy but better timing considerations. For HMVGG16, the learning rate (α) is adjusted to 0.0001, and the mini-batch size is 64, which is high enough to have larger gradient steps and larger variance in distance. The system has the advantage of working with multiple languages, and it is evaluated using four different languages. In addition, some of the evaluated films can have two different language translations at the same time. One challenging problem occurs when the translated text appears gradually and is removed in a fading way, which makes it difficult for this algorithm to subtract the frame with text from the same frame containing the background without the text. This limitation, when it occurs, decreases the system accuracy and makes it difficult for the OCR to recognize the faded, unclear text. This may cause a system failure in some cases. The system also relies on a pre-trained model (HMVGG16), which may underperform with untrained text styles and fonts. However, our proposed method works particularly well in extracting text from a complex background, with an accuracy of 96.15%, which shows better performance than the other proposed methods in the open literature.
The computational complexity of the proposed system using asymptotic measurements is O(N + p·nl_1 + nl_1·nl_2 + … + nl_15·nl_16), where 'p' is the number of features and 'nl_i' is the number of neurons in layer 'i' of the neural network. Using our evaluation dataset in Table 4 above, where the film captions contain four different languages, we can perform the weighted mean for all the system metrics. We find a weighted average insertion error for additionally added captions of 6.68%, and we compute a weighted average deletion error for missed captions of 5.36%. We calculate the weighted average accuracy to be 96.15%, the weighted average precision 93.97%, the weighted average recall 94.79%, and the weighted average f-score 94.38%. We finally use the Abbyy SDK as a black-box OCR engine to qualify our system from the point of view of its intended outcome, which is to deliver clear, clean plain text to the users of the system. The overall average accuracy is equal to 97.75%.
Lu and Wang (2019) improved automatic positioning accuracy using the ICA feature, which is a stroke segment that constitutes the image caption base. The adaptive iterative localization algorithm has strong adaptability to complex changes of video frames and faster, more accurate detection and positioning of multimedia video lines and blocks. However, the method lacks completeness of feature extraction, and the accuracy of selection and positioning of the candidate regions needs to be improved.
Haq et al. (2019) segment scene boundaries in movies using a convolutional neural network. The method can also be used for keyframe extraction for indexing and retrieval, video abstraction, skim selection, and trailer generation.
Yang et al. (2018) proposed a deep CNN model that is optimized by SGD. They solved the problem of error accumulation during denoising and binarization. However, the method is sensitive to text style, size, etc.
Zhang et al. (2016) detect scene text with multiple orientations using an FCN. The method handles different text orientations, languages, and styles. Failure occurs with extremely low contrast, curvature, strongly reflected light, too-close text lines, or a tremendous gap between characters. It cannot be used in a real-time environment.
Hoang and Tabbone (2010) employ the MCA method using undecimated wavelet and curvelet transforms and promote sparse representation. The system's advantage is that it is invariant to different font styles, sizes, and orientations. Text extraction accuracy is 93.75%.
Audithan and Chandrasekaran (2009) used the Haar Discrete Wavelet Transform (DWT) for text edge detection using the Canny detector. Haar is the fastest among all other wavelets because its coefficients are either 1 or −1. The method suppresses false alarms. However, it has a limitation when the gradient colors of text and background are quite close. Text extraction accuracy is 94.76%.
Grover et al. (2009) obtained good results as well, with high sensitivity and a low false alarm rate. The method has a limitation when the gradients of intensities of text and image are quite similar. They used a Sobel edge detector and got a text extraction accuracy of 94.80%.
Jung and Kim (2004) combined neural net-based detection with NMF-based filtering. The main drawback is its locality property, i.e., it does not consider the text outside the window. However, they adopt CAMShift to enhance time performance. Text extraction accuracy is 91.88%.

11 Conclusion and future work

In this research, the common problem of imagery text detection and enhancement from videos is discussed. Proposed solutions for processing text videos to detect text automatically and extract it from images are implemented. Different machine learning techniques like the HMVGG16 and HMLe8 networks are applied to identify graphical text in a films application using deep convolutional neural networks. The point tracking technique is adopted for text extraction from its
complex background. A new self-created dataset, "FiViD", is created to be used as a benchmark. The HMVGG16 deep CNN network, which is used for frame classification as text-containing or non-text-containing, has an accuracy of 98%. Using the film videos dataset to evaluate the graphical caption extraction, the weighted average caption extraction accuracy is 96.15%, the insertion error is 6.68%, the deletion error is 5.36%, the precision is 93.97%, the recall is 95.27%, and the CDC recognition average accuracy is 97.75%. The future work in our film multimedia application is to build our own OCR and to decrease the execution time of the frame classifier so that it can run in a real-time environment. Also, we can train our text classifier model to support more languages like Russian, Indian, etc. We can enable the user to translate the existing subtitle plain text to any other language of the user's choice.

Acknowledgements I would like to thank God for his help. Special thanks to the RDI team, Dr. Sven Dickinson, and my faculty department members for supporting me with their experience and the dataset used in my research.
References

Alves W, Hashimoto R (2010) Text regions extracted from scene images by ultimate attribute opening and decision tree classification. In: Proceedings of the 23rd Sibgrapi conference on graphics, patterns, and images
Audithan S, Chandrasekaran RM (2009) Document text extraction from document images using Haar discrete wavelet transform. Eur J Sci Res 36(04):502–512
Cho H, Sung M, Jun B (2016) Canny text detector: fast and robust scene text localization algorithm. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3566–3573
Dai J, Li Y, He K, Sun J (2016) R-FCN: object detection via region-based fully convolutional networks. In: Advances in neural information processing systems, pp 379–387
Gidaris S, Komodakis N (2015) Object detection via a multi-region and semantic segmentation-aware CNN model. In: Proceedings of the IEEE international conference on computer vision, pp 1134–1142
Gomez L, Karatzas D (2017) Text proposals: a text specific selective search algorithm for word spotting in the wild. Pattern Recogn 70:60–74
Gorinski P, Lapata M (2018) What's this movie about? A joint neural network architecture for movie content analysis. In: Proceedings of NAACL-HLT, pp 1770–1781
Grover S, Arora K, Mitra S (2009) Text extraction from document images using edge information. In: IEEE India Council Conference
Gupta A, Vedaldi A, Zisserman A (2016) Synthetic data for text localization in natural images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2315–2324
Haq I, Muhammad K, Hussain T, Kwon S, Sodanil M, Baik S, Lee M (2019) Movie scene segmentation using object detection and set theory. Int J Distrib Sens Netw 15(6)
He K, Zhang X, Ren S, Sun J (2016a) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He T, Huang W, Qiao Y, Yao J (2016b) Text attentional convolutional neural network for scene text detection. IEEE Trans Image Process 25(6):2529–2541
He P, Huang W, He T, Zhu Q, Qiao Y, Li X (2017) Single shot text detector with regional attention. In: Computer vision and pattern recognition, Cornell University, arXiv:1709.00138
Hesham M, Hani B, Fouad N, Amer E (2018) Smart trailer: automatic generation of movie trailer using only subtitles. In: First international workshop on deep and representation learning (IWDRL), IEEE, pp 26–30
Hoang T, Tabbone S (2010) Text extraction from graphical document images using sparse representation. In: Proceedings of the 9th IAPR international workshop on document analysis systems, pp 143–150
https://pixabay.com/vectors/bitcoin-money-cryptocurrency-4851383/. Accessed 28 Sept 2020
https://www.dreamstime.com/photos-images/autonomous-car.html. Accessed 28 Sept 2020
https://www.freepik.com/premium-photo/engineer-check-control-welding-robotics-automatic-arms-machine_5284742.htm. Accessed 28 Sept 2020
https://www.robots.ox.ac.uk/~vgg/software/textspot/. Accessed 10 June 2020
Huang W, Qiao Y, Tang X (2014) Robust scene text detection with convolution neural network induced MSER trees. In: European conference on computer vision, Springer, Zurich, pp 497–511
Indermühle E, Liwicki M, Bunke H (2010) IAMonDo-database: an online handwritten document database with non-uniform contents. In: Proceedings of the 9th IAPR international workshop on document analysis systems (DAS '10), pp 97–104
Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2016) Reading text in the wild with convolutional neural networks. Int J Comput Vis 116(1):1–20
Jung K, Kim E (2004) Automatic text extraction for content-based image indexing. In: Proceedings of PAKDD, pp 497–507
Kong T, Yao A, Chen Y, Sun F (2016) Hypernet: towards accurate region proposal generation and joint object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 845–853
Liao M, Shi B, Bai X, Wang X, Liu W (2017) Textboxes: a fast text detector with a single deep neural network. In: AAAI, pp 4161–4167
Liu X, Samarabandu J (2006) Multiscale edge-based text extraction from complex images. In: Proceedings of the international conference on multimedia and Expo, pp 1721–1724
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Lu Q, Wang Y (2019) Automatic text location of multimedia video for subtitle frame. J Ambient Intell Humaniz Comput
Moradi M, Mozaffari S, Orouji A (2010) Farsi/Arabic text extraction from video images by corner detection. In: 2010 6th Iranian conference on machine vision and image processing, pp 1–6
Nagabhushan P, Nirmala S (2009) Text extraction in complex color document images for enhanced readability. Intell Inf Manag 2:120–133
Neumann L, Matas J (2012) Real-time scene text localization and recognition. In: Computer vision and pattern recognition (CVPR) IEEE conference, pp 3538–3545
Noh H, Hong S, Han B (2015) Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE international conference on computer vision, Santiago: IEEE Computer Society, pp 1520–1528
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
Shelhamer E, Long J, Darrell T (2017) Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39(4):640–651
Shi J, Tomasi C (1994) Good features to track. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 593–600
Shivakumara P, Dutta A, Pal U, Tan C (2010) A new method for handwritten scene text detection in video. In: International conference on frontiers in handwriting recognition, pp 16–18
Shrivastava A, Gupta A, Girshick R (2016) Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas: IEEE Computer Society, arXiv:1604.03540
Sun L, Huo Q, Jia W, Chen K (2015) A robust approach for text detection from natural scene images. Pattern Recogn 48(9):2906–2920
Tian S, Pan Y, Huang C, Lu S, Yu K, Tan C (2015) Text flow: a unified text detection system in natural scene images. In: Proceedings of the IEEE international conference on computer vision, pp 4651–4659
Tian Z, Huang W, He T, He P, Qiao Y (2016) Detecting text in natural image with connectionist text proposal network. In: European conference on computer vision, pp 56–72
Vijayakumar V, Nedunchezhian R (2011) A novel method for superimposed text extraction in a sports video. Int J Comput Appl 15(1):1
Xiang D, Yan H, Chen X, Cheng Y (2010) Offline Arabic handwriting recognition system based on HMM. In: 2010 3rd International conference on computer science and information technology
Yang C, Pei W, Wu L, Yin X (2018) Chinese text-line detection from web videos with fully convolutional networks. Big Data Anal 3(2):1
Ye Q, Doermann D (2015) Text detection and recognition in imagery: a survey. IEEE Trans Pattern Anal Mach Intell 37(7):1480–1500
Yin XC, Pei WY, Zhang J, Hao H (2015) Multi-orientation scene text detection with adaptive clustering. IEEE Trans Pattern Anal Mach Intell 37(9):1930–1937
Zamberletti A, Noce L, Gallo I (2014) Text localization based on fast feature pyramids and multi-resolution maximally stable extremal regions. In: Asian conference on computer vision, pp 91–105
Zhang Z, Shen W, Yao C, Bai X (2015) Symmetry based text line detection in natural scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2558–2567
Zhang Z, Zhang C, Shen W, Yao C, Liu W, Bai X (2016) Multi-oriented text detection with fully convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas: IEEE Computer Society, pp 4159–4167
Zhang S, Liu Y, Jin L, Luo C (2018) Feature enhancement network: a refined scene text detector. In: Thirty-second AAAI conference on artificial intelligence (AAAI-18), pp 2612–2619
Zhong Z, Jin L, Zhang S, Feng Z (2016) DeepText: a unified framework for text proposal generation and text detection in natural images. In: Computer vision and pattern recognition, Cornell University, arXiv:1605.07314
Zhou X, Yao C, Wen H, Wang Y, Zhou S, He W, Liang J (2017) EAST: an efficient and accurate scene text detector. In: Computer vision and pattern recognition, Cornell University, arXiv:1704.03155
Zhu Y, Yao C, Bai X (2016) Scene text detection and recognition: recent advances and future trends. Front Comput Sci 10(1):19–36

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.