Benchmarking Probabilistic Deep Learning Methods For License Plate Recognition
Abstract— Learning-based algorithms for automated license plate recognition implicitly assume that the training and test data are well aligned. However, this may not be the case under extreme environmental conditions, or in forensic applications where the system cannot be trained for a specific acquisition device. Predictions on such out-of-distribution images have an increased chance of failing. But this failure case is oftentimes hard to recognize for a human operator or an automated system. Hence, in this work we propose to model the prediction uncertainty for license plate recognition explicitly. Such an uncertainty measure allows detecting false predictions, indicating to an analyst when not to trust the result of the automated license plate recognition. In this paper, we compare three methods for uncertainty quantification on two architectures. The experiments on synthetic noisy or blurred low-resolution images show that the predictive uncertainty reliably finds wrong predictions. We also show that a multi-task combination of classification and super-resolution improves the recognition performance by 109% and the detection of wrong predictions by 29%.
Index Terms— License plate recognition, uncertainty, multi-task learning.
I. INTRODUCTION
neural networks. When the test images differ too much from the training distribution, neural networks are prone to silent failures. Since the exact acquisition setup is unknown, the training data needs to cover different combinations of camera models, environmental factors, and image degradation types. However, the cost and effort required to cover all of these conditions put the feasibility of a truly “complete” dataset into question.
Modeling the uncertainty of neural networks regarding their predictions has become an increasingly important area of research because of similar challenges in computer vision and image forensics [4], [5], [6]. In addition to the prediction of the neural network, a confidence estimate, called predictive uncertainty, gives a clue whether to trust the prediction. Ovadia et al. [7] group these approaches under the name probabilistic deep learning. Possible techniques to gather confidence estimates, among others, are Bayesian neural networks (BNN) [4], deep ensembles [8], or Monte Carlo dropout [9]. Each of these techniques is explained in detail in Sec. II and Sec. III.
Such confidence estimates are worth investigating in the
Authorized licensed use limited to: RVR & JC College of Engineering. Downloaded on January 31,2024 at 04:21:45 UTC from IEEE Xplore. Restrictions apply.
9204 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 24, NO. 9, SEPTEMBER 2023
SCHIRRMACHER et al.: BENCHMARKING PROBABILISTIC DEEP LEARNING METHODS FOR LICENSE PLATE RECOGNITION 9205
the networks may fuse these steps into a single end-to-end pipeline [31].
For license plate detection, YOLO [25] is a popular choice [1], [2], [32]. Zhang et al. [33] use a Mask R-CNN [34]. Qiao et al. [35] replace license plate detection with a position-aware mask attention module that directly detects characters in the image. Li et al. [31] propose a parallel end-to-end approach of license plate detection and recognition. Both tasks share convolutional features and are then split into two branches.
For license plate recognition, there are two main neural network architectures currently adopted in the research. One design consists of only convolutional layers [1], [3], [36]. Another design utilizes a combination of convolutional and recurrent layers [33], [37], [38], [39]. The feature sequence generated by the recurrent layers is transcribed into a label using connectionist temporal classification [40].
A particular challenge for all recognition systems is low-quality images. In police investigations, for example, low-quality images impede the investigation. Low resolution and compression, to name the most limiting factors, prohibit reading the license plate number by looking at the image.
There are two different approaches to improve the recognition performance on low-quality images: some work has been done in license plate recognition in combination with image processing. Others train a neural network on very low-quality images.
A combination of image denoising and license plate recognition is proposed by Rossi et al. [13]. The first convolutional neural network removes the noise in the image. The denoised and noisy images are both processed by the second CNN to predict the license plate number. Seibel et al. [41] combine multi-frame super-resolution and two optical character recognition (OCR) systems in a sequential framework for low-quality surveillance cameras. Schirrmacher et al. [12] showed that the combination of super-resolution and character recognition, called SR², achieves superior performance when performed in parallel. The parallel arrangement mitigates the issue of error propagation and makes the super-resolution more robust to unseen noise.
The training dataset in [36] contains low-resolution and noisy US license plates. With this dataset, a CNN can recognize two sets of three characters each of license plates in images with a reduced quality. Lorch et al. [11] extend their CNN to seven separate outputs, one for each character. A null character allows for the recognition of license plates of varying lengths. Kaiser et al. [3] additionally investigate the influence of compression on the recognition rate of synthetic Czech license plates.
For both ALPR and FLPR, detection of license plates from different countries is challenging besides varying image degradations in unconstrained settings [42]. One challenge is the availability of public datasets. This allows related work to test their methods at least in some countries [31]. Another challenge is the layout of the license plate that can greatly vary between countries. As previous work has shown [3], the neural network learns which character is possible at which position. This may lead to a training-test mismatch when only limited training datasets are available [42]. We see great potential in probabilistic deep learning for multi-national license plate recognition. The probabilistic network can be trained on the existing datasets to cover license plates of some countries. Predictions of license plates whose layout varies greatly from the license plates in the training dataset are then marked as uncertain by the network.
B. Predictive Uncertainty
Out-of-distribution samples most likely appear in real-world applications. Common feedforward neural networks guess their prediction on out-of-distribution data. Quantifying predictive uncertainty allows specifying the reliability of a neural network’s prediction. Predictive uncertainty is often split into aleatoric and epistemic uncertainty. Epistemic uncertainty expresses the uncertainty of the model and can be decreased by adding more training data. Aleatoric uncertainty captures the uncertainty regarding the data, e.g., due to noise in the observations or labels [4], [43]. In this work, we do not differentiate between aleatoric and epistemic uncertainty and only estimate predictive uncertainty.
Some works directly estimate predictive uncertainty based on the network’s prediction, e.g., the softmax output [44]. Guo et al. [45], for example, propose to rescale the output of the neural network in a post-processing step. However, it has been shown that these softmax statistics can be misleading [6].
A principled approach to obtain predictive uncertainties is via Bayesian modeling [5], [46], [47], [48]. To this end, Bayesian neural networks (BNN) learn a distribution over possible weights. The predictive distribution is obtained by marginalizing over the weight distribution. In practice, however, BNNs require restrictive approximations of the weight distributions or expensive numerical sampling. Due to these difficulties, more scalable alternatives to BNNs have been developed.
The most straightforward alternative for estimating predictive uncertainty is a deep ensemble [8]. A deep ensemble comprises multiple neural networks which are trained on the same task. Wen et al. propose BatchEnsemble [10], an ensemble technique that requires significantly fewer computations and memory than deep ensembles. Gal et al. [9] introduce Monte Carlo dropout (MC-dropout), which allows obtaining predictive uncertainty by applying dropout also during testing. In this paper, we consider deep ensemble, BatchEnsemble, and MC-dropout for our experiments.
III. METHODS
This section provides a concise description of the methods employed in the experiments. The first part gives an overview of the two neural network architectures that serve as the base for the probabilistic deep learning methods. The second part explains the three probabilistic deep learning methods deep ensemble, BatchEnsemble, and MC-dropout.
A. Backbone Neural Network Architectures for License Plate Recognition
We evaluate the efficacy of the probabilistic deep learning methods with two neural network architectures as a backbone.
The first backbone is a license plate recognition CNN [3], [11]. The second backbone employs license plate recognition and super-resolution in the multi-task learning framework SR² [12].
We adopt the following implementation details for both backbones: all neural network architectures use ReLU as an activation function. Additionally, batch normalization [49] is performed after each trainable layer. As a result, the following order is used in all architectures presented in the paper: trainable layer - batch normalization - ReLU.
1) License Plate Recognition CNN: The license plate recognition CNN consists of convolutional layers, max pooling layers, and fully-connected layers. All convolutional layers have a receptive field of 3 × 3. The max pooling layers have a pool size of 2 × 2. In contrast to [3], our proposed license plate recognition CNN contains additional batch normalization layers to stabilize training.
The architecture is structured as follows. After the input, there are three sequences of two convolutional layers followed by a max pooling layer with 64 filter kernels, 128 filter kernels, and 256 filter kernels, respectively. Then, there are two blocks, each with a convolutional layer with 512 filter kernels followed by max pooling. Then, the feature maps are flattened, followed by two fully-connected layers with 1024 and 2048 nodes. Finally, the CNN has seven fully-connected output layers with 37 nodes each. Here, softmax replaces the ReLU activation, and batch normalization is omitted.
2) SR²: SR² consists of shared layers followed by a split into two branches, one for super-resolution and the other for license plate recognition. We use FSRCNN [50] for the super-resolution and the baseline CNN [11] for the license plate recognition. FSRCNN was selected because its first layer is similar to that of the license plate recognition CNN (LPR CNN). Therefore, the shared layers in the SR² framework do not differ from the original layers of the individual CNNs.
FSRCNN consists of five steps. First, features are extracted using a convolutional layer with 56 filter kernels and a receptive field of 5 × 5. Then, the feature maps are shrunk using 12 filter kernels with a receptive field of 1 × 1. Afterward, four convolutional layers with 12 filter kernels each and a receptive field of 3 × 3 perform a mapping. Next, one convolutional layer with 56 filter kernels and a receptive field of 1 × 1 expands the feature maps. The last step is a convolutional layer with 192 filter kernels and a receptive field of 9 × 9 followed by a pixel shuffling.
In line with [12], we propose a parallel arrangement of the license plate recognition CNN and the super-resolution FSRCNN. The tasks share the first convolutional layer of FSRCNN and then split into two branches in our setup. The loss function is the weighted sum of the individual loss functions. We choose $w_{lpr} = 20$ and $w_{sr} = 1$ as weights for the loss of the license plate recognition and super-resolution, respectively.
The addition of a super-resolution branch to the classification network boosts the predictive uncertainty of the classification. Super-resolution is particularly useful in studying predictive uncertainty since it is more sensitive to unseen degradations [19]. Therefore, smaller degradations potentially lead to higher predictive uncertainty. In general, multi-task learning helps to generalize and acts as a regularizer. Thus, the license plate recognition performance also benefits from the additional task. The multi-task framework can better identify relevant features when both tasks extract the same features, such as the edges of the characters [16], [17], [18].
B. Uncertainty Quantification Methods
This paper compares three different probabilistic deep learning methods in the context of license plate recognition. These methods are Monte Carlo dropout [9], deep ensembles [8], and BatchEnsemble [10]. Dropout is a common technique to regularize neural networks. The work by Lorch et al. [11], for example, uses dropout in their CNN. Thus, models trained with dropout can benefit from our findings without the need for re-training. Existing methods that do not utilize dropout during training might consider using their pipeline to train multiple models to get a deep ensemble. As deep ensembles require high computational power and memory, we additionally explore the efficacy of BatchEnsemble on the task of license plate recognition. The remainder of this section presents details on the configuration of these three approaches.
Dropout [51] is a well-known regularization technique in the area of deep learning. Typically, dropout is applied only during training. However, Gal et al. [9] propose to use dropout not only during training but also during testing, called Monte Carlo dropout (MC-dropout). The inference step is performed multiple times. They show that the obtained variance between the predictions “minimizes the Kullback-Leibler divergence between an approximate distribution and the posteriori distribution of a Gaussian process” [9]. Thus, the obtained predictive uncertainty gives a valid statement. Inference runs are performed multiple times with the same data to quantify the predictive uncertainty. Due to the random dropout of nodes, each inference run gives slightly different results. The number of trainable parameters is adapted to ensure that higher dropout rates do not reduce the representational power of the model. Therefore, the number of filters in each convolutional layer and the number of nodes in the fully-connected layers are increased by a factor of $1/\sqrt{1 - r}$, where $r$ denotes the dropout rate. Thus, the CNN with dropout rate $r = 0.5$ has twice the trainable parameters compared to the CNN without dropout: the number of trainable weights in a layer scales with both its input size and its own size, so it grows by $1/(1 - r)$. Without the square root in the factor, the number of parameters would quadruple.
Arguably, deep ensembles are the most straightforward approach to estimating predictive uncertainty [8]. A deep ensemble comprises multiple neural networks which are trained on the same task. Due to random initialization of the weights and random data shuffling, each model ends up in a different local minimum with high probability [52]. Thus, the trained parameters are different in each model. During testing, the difference between the models’ predictions expresses predictive uncertainty. However, the training of deep ensembles is time-consuming, and the deep ensemble requires much memory. Since each ensemble member has a similar behavior but makes different errors, uncertainty can be quantified well.
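As a minimal sketch of how the disagreement between members can be turned into a number, the following Python snippet averages the class probabilities of several members (ensemble members or MC-dropout runs) for one character position and reports the spread of the winning class. The function name `summarize_members` and the choice to take the standard deviation of the predicted class's probability are illustrative assumptions; the text only states that the difference between the members' predictions expresses predictive uncertainty.

```python
from statistics import mean, pstdev


def summarize_members(member_probs):
    """Combine per-member class probabilities for ONE character position.

    member_probs: list of probability vectors, one per ensemble member
    or MC-dropout run, all of the same length.
    Returns (predicted_class, uncertainty).
    """
    num_classes = len(member_probs[0])
    # Mean prediction over all members, class by class.
    mean_probs = [mean([p[c] for p in member_probs]) for c in range(num_classes)]
    predicted = max(range(num_classes), key=lambda c: mean_probs[c])
    # Assumption: uncertainty is the standard deviation, across members,
    # of the probability assigned to the predicted class.
    uncertainty = pstdev([p[predicted] for p in member_probs])
    return predicted, uncertainty


# Toy inputs with 3 classes instead of 37: members that agree vs. disagree.
agreeing = [[0.90, 0.05, 0.05], [0.88, 0.07, 0.05], [0.92, 0.04, 0.04]]
disagreeing = [[0.90, 0.05, 0.05], [0.20, 0.70, 0.10], [0.40, 0.50, 0.10]]

pred_a, unc_a = summarize_members(agreeing)
pred_d, unc_d = summarize_members(disagreeing)
```

With these toy inputs both cases select class 0 from the mean prediction, but the disagreeing members yield a clearly larger uncertainty value, which is the signal a threshold could act on.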
and k = 7 (right). The last row shows the low-resolution images with vertical blur with kernel size k = 3 (left), k = 5 (middle), and k = 7 (right).
Additionally, we perform experiments on the CCPD base dataset [54]. The dataset contains labeled images with Chinese license plates captured from a city parking management company. Along with the bounding boxes of the license plates, the dataset provides the license plate numbers. Different weather conditions, rotation, and blur lead to images of varying image quality. The provided training set is split into training and validation with 80 000 and 20 000 images, respectively. The proposed LPR CNN is tested on the base test set with 100 000 images.
2) Evaluation Protocol: The license plate recognition CNN outputs seven vectors with 37 elements each. One vector represents one character in the license plate number. The position of the maximum element of each vector is the predicted character. Throughout the experiments, the characters in the license plate are considered individually except for the comparison on the CCPD dataset.
Each probabilistic deep learning method provides multiple predictions for one input image. Deep ensembles provide one prediction for each ensemble member. BatchEnsemble requires replication of the input according to the number of ensemble members. Then, the output of the BatchEnsemble is split into the individual predictions. MC-dropout performs a number of inference runs to obtain multiple predictions. For each method, the mean and the standard deviation of the predictions are computed.
In the experiments, three different aspects of license plate recognition are considered. First, we evaluate the character recognition performance of the probabilistic deep learning techniques. Second, we test if the predictive uncertainty can be used to detect false predictions. Lastly, we investigate influencing factors that change the distribution of the predictive uncertainty values and thus the detection of false predictions.
We use the mean prediction to measure the accuracy of the character recognition. An accuracy of 1 means all characters in the test set are predicted correctly. The number of correctly predicted license plates is counted and divided by the total number of license plates in the dataset to compute the license plate accuracy.
The standard deviation of the predictions of each character represents the predictive uncertainty. A prediction can either be correct or false. Additionally, each prediction has been assigned a predictive uncertainty value. Using a threshold on the predictive uncertainty values, we can identify false predictions. We measure how well false predictions can be identified for a given threshold with the precision-recall curve. A true positive is a false prediction whose predictive uncertainty value is above the threshold. A false positive is a correct prediction whose predictive uncertainty value is above the threshold. A false negative is a false prediction with a predictive uncertainty value below the threshold. We measure the area-under-the-curve (AUC) of the precision-recall curve with varying thresholds. A large AUC indicates well-separated predictive uncertainty values for correct and false predictions. To account for class imbalance, we perform random subsampling. The probabilistic deep learning techniques achieve a perfect character recognition performance on images with no degradation. With increasing strength of the degradation, a misclassification becomes more likely. For the neural networks, the strength of the degradation where the first misclassification happens can vary. Note that the computation of the AUC is possible only if there is a wrong classification. Therefore, we have a different number of AUC values for the different techniques. Thus, we compute the mean AUC of the precision-recall curve in all tables only on the DS-Hard dataset. Also, these strongly degraded images are relevant for FLPR. Since the police investigator cannot verify the network’s prediction, they rely on the predictive uncertainty to identify false predictions.
The hyperparameters of the probabilistic deep learning techniques and the degradation strength change the distribution of the predictive uncertainty values. We quantify the distribution of the predictive uncertainty values of correct and false predictions with the median and the interquartile range. The interquartile range is the difference between the 75th and 25th percentiles and measures the spread of the distribution.
3) Training: For deep ensemble, BatchEnsemble, and MC-dropout, we obtain five predictions. If not stated differently, five inference runs are performed with MC-dropout to ensure fairness to the other methods. The deep ensemble and MC-dropout are trained with a batch size of 32. The batch size of BatchEnsemble is raised to 160 by replicating the batch five times. For the deep ensemble and MC-dropout, we use He normal to initialize the trainable parameters. BatchEnsemble uses random initialization drawn from a normal distribution with a mean of 1.0 and a standard deviation of 0.5, as specified in the official implementation.
For each of the three probabilistic deep learning methods, either the license plate recognition CNN or SR² serves as a backbone. The loss is the averaged cross-entropy loss of each position. The mean absolute error loss is used for training the super-resolution network. The structural similarity index measure (SSIM) indicates potential overfitting when applied to the validation data.
Adam is used for optimization with the standard parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-7}$. The learning rate for dropout is set to 0.001, while the deep ensemble and BatchEnsemble are trained with a learning rate of 0.00001. We set the L2 kernel regularizer to 0.0001 for dropout and 0.01 for the deep ensemble and the BatchEnsemble. Each model is trained for 55 epochs. Additionally, the learning rate is reduced during training when the validation loss stagnates for more than five epochs. The learning rate decay is set to 0.2. The models are trained using TensorFlow 2.4.1 and evaluated using scikit-learn 0.24.1. The training ran on an NVIDIA GeForce RTX 2080 Ti GPU.
B. Comparison of Probabilistic Deep Learning Methods
This section compares the accuracy and predictive uncertainty of BatchEnsemble, deep ensemble, and MC-dropout with the LPR CNN as a backbone. We evaluate different dropout rates r for MC-dropout, denoted as MC-dropout-r.
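The detection of false predictions by thresholding the predictive uncertainty, as defined in the evaluation protocol, can be sketched in plain Python. The function names `precision_recall` and `pr_auc` are hypothetical, and two conventions are assumptions rather than details taken from the experiments: precision is set to 1 when nothing is flagged, and the area is obtained by trapezoidal integration over recall. The random subsampling used to balance the classes is omitted here.

```python
def precision_recall(uncertainties, is_false, threshold):
    """Precision and recall of flagging predictions with uncertainty > threshold.

    The positive class is a FALSE prediction, following the protocol's
    definitions of true positive, false positive, and false negative.
    """
    pairs = list(zip(uncertainties, is_false))
    tp = sum(1 for u, f in pairs if f and u > threshold)
    fp = sum(1 for u, f in pairs if not f and u > threshold)
    fn = sum(1 for u, f in pairs if f and u <= threshold)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention (assumption)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pr_auc(uncertainties, is_false):
    """Area under the precision-recall curve, sweeping the threshold downward."""
    thresholds = sorted(set(uncertainties), reverse=True)
    thresholds.append(min(uncertainties) - 1.0)  # last threshold flags everything
    curve = [precision_recall(uncertainties, is_false, t) for t in thresholds]
    area = 0.0
    for (p0, r0), (p1, r1) in zip(curve, curve[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0  # trapezoid over recall
    return area


# Toy data: five characters, two of them misclassified.
unc = [0.30, 0.25, 0.05, 0.02, 0.01]
auc_separated = pr_auc(unc, [True, True, False, False, False])
auc_inverted = pr_auc(unc, [False, False, False, True, True])
```

When the two misclassified characters carry the highest uncertainty, the sketch yields an area of 1; when they carry the lowest, the area drops well below that, which mirrors the statement that a large AUC indicates well-separated uncertainty values.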
Fig. 3. Accuracy on the DS-Full dataset. The mean character accuracy of the BatchEnsemble, dropout, and the deep ensemble on the synthetic test dataset.
The low-resolution images are corrupted with horizontal and vertical blur with varying kernel size k. Moreover, Gaussian noise with standard deviation σ or
salt & pepper noise with probability p is added to the low-resolution images.
TABLE II
Accuracy on the DS-Full Dataset. Mean Character Recognition Accuracy of the BatchEnsemble, Dropout, and the Deep Ensemble on Images Impaired by Two Types of Degradations
Fig. 4. AUC on the DS-Full dataset. The AUC of the precision-recall curve of the BatchEnsemble, MC-dropout, and the deep ensemble on the synthetic test
dataset. The low-resolution images are corrupted with horizontal or vertical blur with varying kernel size k. Moreover, Gaussian noise with standard deviation
σ or salt & pepper noise with probability p is added to the low-resolution images.
TABLE III
AUC on the DS-Hard Dataset. Mean AUC of the Precision-Recall Curve for the BatchEnsemble, MC-Dropout, and the Deep Ensemble
reports the AUCs for detecting false predictions from the magnitude of the uncertainty on the DS-Full dataset. We visualize the results for salt & pepper noise (left), additive Gaussian noise (middle left), horizontal blur (middle right), and vertical blur (right) for increasing strength of degradation. The y-axis shows the AUC. The x-axis visualizes the increasing strength of the degradation. The line plots start at those points where the models no longer reach an accuracy of 1.
Since salt & pepper noise is more challenging, misclassifications already occur for $p > 10^{-4}$, hence we start to report the AUCs from this point on. The competing methods, except for BatchEnsemble, achieve a stable AUC of around 0.96 and 0.92 for MC-dropout-0.1 and MC-dropout-0.4, respectively. When additive Gaussian noise is present, the competing methods, except MC-dropout-0.4, achieve an accuracy of 1 for σ < 0.05 or σ < 0.075. Thus, we cannot compute an AUC. MC-dropout-0.4 has a stable AUC in the range of 0.9 to 0.95 for 0.0001 ≤ σ ≤ 0.75. From this point, however, all AUC values sharply drop. For horizontal and vertical blur, the methods exhibit similar behavior. The AUC drops drastically for k > 3 by about 0.2 to 0.3. It can be observed that BatchEnsemble generally performs poorly and mostly achieves an AUC of around 0.5.
All competing methods struggle with strongly degraded images. Table III reports the mean AUC for these strongly degraded images on the DS-Hard dataset. For salt & pepper noise (left), MC-dropout-0.1 best separates the predictive uncertainty values of false and correct predictions. The deep ensemble achieves the highest AUC on additive Gaussian noise (middle left) with 0.667. For all other degradation types, MC-dropout-0.1 achieves the best results. Vertical blur (right) and additive Gaussian noise follow closely. The worst results are obtained when salt & pepper noise (0.625) and horizontal blur (0.586) are present in the images. With this experiment, we aim at visualizing the borderline case. The test images with strong degradation vary greatly from the clean low-resolution images in the training data. In a real-world scenario, the difference might not be that big. Therefore, the AUC values should not be seen absolutely but relatively as a comparison of the methods. But even in the borderline case, the competing methods except BatchEnsemble provide significantly better results than guessing.
Table IV reports the F1 score of detecting a false prediction at a false positive rate of 5%. The F1 score is the harmonic mean of precision and recall. Hence, in contrast to the summary statistics of the AUC, this metric provides the performance at a specific threshold. The relative performance of MC-dropout
TABLE IV
Mean F1 Score of Detecting False Prediction at a 5% False Positive Rate for BatchEnsemble, MC-Dropout, and Deep Ensemble on the DS-Hard Dataset
TABLE V
Accuracy on the DS-Full Dataset. Mean Character Recognition Accuracy of MC-Dropout With 5 and 50 Inference Runs and Varying Dropout Rate. For Each Degradation Type, the Values Represent the Mean Accuracy on All Degradation Levels
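To illustrate the metric named in the caption of Table IV, the sketch below picks an uncertainty threshold so that at most 5% of the correct predictions are flagged and then computes the F1 score of detecting false predictions at that operating point. How the operating point was selected in the experiments is not described here, so the selection rule, the helper name `f1_at_fpr`, and the toy data are all assumptions made for illustration.

```python
def f1_at_fpr(uncertainties, is_false, max_fpr=0.05):
    """F1 score of flagging false predictions at a bounded false positive rate.

    A prediction is flagged when its uncertainty is strictly above the
    threshold. Returns (f1, threshold).
    """
    correct = sorted((u for u, f in zip(uncertainties, is_false) if not f),
                     reverse=True)
    wrong = [u for u, f in zip(uncertainties, is_false) if f]
    # Number of correct predictions that may be flagged; the small epsilon
    # guards against floating-point round-off in the product.
    allowed = int(max_fpr * len(correct) + 1e-9)
    # Placing the threshold at the (allowed + 1)-th largest uncertainty of
    # the correct predictions leaves at most `allowed` of them above it.
    threshold = correct[allowed] if allowed < len(correct) else float("-inf")
    tp = sum(1 for u in wrong if u > threshold)
    fp = sum(1 for u in correct if u > threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / len(wrong) if wrong else 0.0
    if precision + recall == 0.0:
        return 0.0, threshold
    # F1 is the harmonic mean of precision and recall.
    return 2.0 * precision * recall / (precision + recall), threshold


# Toy data: 20 correct predictions and 4 false ones.
correct_unc = [i / 100 for i in range(1, 21)]   # 0.01 ... 0.20
wrong_unc = [0.50, 0.40, 0.195, 0.05]
f1, thr = f1_at_fpr(correct_unc + wrong_unc, [False] * 20 + [True] * 4)
```

In this toy case one of the 20 correct predictions is flagged (a false positive rate of 5%), three of the four false predictions are detected, and both precision and recall equal 0.75.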
Fig. 5. Median of the predictive uncertainty values displayed for increasing number of inference runs. The dashed lines show the median predictive uncertainty
values of falsely classified characters, and the solid lines the median predictive uncertainty values of correctly classified characters.
Fig. 6. Interquartile range of predictive uncertainty values displayed for increasing number of inference runs. The dashed lines show the interquartile range
of the predictive uncertainty values of falsely classified characters and the solid lines the interquartile range of the predictive uncertainty values of correctly
classified characters.
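The two summary statistics plotted in Figs. 5 and 6 can be sketched with the Python standard library. The helper name `median_and_iqr` and the toy uncertainty values are hypothetical, and the quartile convention (the default method of `statistics.quantiles`) is an assumption, since percentile definitions differ slightly between libraries.

```python
from statistics import median, quantiles


def median_and_iqr(values):
    """Median and interquartile range (75th minus 25th percentile)."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    return median(values), q3 - q1


# Toy predictive uncertainty values for correct and false predictions.
unc_correct = [0.01, 0.02, 0.02, 0.03, 0.04]
unc_false = [0.10, 0.15, 0.20, 0.30, 0.45]

m_c, iqr_c = median_and_iqr(unc_correct)
m_f, iqr_f = median_and_iqr(unc_false)
# False predictions are detectable better than guessing only if m_f > m_c.
separable = m_f > m_c
```

A large gap between the two medians together with small interquartile ranges corresponds to the well-separated, narrow distributions that make a single threshold effective.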
from an increasing number of inference runs. The behavior of MC-dropout on images corrupted with horizontal blur (middle right) is slightly different. Here, MC-dropout-0.2 achieves the highest AUC. Vertical blur (right) provokes a similar behavior of MC-dropout as noise, but the AUC slightly increases with increasing dropout rate for r ≤ 0.4.
We conclude from Subsec. IV-C1 and Subsec. IV-C2 that ALPR and FLPR require two different tuning strategies. In FLPR, lower numbers of inference runs are advised. The criminal investigator is not able to visually verify the prediction of the neural network. Thus, reliable detection of false predictions is a desired feature of the license plate recognition CNN. In ALPR, the number of inference runs can be set to a larger value. Here, the accuracy of the license plate recognition is important.
3) Evaluation of Predictive Uncertainty: Two different factors cause poor detection of false predictions. First, with an increasing number of inference runs, we observed a decrease in the AUC. Second, with the increasing strength of the degradation, the AUC decreases. This behavior indicates that the predictive uncertainties of correct and wrong predictions are poorly separable. Therefore, we examine more closely how the distribution of the predictive uncertainty values of correct (solid) and false (dashed) predictions changes for different numbers of inference runs and increasing strengths of degradation. We use the median and the interquartile range to quantify the distribution. The predictive uncertainties are computed on a subset of 1 000 test images. The experiments are conducted with MC-dropout-0.1 (orange) and MC-dropout-0.4 (green), since these models achieve the highest AUC and accuracy, respectively.
To visualize the influence of inference runs on the detection of false predictions, we choose three degradations with varying strength. While salt & pepper noise with p = 0.01 is less severe, additive Gaussian noise with σ = 0.2 and horizontal blur with k = 5 significantly lower the image quality.
Figure 5 visualizes the median predictive uncertainty of correct ($M_c$) and false ($M_f$) predictions for different numbers of inference runs. The y-axis shows the median predictive uncertainty. The x-axis shows the number of inference runs. For salt & pepper noise (left), $M_c$ and $M_f$ increase with the number of inference runs. However, $M_f$ is more stable than $M_c$. The median predictive uncertainties are nicely separated. To identify false predictions better than just guessing, the median predictive uncertainty of false predictions has to be above that of correct predictions. This does not always hold for severely degraded images with additive Gaussian noise (middle) and horizontal blur (right). Here, MC-dropout-0.1 allows better separation than MC-dropout-0.4.
We assume the increase of the median predictive uncertainty with an increasing number of inference runs is linked to the increased accuracy that is observed in Tab. V. Some characters
Authorized licensed use limited to: RVR & JC College of Engineering. Downloaded on January 31,2024 at 04:21:45 UTC from IEEE Xplore. Restrictions apply.
SCHIRRMACHER et al.: BENCHMARKING PROBABILISTIC DEEP LEARNING METHODS FOR LICENSE PLATE RECOGNITION 9213
Fig. 7. Median of the predictive uncertainty values of MC-dropout displayed for different degradation levels with five inference runs. The dashed lines show
the median predictive uncertainty values of falsely classified characters, and the solid lines the median predictive uncertainty values of correctly classified
characters.
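The quantities discussed around Figs. 5 to 7 (median, interquartile range, AUC, and the candidate threshold of 0.3) can be sketched in a few lines. This is a minimal illustration, not the evaluation code of the paper: the function names and the uncertainty values are hypothetical, and the AUC is computed as a pairwise ranking statistic, which is assumed to match the AUC reported in the experiments.

```python
from statistics import median, quantiles


def median_and_iqr(values):
    """Return the median and interquartile range of a list of uncertainty values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


def detection_auc(unc_false, unc_correct):
    """Fraction of (false, correct) pairs in which the false prediction has the
    higher uncertainty; ties count as half. This equals the ROC AUC when the
    uncertainty is used as the score for detecting false predictions."""
    wins = 0.0
    for f in unc_false:
        for c in unc_correct:
            if f > c:
                wins += 1.0
            elif f == c:
                wins += 0.5
    return wins / (len(unc_false) * len(unc_correct))


def flag_unreliable(uncertainties, threshold=0.3):
    """Mark characters whose predictive uncertainty exceeds the threshold."""
    return [u > threshold for u in uncertainties]


# Hypothetical per-character predictive uncertainties (illustrative values only).
unc_correct = [0.05, 0.10, 0.20, 0.35]
unc_false = [0.25, 0.40, 0.60]

m_c, iqr_c = median_and_iqr(unc_correct)
m_f, iqr_f = median_and_iqr(unc_false)
auc = detection_auc(unc_false, unc_correct)
flags = flag_unreliable(unc_correct + unc_false)
print(m_c, iqr_c, m_f, iqr_f, auc, flags)
```

In this toy setting, false predictions are detectable only if the median of the false group lies above that of the correct group, and a narrow interquartile range in both groups makes a fixed threshold more useful.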
predicted wrongly with five inference runs become correct with 50 inference runs. However, the predictive uncertainty for those characters is still high, which increases the overall mean uncertainty.
In addition to widely spaced distributions, narrow distributions are also important to detect misclassifications reliably. Figure 6 visualizes the interquartile range of the predictive uncertainty values of correct (solid) and false (dashed) predictions. The y-axis shows the interquartile range of the predictive uncertainty value. The x-axis shows the number of inference runs. When little salt & pepper noise (left) is present, the interquartile range is smaller compared to severely degraded images with Gaussian noise (middle) and horizontal blur (right). In general, the interquartile range tends to decrease with an increasing number of inference runs. An exception to this behavior is the spread of the predictive uncertainty values of correct predictions from MC-dropout-0.4, which increases.
Besides the number of inference runs, the strength of the degradation influences the detection of false predictions. Figure 7 visualizes the median predictive uncertainty of correct (solid) and false (dashed) predictions of MC-dropout with ten inference runs. The y-axis shows median predictive uncertainty. The x-axis represents the strength of the degradation, with an increase in the degradation from left to right. We choose ten runs as a tradeoff between decreasing spread and increasing median predictive uncertainty of correct predictions. For salt & pepper noise (left) and additive Gaussian noise (middle left), MC-dropout-0.4 has higher predictive uncertainty values of falsely classified characters on higher quality images. However, with increasing degradation strength, the median predictive uncertainty of wrong characters becomes lower than that of correct characters. MC-dropout-0.1 is better suited for strongly degraded images but later indicates a wrong character. Except for horizontal blur (middle right), falsely classified characters’ median predictive uncertainty values are above 0.3. This value can be a potential threshold in the context of ALPR. However, this threshold is not suitable for strongly degraded images.
D. Influence of Super-Resolution on Classification
The generalization of neural networks to unseen degradations can also be addressed with multi-task learning [16], [17], [18]: training multiple tasks on the same input image serves as an inductive bias. In this experiment, we investigate the inductive bias introduced by the super-resolution in combination with license plate recognition. We implement both tasks in the SR2 framework [12]. In contrast to previous work [41], we arrange both tasks in parallel. The parallel arrangement minimizes error propagation and provides better character recognition when faced with out-of-distribution data, as shown by [12].
The experiments support the hypothesis that super-resolution introduces an inductive bias that is beneficial for character recognition. All probabilistic deep learning methods benefit from super-resolution in terms of both accuracy and predictive uncertainty. The performance boost is best seen on blurred images.
1) Character Recognition: Table VII reports the mean character recognition accuracy on the DS-Full dataset. Super-resolution increases the character recognition accuracy on noisy images when salt & pepper noise (left) is present, except for MC-dropout-0.1. MC-dropout-0.4 with both backbones performs best. Additive Gaussian noise (middle left) poses less of a problem for character recognition. The deep ensemble with SR2 as backbone achieves the highest accuracy with 0.91 and 0.99, respectively. All probabilistic deep learning techniques benefit from the inductive bias of
TABLE VII. Accuracy on the DS-Full dataset. Mean character recognition accuracy of the BatchEnsemble, MC-dropout, and the deep ensemble. For each degradation type, the values represent the mean accuracy of all degradation levels. The table compares the LPR CNN to the SR2 framework with wlpr = 20 and wsr = 1.
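The caption of Table VII gives only the two task weights, wlpr = 20 and wsr = 1; how they enter the training objective is not stated in this section. A minimal sketch, assuming the two task losses are combined as a plain weighted sum (the function name and the example loss values are hypothetical):

```python
def combined_loss(loss_lpr, loss_sr, w_lpr=20.0, w_sr=1.0):
    """Weighted sum of the recognition loss and the super-resolution loss.

    Assumption: the multi-task objective is w_lpr * loss_lpr + w_sr * loss_sr.
    The default weights are the values quoted in the caption of Table VII.
    """
    return w_lpr * loss_lpr + w_sr * loss_sr


# Hypothetical per-batch losses of the two parallel task branches.
total = combined_loss(loss_lpr=0.5, loss_sr=2.0)
print(total)
```

Under this assumption, the recognition term dominates the objective, while the super-resolution term acts mainly as a regularizing auxiliary task.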
9214 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 24, NO. 9, SEPTEMBER 2023
(bottom) impede the recognition performance of the network. We can identify the challenging parts of the license plate due to the difference in the predicted characters between inference runs.

V. CONCLUSION
This paper proposes to model uncertainty for the task of license plate recognition explicitly. To the best of our knowledge, this has not been explored yet but offers helpful features for license plate recognition. For example, we demonstrate that the quantification of the prediction uncertainty allows the detection of misclassifications. We identify automatic license plate recognition and forensic license plate recognition as applications that benefit from predictive uncertainty.
We investigate three well-known probabilistic deep learning methods that quantify predictive uncertainty: BatchEnsemble, MC-dropout, and deep ensemble. Two neural network architectures are the backbones of these techniques. A state-of-the-art license plate recognition CNN serves as a baseline backbone. To exploit the benefits of multi-task learning, we combine super-resolution and license plate recognition in the SR2 framework as a second backbone.
License plate recognition in the wild is complex since images stem from various acquisition settings. One must always consider a lower quality of the test data than that of the training data. We propose probabilistic deep learning as a tool to detect when the data and thus the character recognition are less reliable. For this purpose, the models are trained on high-quality images and tested on noisy or blurred lower-quality images. Except for BatchEnsemble, all probabilistic deep learning methods provide reasonable uncertainty estimates even for severely degraded images. Even better results are obtained when license plate recognition is combined with super-resolution in the SR2 framework. The SR2 framework improves both character recognition accuracy and detection of false predictions. For the future, we see super-resolution as a tool to additionally verify the prediction of images with a reduced quality. Here, the predictive uncertainty obtained per pixel can help identify less reliable character predictions of the LPR CNN. The hyperparameter of MC-dropout allows setting a stronger focus on either character recognition performance or reliable detection of false predictions.

REFERENCES
[1] S. M. Silva and C. R. Jung, “License plate detection and recognition in unconstrained scenarios,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2018, pp. 580–596.
[2] L. Zhang, P. Wang, H. Li, Z. Li, C. Shen, and Y. Zhang, “A robust attentional framework for license plate recognition in the wild,” IEEE Trans. Intell. Transp. Syst., vol. 22, no. 11, pp. 6967–6976, Nov. 2021.
[3] P. Kaiser, F. Schirrmacher, B. Lorch, and C. Riess, “Learning to decipher license plates in severely degraded images,” in Proc. Pattern Recognit., ICPR Int. Workshops Challenges, in Lecture Notes in Computer Science. Milan, Italy: Springer, 2021, pp. 544–559.
[4] A. Kendall and Y. Gal, “What uncertainties do we need in Bayesian deep learning for computer vision?” in Proc. Adv. Neural Inf. Process. Syst., Mar. 2017, pp. 5574–5584.
[5] J. Snoek et al., “Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 13969–13980.
[6] A. Maier, B. Lorch, and C. Riess, “Toward reliable models for authenticating multimedia content: Detecting resampling artifacts with Bayesian neural networks,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2020, pp. 1251–1255.
[7] Y. Ovadia et al., “Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift,” in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 13991–14002.
[8] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6405–6416.
[9] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proc. 33rd Int. Conf. Mach. Learn., vol. 48, Jun. 2016, pp. 1050–1059.
[10] Y. Wen, D. Tran, and J. Ba, “BatchEnsemble: An alternative approach to efficient ensemble and lifelong learning,” in Proc. Int. Conf. Learn. Represent., Feb. 2020, pp. 1–20.
[11] B. Lorch, S. Agarwal, and H. Farid, “Forensic reconstruction of severely degraded license plates,” Electron. Imag., vol. 2019, no. 5, p. 529, Jan. 2019.
[12] F. Schirrmacher, B. Lorch, B. Stimpel, T. Köhler, and C. Riess, “SR2: Super-resolution with structure-aware reconstruction,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2020, pp. 533–537.
[13] G. Rossi, M. Fontani, and S. Milani, “Neural network for denoising and reading degraded license plates,” in Proc. Pattern Recognit., ICPR Int. Workshops Challenges (Lecture Notes in Computer Science). Cham, Switzerland: Springer, 2021, pp. 484–499.
[14] M. Zhang, W. Liu, and H. Ma, “Joint license plate super-resolution and recognition in one multi-task GAN framework,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 1443–1447.
[15] Y. Lee, J. Lee, H. Ahn, and M. Jeon, “SNIDER: Single noisy image denoising and rectification for improving license plate recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1017–1026.
[16] R. Caruana, “Multitask learning: A knowledge-based source of inductive bias,” in Proc. 10th Int. Conf. Mach. Learn., Jun. 1993, pp. 41–48.
[17] R. Caruana, “Multitask learning,” Mach. Learn., vol. 28, no. 1, pp. 41–75, 1997.
[18] S. Ruder, “An overview of multi-task learning in deep neural networks,” 2017, arXiv:1706.05098.
[19] A. Villar-Corrales, F. Schirrmacher, and C. Riess, “Deep learning architectural designs for super-resolution of noisy images,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2021, pp. 1635–1639.
[20] J. Shashirangana, H. Padmasiri, D. Meedeniya, and C. Perera, “Automated license plate recognition: A survey on methods and techniques,” IEEE Access, vol. 9, pp. 11203–11225, 2021.
[21] S. Du, M. Ibrahim, M. Shehata, and W. Badawy, “Automatic license plate recognition (ALPR): A state-of-the-art review,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 2, pp. 311–325, Feb. 2013.
[22] C.-N.-E. Anagnostopoulos, I. E. Anagnostopoulos, I. D. Psoroulas, V. Loumos, and E. Kayafas, “License plate recognition from still images and video sequences: A survey,” IEEE Trans. Intell. Transp. Syst., vol. 9, no. 3, pp. 377–391, Sep. 2008.
[23] Y. Wen, Y. Lu, J. Yan, Z. Zhou, K. M. von Deneen, and P. Shi, “An algorithm for license plate recognition applied to intelligent transportation system,” IEEE Trans. Intell. Transp. Syst., vol. 12, no. 3, pp. 830–845, 2011.
[24] L. Liu, H. Zhang, A. Feng, X. Wan, and J. Guo, “Simplified local binary pattern descriptor for character recognition of vehicle license plate,” in Proc. 7th Int. Conf. Comput. Graph., Imag. Vis., Aug. 2010, pp. 157–161.
[25] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[26] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[27] M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and efficient object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10778–10787.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 25, Dec. 2012, pp. 1097–1105.
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[30] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. 36th Int. Conf. Mach. Learn. (ICML), May 2019, pp. 6105–6114.
[31] H. Li, P. Wang, and C. Shen, “Toward end-to-end car license plate detection and recognition with deep neural networks,” IEEE Trans. Intell. Transp. Syst., vol. 20, no. 3, pp. 1126–1136, Mar. 2019.
[32] S. M. Silva and C. R. Jung, “A flexible approach for automatic license plate recognition in unconstrained scenarios,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 5693–5703, Jun. 2022.
[33] H. Zhang, F. Sun, X. Zhang, and L. Zheng, “License plate recognition model based on CNN + LSTM + CTC,” in Proc. Int. Conf. Pioneering Comput. Scientists, Eng. Educators. Guilin, China: Springer, 2019, pp. 657–678.
[34] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2961–2969.
[35] L. Qiao et al., “MANGO: A mask attention guided one-stage scene text spotter,” 2020, arXiv:2012.04350.
[36] S. Agarwal, D. Tran, L. Torresani, and H. Farid, “Deciphering severely degraded license plates,” Electron. Imag., vol. 29, no. 7, pp. 138–143, Jan. 2017.
[37] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2298–2304, Nov. 2017.
[38] P. Shivakumara, D. Tang, M. Asadzadehkaljahi, T. Lu, U. Pal, and M. H. Anisi, “CNN-RNN based method for license plate recognition,” CAAI Trans. Intell. Technol., vol. 3, no. 3, pp. 169–175, Sep. 2018.
[39] B. Suvarnam and V. S. Ch, “Combination of CNN-GRU model to recognize characters of a license plate number without segmentation,” in Proc. 5th Int. Conf. Adv. Comput. Commun. Syst. (ICACCS), Mar. 2019, pp. 317–322.
[40] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. 23rd Int. Conf. Mach. Learn. (ICML), 2006, pp. 369–376.
[41] H. Seibel, S. Goldenstein, and A. Rocha, “Eyes on the target: Super-resolution and license-plate recognition in low-quality surveillance videos,” IEEE Access, vol. 5, pp. 20020–20035, 2017.
[42] C. Henry, S. Y. Ahn, and S. Lee, “Multinational license plate recognition using generalized character sequence detection,” IEEE Access, vol. 8, pp. 35185–35199, 2020.
[43] A. D. Kiureghian and O. Ditlevsen, “Aleatory or epistemic? Does it matter?” Struct. Saf., vol. 31, no. 2, pp. 105–112, Mar. 2009.
[44] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in Proc. 5th Int. Conf. Learn. Represent. (ICLR), Apr. 2017, pp. 1–12.
[45] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in Proc. Int. Conf. Mach. Learn., Aug. 2017, pp. 1321–1330.
[46] G. E. Hinton and D. van Camp, “Keeping the neural networks simple by minimizing the description length of the weights,” in Proc. 6th Annu. Conf. Comput. Learn. Theory (COLT), 1993, pp. 5–13.
[47] A. Graves, “Practical variational inference for neural networks,” in Proc. Adv. Neural Inf. Process. Syst., Dec. 2011, pp. 2348–2356.
[48] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural networks,” in Proc. 32nd Int. Conf. Mach. Learn., Jul. 2015, pp. 1613–1622.
[49] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. 32nd Int. Conf. Mach. Learn., Jul. 2015, pp. 448–456.
[50] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Proc. Eur. Conf. Comput. Vis. (ECCV). Springer, Sep. 2016, pp. 391–407.
[51] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 56, pp. 1929–1958, 2014.
[52] S. Fort, H. Hu, and B. Lakshminarayanan, “Deep ensembles: A loss landscape perspective,” 2019, arXiv:1912.02757.
[53] A. Krogh and J. Vedelsby, “Neural network ensembles, cross validation, and active learning,” in Proc. Adv. Neural Inf. Process. Syst., G. Tesauro, D. S. Touretzky, and T. K. Leen, Eds. Cambridge, MA, USA: MIT Press, 1994, pp. 231–238.
[54] Z. Xu et al., “Towards end-to-end license plate detection and recognition: A large dataset and baseline,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2018, pp. 255–271.

Franziska Schirrmacher received the M.Sc. degree in medical engineering from Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany, in 2017. From 2017 to 2019, she was a Researcher with the Pattern Recognition Laboratory, FAU, where she joined the IT Security Infrastructures Laboratory in 2019. She is currently a part of the Multimedia Security Group. Her research interests include image processing, machine learning, and image forensics.

Benedikt Lorch received the M.Sc. degree in computer science from Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany, in 2018, where he is currently pursuing the Ph.D. degree with the IT Security Infrastructures Laboratory in September 2018. His research interests include image forensics, computer vision, and machine learning.

Anatol Maier received the M.Sc. degree in computer science from Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany, in 2019, where he is currently pursuing the Ph.D. degree with the IT Security Infrastructures Laboratory. He is also a part of the Multimedia Security Group. His research interests include reliable machine learning, deep probabilistic models, and computer vision, with a particular application in image and video forensics.

Christian Riess (Senior Member, IEEE) received the Ph.D. degree in computer science from Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany, in 2012, and the Habilitation degree in X-ray phase contrast imaging in 2020. From 2013 to 2015, he was a Post-Doctoral Researcher with the Radiological Sciences Laboratory, Stanford University, Stanford, CA, USA. Since 2016, he has been a Senior Researcher and he is the Head of the Multimedia Security Group within the IT Security Infrastructures Laboratory. His research interests include image processing, machine learning, and machine learning security, with applications in image and video forensics, color image processing, and image enhancement. He is currently an Associate Editor of IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, a member of the IEEE Information Forensics and Security Technical Committee, and the EURASIP TAC Signal and Data Analytics for Machine Learning.