On The Cross-Dataset Generalization in License Plate Recognition
Keywords: Deep Learning, Leave-one-dataset-out, License Plate Recognition, Optical Character Recognition
Abstract: Automatic License Plate Recognition (ALPR) systems have shown remarkable performance on license
plates (LPs) from multiple regions due to advances in deep learning and the increasing availability of datasets.
The evaluation of deep ALPR systems is usually done within each dataset; therefore, it is questionable whether such
results are a reliable indicator of generalization ability. In this paper, we propose a traditional-split versus
leave-one-dataset-out experimental setup to empirically assess the cross-dataset generalization of 12 Optical
Character Recognition (OCR) models applied to LP recognition on nine publicly available datasets with a great
variety in several aspects (e.g., acquisition settings, image resolution, and LP layouts). We also introduce a
public dataset for end-to-end ALPR that is the first to contain images of vehicles with Mercosur LPs and the
one with the highest number of motorcycle images. The experimental results shed light on the limitations of
the traditional-split protocol for evaluating approaches in the ALPR context, as there are significant drops in
performance for most datasets when training and testing the models in a leave-one-dataset-out fashion.
tance from the vehicle to the camera varies slightly. All images have a resolution of 1,280 × 720 pixels.

An important feature of the proposed dataset is that it has images of two different LP layouts: Brazilian and Mercosur. To maintain consistency with previous works (Izidio et al., 2020; Oliveira et al., 2021; Silva and Jung, 2022), we refer to “Brazilian” as the standard used in Brazil before the adoption of the Mercosur standard. All Brazilian LPs consist of three letters followed by four digits, while the initial pattern adopted in Brazil for Mercosur LPs consists of three letters, one digit, one letter and two digits, in that order. In both layouts, car LPs have seven characters arranged in one row, whereas motorcycle LPs have three characters in one row and four characters in another. Even though these LP layouts are very similar in shape and size, there are considerable differences in their colors and characters’ fonts.

The 20,000 images are divided as follows: 5,000 images of cars with Brazilian LPs; 5,000 images of motorcycles with Brazilian LPs; 5,000 images of cars with Mercosur LPs; and 5,000 images of motorcycles with Mercosur LPs. For the sake of simplicity of definitions, here “car” refers to any vehicle with four wheels or more (e.g., passenger cars, vans, buses, trucks, among others), while “motorcycle” refers to both motorcycles and motorized tricycles. As far as we know, RodoSol-ALPR is the public dataset for ALPR with the highest number of motorcycle images.

We randomly split the RodoSol-ALPR dataset as follows: 8,000 images for training; 8,000 images for testing; and 4,000 images for validation, following the split protocol (i.e., 40%/40%/20%) adopted in the SSIG-SegPlate (Gonçalves et al., 2016) and UFPR-ALPR (Laroca et al., 2018) datasets. We preserved the percentage of samples for each vehicle type and LP layout; for example, there are 2,000 images of cars with Brazilian LPs in each of the training and test sets, and 1,000 images in the validation one. For reproducibility purposes, the subsets generated are explicitly available along with the proposed dataset.

Every image has the following information available in a text file: the vehicle’s type (car or motorcycle), the LP’s layout (Brazilian or Mercosur), its text (e.g., ABC-1234), and the position (x, y) of each of its four corners. We labeled the corners instead of just the LP bounding box to enable the training of methods that explore LP rectification, as well as the application of a wider range of data augmentation techniques.

The datasets for ALPR are generally very unbalanced in terms of character classes due to LP allocation policies (Zhang et al., 2021). In Brazil, for example, one letter can appear much more often than others according to the state in which the LP was issued (Gonçalves et al., 2018; Laroca et al., 2018). This information must be taken into account when training recognition models in order to avoid undesirable biases – this is usually done through data augmentation techniques (Zhang et al., 2021; Hasnat and Nakib, 2021); for example, a network trained exclusively on our dataset may learn to always classify the first character as ‘P’ in cases where it should be ‘B’ or ‘R’, since ‘P’ appears much more often in this position than these two characters (see Figure 2).

Regarding privacy concerns related to our dataset, we remark that in Brazil the LPs are related to the re-
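As a concrete illustration of the two layouts described above, the following minimal sketch (not part of the dataset's tooling; the helper name `lp_layout` is ours) checks whether a seven-character LP string follows the Brazilian or the initial Mercosur pattern:

```python
import re

# Brazilian standard: three letters followed by four digits (e.g., ABC1234).
BRAZILIAN = re.compile(r"[A-Z]{3}[0-9]{4}")
# Initial Mercosur pattern adopted in Brazil: three letters, one digit,
# one letter, two digits (e.g., ABC1D23).
MERCOSUR = re.compile(r"[A-Z]{3}[0-9][A-Z][0-9]{2}")

def lp_layout(text: str) -> str:
    """Return the layout of an LP string, ignoring hyphens and case."""
    text = text.replace("-", "").upper()
    if BRAZILIAN.fullmatch(text):
        return "Brazilian"
    if MERCOSUR.fullmatch(text):
        return "Mercosur"
    return "unknown"

print(lp_layout("ABC-1234"))  # Brazilian
print(lp_layout("ABC1D23"))   # Mercosur
```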
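The per-image annotations can likewise be read with a few lines of code. The sketch below assumes a simple `key: value` text layout and hypothetical field names (`type`, `layout`, `plate`, `corners`); the actual files distributed with RodoSol-ALPR may use different keys, so this is illustrative only:

```python
def load_annotation(path: str) -> dict:
    """Parse one annotation text file (hypothetical 'key: value' layout)."""
    ann = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if ":" not in line:
                continue
            key, value = line.split(":", 1)
            ann[key.strip()] = value.strip()
    # Corners assumed to be stored as "x1,y1 x2,y2 x3,y3 x4,y4".
    if "corners" in ann:
        ann["corners"] = [tuple(map(int, pt.split(",")))
                          for pt in ann["corners"].split()]
    return ann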
Table 1: OCR models explored in our experiments.

Model                              Original Application
Framework: PyTorch
R2AM (Lee and Osindero, 2016)      Scene Text Recognition
RARE (Shi et al., 2016)            Scene Text Recognition
STAR-Net (Liu et al., 2016)        Scene Text Recognition

[Figure residue: per-position character histogram (1st char – 7th char; y-axis: # instances); panel labels (b) EnglishLP, (c) UCSD-Stills.]
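The per-position imbalance summarized in the histogram above can be quantified by simply counting how often each character class occurs at each of the seven LP positions. A minimal sketch, assuming the LP strings have already been collected from the annotation files:

```python
from collections import Counter

def per_position_counts(lp_texts):
    """Count character occurrences at each of the seven LP positions."""
    counters = [Counter() for _ in range(7)]
    for text in lp_texts:
        text = text.replace("-", "").upper()
        for pos, char in enumerate(text[:7]):
            counters[pos][char] += 1
    return counters

# Toy example; on RodoSol-ALPR the first position is dominated by a few letters.
counts = per_position_counts(["ABC-1234", "PQR5678", "PWZ2C46"])
print(counts[0].most_common(2))  # [('P', 2), ('A', 1)]
```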
Table 5: Recognition rates obtained by all models under the traditional-split protocol, which assesses the ability of the models
to perform well in seen scenarios. Each model (rows) was trained once on the union of the training set images from all datasets
and evaluated on the respective test sets (columns). The best recognition rate achieved in each dataset is shown in bold.
Approach / Test set               Caltech Cars   EnglishLP   UCSD-Stills   ChineseLP   AOLP    OpenALPR-EU   SSIG-SegPlate   UFPR-ALPR   RodoSol-ALPR   Average
(# test images)                   46             102         60            161         687     108           804             1,800       8,000
CR-NET (Silva and Jung, 2020) 95.7% 92.2% 100.0% 96.9% 97.7% 97.2% 97.1% 78.3% 55.8%‡ 90.1%
CRNN (Shi et al., 2017) 87.0% 81.4% 88.3% 88.2% 87.6% 89.8% 93.4% 64.9% 48.2% 81.0%
Fast-OCR (Laroca et al., 2021a) 93.5% 81.4% 95.0% 85.1% 95.8% 91.7% 87.1% 65.9% 49.7%‡ 82.8%
GRCNN (Wang and Hu, 2017) 93.5% 87.3% 91.7% 84.5% 85.9% 87.0% 94.3% 63.3% 48.4% 81.7%
Holistic-CNN (Špaňhel et al., 2017) 89.1% 68.6% 88.3% 90.7% 86.3% 78.7% 94.8% 70.3% 49.0% 79.5%
Multi-task (Gonçalves et al., 2019) 87.0% 62.7% 85.0% 86.3% 84.7% 66.7% 93.0% 65.3% 49.1% 75.5%
R2AM (Lee and Osindero, 2016) 84.8% 70.6% 81.7% 87.0% 83.1% 63.9% 92.0% 66.9% 48.6% 75.4%
RARE (Shi et al., 2016) 91.3% 84.3% 90.0% 95.7% 93.4% 91.7% 93.7% 69.0% 51.6% 84.5%
Rosetta (Borisyuk et al., 2018) 87.0% 75.5% 81.7% 90.1% 83.7% 81.5% 94.3% 63.9% 48.7% 78.5%
STAR-Net (Liu et al., 2016) 95.7% 93.1% 96.7% 96.9% 96.8% 95.4% 96.1% 70.9% 51.8% 88.2%
TRBA (Baek et al., 2019) 91.3% 87.3% 96.7% 96.9% 99.0% 93.5% 97.3% 72.9% 59.6% 88.3%
ViTSTR-Base (Atienza, 2021) 84.8% 80.4% 90.0% 99.4% 95.6% 84.3% 96.1% 73.3% 49.3% 83.7%
Average 90.0% 80.4% 90.4% 91.5% 90.8% 85.1% 94.1% 68.7% 50.8% 82.4%
‡ Images from the RodoSol-ALPR dataset were not used for training the CR-NET and Fast-OCR models, as each character’s bounding box needs to be labeled for training them (as detailed in Section 4.1).
Table 6: Recognition rates obtained by all models under the leave-one-dataset-out protocol, which assesses the generalization
performance of the models by testing them on the test set of an unseen dataset. For each dataset (columns), we trained the
recognition models (rows) on all images from the other datasets. The best recognition rates achieved are shown in bold.
Approach / Test set               Caltech Cars   EnglishLP   UCSD-Stills   ChineseLP   AOLP    OpenALPR-EU   SSIG-SegPlate   UFPR-ALPR   RodoSol-ALPR   Average
(# test images)                   46             102         60            161         687     108           804             1,800       8,000
CR-NET (Silva and Jung, 2020) 93.5% 96.1% 96.7% 88.2% 76.9% 96.3% 94.7% 61.8% 45.4% 83.3%
CRNN (Shi et al., 2017) 91.3% 62.7% 75.0% 76.4% 59.4% 88.0% 91.3% 61.7% 38.8% 71.6%
Fast-OCR (Laroca et al., 2021a) 93.5% 91.2% 95.0% 90.1% 77.0% 94.4% 91.2% 53.2% 47.8% 81.5%
GRCNN (Wang and Hu, 2017) 95.7% 65.7% 90.0% 80.7% 53.9% 88.9% 90.3% 60.8% 39.8% 74.0%
Holistic-CNN (Špaňhel et al., 2017) 80.4% 40.2% 73.3% 81.4% 59.7% 83.3% 93.4% 61.8% 33.4% 67.4%
Multi-task (Gonçalves et al., 2019) 82.6% 34.3% 66.7% 77.6% 50.8% 79.6% 89.9% 57.9% 44.8% 64.9%
R2AM (Lee and Osindero, 2016) 89.1% 52.9% 66.7% 74.5% 52.5% 80.6% 93.5% 57.9% 40.7% 67.6%
RARE (Shi et al., 2016) 84.8% 50.0% 85.0% 88.8% 62.9% 91.7% 93.5% 71.3% 40.1% 74.2%
Rosetta (Borisyuk et al., 2018) 89.1% 63.7% 68.3% 83.2% 51.1% 81.5% 94.4% 61.8% 42.5% 70.6%
STAR-Net (Liu et al., 2016) 89.1% 80.4% 91.7% 95.0% 79.3% 93.5% 94.0% 69.1% 43.6% 81.8%
TRBA (Baek et al., 2019) 95.7% 66.7% 93.3% 95.0% 70.0% 92.6% 96.9% 73.2% 42.6% 80.7%
ViTSTR-Base (Atienza, 2021) 89.1% 58.8% 90.0% 95.0% 59.2% 89.8% 97.9% 69.6% 41.7% 76.8%
Average 89.5% 63.6% 82.6% 85.5% 62.7% 88.3% 93.4% 63.3% 41.8% 74.5%
Average (traditional-split protocol) 90.0% 80.4% 90.4% 91.5% 90.8% 85.1%† 94.1% 68.7% 50.8% 82.4%
Sighthound (Masood et al., 2017) 87.0% 94.1% 90.0% 84.5% 79.6% 94.4% 79.2% 52.6% 51.0% 79.2%
OpenALPR (OpenALPR API, 2021)∗ 95.7% 99.0% 96.7% 93.8% 81.1% 99.1% 91.4% 87.8% 70.0% 90.5%
† Under the traditional-split protocol, no images from the OpenALPR-EU dataset were used for training. This is the protocol commonly adopted in the literature (Laroca et al., 2021b; Silva and Jung, 2022).
∗ OpenALPR contains specialized solutions for LPs from different regions and the user must enter the correct region before using its API. Hence, it was expected to achieve better results than the other methods.
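The difference between the two evaluation protocols boils down to how the training pool is assembled for each target dataset: the traditional split trains once on the union of all training sets, whereas leave-one-dataset-out trains on all images from every dataset except the one under evaluation. The sketch below is illustrative only (the dataset dictionaries and file names are placeholders, not the actual loading code used in our experiments):

```python
def traditional_split_pool(datasets):
    """One training pool: the union of every dataset's training subset."""
    return [img for ds in datasets for img in ds["train"]]

def leave_one_dataset_out_pool(datasets, held_out):
    """Train on ALL images (train/val/test) of every dataset except the held-out one."""
    pool = []
    for ds in datasets:
        if ds["name"] == held_out:
            continue
        pool += ds["train"] + ds["val"] + ds["test"]
    return pool

# Toy usage with placeholder file names; evaluation is always on the
# held-out dataset's test subset, so only the training pool changes.
datasets = [
    {"name": "UFPR-ALPR", "train": ["u1.png"], "val": ["u2.png"], "test": ["u3.png"]},
    {"name": "RodoSol-ALPR", "train": ["r1.png"], "val": ["r2.png"], "test": ["r3.png"]},
]
print(len(traditional_split_pool(datasets)))                               # 2
print(len(leave_one_dataset_out_pool(datasets, held_out="RodoSol-ALPR")))  # 3
```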