Philippe Fournier-Viger
Ahmed Hassan
Ladjel Bellatreche (Eds.)
LNCS 13761
Model and
Data Engineering
11th International Conference, MEDI 2022
Cairo, Egypt, November 21–24, 2022
Proceedings
Lecture Notes in Computer Science 13761
Founding Editors
Gerhard Goos
Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis
Cornell University, Ithaca, NY, USA
Editors
Philippe Fournier-Viger
Shenzhen University
Shenzhen, Guangdong, China

Ahmed Hassan
Nile University
Giza, Egypt
Ladjel Bellatreche
ISAE-ENSMA
Poitiers, France
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2023
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, expressed or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
General Chairs
Ahmed Hassan Nile University, Egypt
Ladjel Bellatreche ISAE-ENSMA, France
Workshop Chair
Ahmed Awad Tartu University, Estonia
Proceedings Chair
Walid Al-Atabany Nile University, Egypt
Financial Chair
Hala Zayed Nile University, Egypt
Program Committee
Antonio Corral University of Almeria, Spain
Mamoun Filali-Amine IRIT, France
Flavio Ferrarotti Software Competence Centre Hagenberg, Austria
Sofian Maabout University of Bordeaux, France
Yannis Manolopoulos Open University of Cyprus, Cyprus
Milos Savic University of Novi Sad, Serbia
Alberto Cano Virginia Commonwealth University, USA
Essam Houssein Minia University, Egypt
Moulay Akhloufi Université de Moncton, Canada
Neeraj Singh University of Toulouse, France
Dominique Mery Université de Lorraine, Loria, France
Duy-Tai Dinh Japan Advanced Institute of Science and
Technology, Japan
Giuseppe Polese University of Salerno, Italy
Organization Committee
Mohamed El Helw Nile University, Egypt
Islam Tharwat Nile University, Egypt
Sahar Selim Nile University, Egypt
Passant El Kafrawy Nile University, Egypt
Athman Bouguettaya
Bio: Athman Bouguettaya is Professor and previous Head of School of Computer Sci-
ence, at the University of Sydney, Australia. He was also previously Professor and Head
of School of Computer Science and IT at RMIT University, Melbourne, Australia. He
received his PhD in Computer Science from the University of Colorado at Boulder (USA)
in 1992. He was previously Science Leader in Service Computing at the CSIRO ICT
Centre (now DATA61), Canberra, Australia. Before that, he was a tenured faculty mem-
ber and Program director in the Computer Science department at Virginia Polytechnic
Institute and State University (commonly known as Virginia Tech) (USA). He is a found-
ing member and past President of the Service Science Society, a non-profit organization
that aims at forming a community of service scientists for the advancement of service
science. He is or has been on the editorial boards of several journals including, the IEEE
Transactions on Services Computing, IEEE Transactions on Knowledge and Data Engi-
neering, ACM Transactions on Internet Technology, the International Journal on Next
Generation Computing, VLDB Journal, Distributed and Parallel Databases Journal, and
the International Journal of Cooperative Information Systems. He is also the Editor-in-
Chief of the Springer-Verlag book series on Services Science. He served as a guest editor
of a number of special issues including the special issue of the ACM Transactions on
Internet Technology on Semantic Web services, a special issue of the IEEE Transactions
on Services Computing on Service Query Models, and a special issue of IEEE Internet
Computing on Database Technology on the Web. He was the General Chair of the IEEE
ICWS for 2021 and 2022. He was also General Chair of ICSOC for 2020. He served
as a Program Chair of the 2017 WISE Conference, the 2012 International Conference
on Web and Information System Engineering, the 2009 and 2010 Australasian Database
Conference, 2008 International Conference on Service Oriented Computing (ICSOC)
and the IEEE RIDE Workshop on Web Services for E-Commerce and E-Government
(RIDE-WS-ECEG’04). He also served on the IEEE Fellow Nomination Committee. He
has published more than 300 books, book chapters, and articles in journals and confer-
ences in the area of databases and service computing (e.g., the IEEE Transactions on
Knowledge and Data Engineering, the ACM Transactions on the Web, WWW Journal,
VLDB Journal, SIGMOD, ICDE, VLDB, and EDBT). He was the recipient of several
federally competitive grants in Australia (e.g., ARC), the US (e.g., NSF, NIH), Qatar
(NPRP), the EU (FP7), and China (NSFC). He also won major industry grants from compa-
nies like HP and Sun Microsystems (now Oracle). He is a Fellow of the IEEE, Member of
the Academia Europaea (Honoris Causa) (MAE) (HON), WISE Fellow, AAIA Fellow,
and Distinguished Scientist of the ACM.
Broad and Deep Learning of Big Heterogeneous Health
Data for Medical AI: Opportunities and Challenges
Vincent S. Tseng
His honors include the Research Award (2019 & 2015) from the Ministry of Science and Technology, Taiwan, the 2018 Outstanding I.T. Elite Award, the 2018 FutureTech Breakthrough Award, and the 2014 K. T. Li Breakthrough Award. He is also a Fellow of the IEEE and a Distinguished Member of the ACM.
Contents
Modelling
Database Systems
1 Introduction
Communication methods have undergone significant changes in recent decades due to the rapid development of computer and network technologies. The need for secure communication of media and exchanged information has grown steadily [1]. In particular, image encryption has been the topic of numerous studies aiming to protect users' privacy [2]. The strong correlation and redundancy between neighbouring pixels of an image require devising new encryption schemes rather than relying on typical ones [3].
Chaotic systems are good candidates for image encryption because of their pseudorandomness, initial value sensitivity, parameter sensitivity, and unpredictability, among other qualities, which increase the security level [4–6]. Both Deoxyribonucleic acid (DNA) encoding and Arnold permutation have appeared in recent works as well. In [7], an image encryption algorithm based on a bit-level Arnold transform and hyperchaotic maps was proposed. The algorithm divides the grayscale image into 8 binary images. Then, a chaotic sequence is used to shift the images. Afterwards, the Arnold transform is applied. Finally, image diffusion is applied using the hyperchaotic map. The system requires image division, which increases its complexity and may hinder its optimization for a practical hardware design. In [6], Luo et al. used double chaotic systems, where a two-dimensional Baker chaotic map is used to set the state variables and system parameters of the logistic chaotic map. In [8], Ismail et al. developed a generalised double-humped logistic map, which is used in grayscale image encryption. In [9], a chaotic system and a true random number generator were utilized for image encryption. The presence of both the chaotic system and the true random number generator increases the system's complexity, making it less suitable for hardware implementation. In [1], a plaintext-related encryption scheme that utilises two chaotic systems and DNA manipulation was presented. The system depends on the values of some pixels for the encryption process, which threatens image restoration if those pixels are changed.
This paper proposes an image encryption algorithm that uses the hyperchaotic Lorenz system, an optimized DNA manipulation scheme, and a new method for applying the Arnold transform that is more suitable for encryption applications. The rest of the paper is organized as follows: Sect. 2 provides a brief explanation of the utilized methods. Section 3 demonstrates the proposed encryption and decryption algorithms. Section 4 validates their performance. Finally, Sect. 5 concludes the work.
2 Preliminaries
Generally, encryption systems require a source of randomness that can be regen-
erated in the decryption process. This section explains the main sources of ran-
domness that are employed in the proposed scheme.
DNA coding [11] is used to change the bit values according to some set of rules.
This is done to enhance the security of the algorithm. DNA consists of 4 bases,
which are Adenine (A), Thymine (T), Cytosine (C), and Guanine (G). The
relation between these bases is that ‘A’ is complementary to ‘T’ and ‘G’ is
complementary to ‘C’. Table 1 shows the used binary code for each DNA base.
Based on these relations, we can apply rules to manipulate the data as long
as the relation between these bases does not change. Table 2 shows the list of
all possible rules that are used in the encryption algorithm, where a random
number is used to select the rule and then the two input bits are replaced with
the corresponding DNA base. For example, if the chosen rule is 6 and the input
is ‘T’, then the output will be ‘C’, which is equal to ‘11’.
Table 3 shows the results of DNA addition and subtraction, which can be
done using simple operations on the DNA bases if the binary representation of
Table 1 is used. The DNA sequence has a cyclic behavior, where each base is
repeated every 4 cycles (i.e., T, C, G, A, T, C, . . . ). This enables performing
‘DNA cycling’ by dividing the number of cycles by 4 and then referring to Table 4.
Table 2. DNA encoding rules (output base for each input base under rules 1–8):

Input | Rule: 1  2  3  4  5  6  7  8
  A   |       A  C  C  T  T  G  A  G
  T   |       T  G  G  A  A  C  T  C
  G   |       G  T  A  G  C  A  C  T
  C   |       C  A  T  C  G  T  G  A

Table 3. DNA addition (left) and subtraction (right):

 +  | G  A  T  C        −  | G  A  T  C
 G  | G  A  T  C        G  | G  C  T  A
 A  | A  T  C  G        A  | A  G  C  T
 T  | T  C  G  A        T  | T  A  G  C
 C  | C  G  A  T        C  | C  T  A  G

Table 4. DNA cycling:

Number of cycles | T  C  G  A
      4n+0       | T  C  G  A
      4n+1       | C  G  A  T
      4n+2       | G  A  T  C
      4n+3       | A  T  C  G
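The rule table and the modular arithmetic behind Tables 2, 3 and 4 can be captured in a few lines of code. The sketch below is illustrative, not the authors' implementation; the 2-bit codes G=00, A=01, T=10, C=11 are inferred from the addition table and the worked example in the text (C = '11'), and may differ from the exact assignment in Table 1.

```python
BITS = {'G': 0, 'A': 1, 'T': 2, 'C': 3}   # 2-bit codes inferred from Table 3
BASES = {v: k for k, v in BITS.items()}

# Table 2: output base for each input base under rules 1..8.
RULES = {
    'A': 'ACCTTGAG',
    'T': 'TGGAACTC',
    'G': 'GTAGCACT',
    'C': 'CATCGTGA',
}

def dna_encode(base, rule):
    """Substitute a base (i.e. its two input bits) according to rule 1..8."""
    return RULES[base][rule - 1]

def dna_add(a, b):
    """DNA addition of Table 3: modulo-4 addition of the 2-bit codes."""
    return BASES[(BITS[a] + BITS[b]) % 4]

def dna_sub(a, b):
    """DNA subtraction of Table 3: modulo-4 subtraction of the 2-bit codes."""
    return BASES[(BITS[a] - BITS[b]) % 4]

def dna_cycle(base, n_cycles):
    """DNA cycling of Table 4: shift a base n_cycles steps along T -> C -> G -> A."""
    return BASES[(BITS[base] + n_cycles) % 4]

# Example from the text: rule 6 applied to 'T' yields 'C'.
assert dna_encode('T', 6) == 'C'
```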
Owing to this periodicity, we propose a modified Arnold transform, where the image will be permuted for any chosen number of cycles.
3 Proposed Algorithm
The proposed algorithm for encryption and decryption is shown in Fig. 2. First, the proposed modified Arnold transform is explained, followed by the encryption and decryption processes.
To guarantee image permutation for any number of cycles, the number of cycles (Cyc) of the Arnold transform must not equal 0 or P. Hence, we apply the following equation:

G = mod(Cyc, P − 2) + 1.   (3)

This makes the effective number of cycles G lie in the range 1 to (P − 1), which avoids these two cases and eliminates the chance of periodicity.
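As an illustration of Eq. (3), the sketch below applies the standard two-dimensional Arnold cat map for the effective number of cycles G. The transform matrix and the brute-force computation of the period P are assumptions; the exact Arnold variant of [12] used in the paper is not reproduced here.

```python
import numpy as np

def arnold_once(img):
    """One iteration of the standard Arnold cat map on a square image."""
    n = img.shape[0]
    out = np.empty_like(img)
    for x in range(n):
        for y in range(n):
            out[(x + y) % n, (x + 2 * y) % n] = img[x, y]
    return out

def arnold_period(n):
    """Period P for size n, found by iterating an index grid until it repeats."""
    idx = np.arange(n * n).reshape(n, n)
    cur, p = arnold_once(idx), 1
    while not np.array_equal(cur, idx):
        cur, p = arnold_once(cur), p + 1
    return p

def modified_arnold(img, cyc):
    """Eq. (3): effective cycles G = mod(Cyc, P - 2) + 1, so G is never 0 or P."""
    p = arnold_period(img.shape[0])
    g = cyc % (p - 2) + 1
    for _ in range(g):
        img = arnold_once(img)
    return img
```

For decryption, the permutation can be undone either by applying the inverse map G times or by continuing the forward map for the remaining P − G cycles.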
Step 1: The 4 input sub keys (K1, K2, K3, and K4) are converted from hexadecimal to decimal representation to set the initial state of each variable of the hyperchaotic Lorenz system (1), x0, y0, z0, and w0. To keep the initial conditions bounded within the chaotic system's basin of attraction, they are computed as:

x0 = K1/(A/40) − 20,     (4a)
y0 = K2/(A/40) − 20,     (4b)
z0 = K3/(A/50),          (4c)
w0 = K4/(A/200) − 100,   (4d)

where A = 2^52. Then, the 4 chaotic sequences x, y, z and w are generated with length equal to M^2 + 1000.
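Following Eq. (4), with A = 2^52 matching the 52-bit sub keys used in Sect. 4, the key-to-initial-state mapping can be sketched as follows; this is an illustrative sketch, not the authors' implementation, and the helper name is hypothetical.

```python
A = 2 ** 52  # each sub key is a 52-bit value

def keys_to_initial_state(k1_hex, k2_hex, k3_hex, k4_hex):
    """Map the four hexadecimal sub keys to the initial state (x0, y0, z0, w0)
    of the hyperchaotic Lorenz system, following Eq. (4)."""
    k1, k2, k3, k4 = (int(k, 16) for k in (k1_hex, k2_hex, k3_hex, k4_hex))
    x0 = k1 / (A / 40) - 20     # bounded in [-20, 20)
    y0 = k2 / (A / 40) - 20     # bounded in [-20, 20)
    z0 = k3 / (A / 50)          # bounded in [0, 50)
    w0 = k4 / (A / 200) - 100   # bounded in [-100, 100)
    return x0, y0, z0, w0

# Sub keys listed in the key-space analysis of Sect. 4.
print(keys_to_initial_state('FF123FF0567EF', 'F655FF000FFFF',
                            'FFAB0957FFFFF', '46FF0108F214F'))
```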
Step 2: The first 1000 iterations are removed from the four chaotic sequences
to generate Xh , Yh , Zh and Wh . Then, the vectors U1 , U2 , U3 , U4 , U5 , and U6
are generated by the following equations:
U1 = mod(⌈Xh × 10^13⌉, 8) + 1,            (5a)
U2 = mod(⌈(U − ⌊U⌋) × 10^13⌉, M^2),       (5b)
U3 = mod(⌈Wh × 10^13⌉, 8) + 1,            (5c)
U4 = mod(⌈(Xh + Yh) × 10^13⌉, 256) + 1,   (5d)
U5 = mod(⌈Yh × 10^13⌉, 8) + 1,            (5e)
U6 = mod(⌈(Wh + Zh) × 10^13⌉, 256),       (5f)

where ⌈·⌉ is the ceiling operator and U = [Xh, Yh, Zh, Wh].
Step 3: U1 is used to select the DNA rule to encode the input image according
to Table 2.
Step 4: U2 is used to perform DNA cycling on S1 . The result of mod(U2 , 4)
chooses how many times the data is shifted according to Table 4.
Step 5: U3 is used to DNA encode U4 to generate Q. Then, according to Table 3, the following equations are applied on S2:

q = Q(1) − Q(M^2),                   (6a)
S3(1) = S2(1) + Q(1) + q,            (6b)
S3(i) = S2(i − 1) + S2(i) + Q(i).    (6c)
Step 6: U5 is used to select the rule for DNA decoding for S3 according to
Table 2.
Step 7: Every byte of S4 is accumulated to calculate 'datasum'. The proposed modified Arnold transform (3) is applied on S4 to generate S5, where Cyc = datasum.
Step 8: S5 is then XORed with U6 to generate the encrypted image.
4 Performance Evaluation
The proposed system is tested using the gray-scale ‘Lena’ (256 × 256), ‘Baboon’
(512 × 512), and ‘Pepper’ (512 × 512) images.
MSE = (1/M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} (O_{i,j} − E_{i,j})^2,   (8a)

PSNR = 10 log10((2^n − 1)^2 / MSE),   (8b)
where O_{i,j} and E_{i,j} are the original and encrypted images at position (i, j), respectively, and n is the number of bits per pixel. MSE and PSNR ∈ [0, ∞], where high MSE and low PSNR values indicate a huge difference between the original and encrypted images. Table 6 shows that the proposed system gives MSE and PSNR values similar to those of other works.
ρ = Cov(x, y) / (D(x) D(y)),   (9a)

Cov(x, y) = (1/M^2) Σ_{i=1}^{M^2} (x_i − (1/M^2) Σ_{j=1}^{M^2} x_j)(y_i − (1/M^2) Σ_{j=1}^{M^2} y_j),   (9b)

D(x) = √[ (1/M^2) Σ_{i=1}^{M^2} (x_i − (1/M^2) Σ_{j=1}^{M^2} x_j)^2 ],   (9c)

D(y) = √[ (1/M^2) Σ_{i=1}^{M^2} (y_i − (1/M^2) Σ_{j=1}^{M^2} y_j)^2 ],   (9d)
Fig. 3. Histogram of the original image (left), and encrypted image (right) for Lena,
Baboon, and Pepper in (a), (b), and (c), respectively.
where Cov(x, y) is the covariance between pixels x and y, and D is the standard
deviation. The values of the correlation coefficients for the encrypted images must
be close to 0, which means that even the neighbouring pixels are uncorrelated.
The results in Table 6 show that the correlation coefficients are close to 0 and
Fig. 4. (a) Horizontal, (b) vertical, and (c) diagonal correlation of Baboon image (left)
and encrypted Baboon image (right).
comparable to other works. Figure 4 further indicates that the original image
pixel values are grouped in a region, which shows that they are correlated. On
the contrary, the encrypted image pixel values are spread all over.
The information entropy is computed as

H = − Σ_{i=0}^{2^n − 1} P(i) log2 P(i),   (10)

where P(i) is the probability of occurrence of gray level i. For an 8-bit image, the ideal value is 8, which means that the information is distributed uniformly over all pixel values. The results in Table 6 show that the entropy of the encrypted images successfully approaches 8.
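A minimal NumPy sketch of this entropy measure, assuming an 8-bit grayscale image stored as an integer array:

```python
import numpy as np

def shannon_entropy(img):
    """Shannon entropy of an 8-bit grayscale image; ideally close to 8 bits
    for a well-encrypted image."""
    counts = np.bincount(img.ravel(), minlength=256)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))
```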
The proposed system has a total of 4 sub keys, each represented by 52 bits, where K1 = (FF123FF0567EF)_16, K2 = (F655FF000FFFF)_16, K3 = (FFAB0957FFFFF)_16 and K4 = (46FF0108F214F)_16 are the values of the sub keys used. This results in a key space equal to 2^208 ≈ 10^63, which is large enough to resist brute force attacks [1,15]. In addition, the key must have high sensitivity such that any slight change in the decryption key (a single bit) prevents recovering the original image. Figure 5 shows the original image of 'Baboon' and the wrongly decrypted image obtained when changing the least significant bit of the first sub key.
Fig. 5. Original Baboon image (left) and wrong decrypted image (right).
This test is done by changing the least significant bit of a random pixel in
the original image and comparing the newly encrypted image to the original
encrypted image using Number of Pixels Change Rate (NPCR) and Unified
Average Changing Intensity (UACI) [16], which are given by:
NPCR = (1/M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} DE(i, j) × 100%,   (11a)

UACI = (1/M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} (|E1(i, j) − E2(i, j)| / 255) × 100%,   (11b)

where E1 and E2 are the encrypted images before and after the single-bit change, and DE(i, j) = 1 if E1(i, j) ≠ E2(i, j) and 0 otherwise.
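Both differential metrics follow directly from Eq. (11); a minimal NumPy sketch, assuming e1 and e2 are the two 8-bit encrypted images:

```python
import numpy as np

def npcr_uaci(e1, e2):
    """NPCR and UACI between two encrypted images that differ by a
    single-bit change in the plain image (Eq. 11)."""
    diff = (e1 != e2)
    npcr = diff.mean() * 100.0
    uaci = (np.abs(e1.astype(float) - e2.astype(float)) / 255.0).mean() * 100.0
    return npcr, uaci
```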
5 Conclusion
This paper presented an encryption algorithm that utilizes a hyperchaotic system, DNA manipulation, and a modified Arnold transform. The modified Arnold transform enhances the encryption process by eliminating the cases in which pixel permutation is cancelled. The performance evaluation of the proposed system shows that it is reliable for image encryption compared to recent similar schemes. The design is simple and amenable to hardware realization for real-life applications. For future work, the scheme can be applied to colored images by encrypting each channel separately rather than grayscale images only.
References
1. Li, M., Wang, M., Fan, H., An, K., Liu, G.: A novel plaintext-related chaotic
image encryption scheme with no additional plaintext information. Chaos, Solitons
Fractals 158, 111989 (2022)
2. Xian, Y., Wang, X.: Fractal sorting matrix and its application on chaotic image
encryption. Inf. Sci. 547, 1154–1169 (2021)
3. Li, T., Shi, J., Li, X., Wu, J., Pan, F.: Image encryption based on pixel-level
diffusion with dynamic filtering and DNA-level permutation with 3D Latin cubes.
Entropy 21(3), 319 (2019)
4. Alawida, M., Samsudin, A., Teh, J.S., Alkhawaldeh, R.S.: A new hybrid digital
chaotic system with applications in image encryption. Sig. Process. 160, 45–58
(2019)
5. Belazi, A., Abd El-Latif, A.A., Belghith, S.: A novel image encryption scheme
based on substitution-permutation network and chaos. Sig. Process. 128, 155–170
(2016)
6. Luo, Y., Yu, J., Lai, W., Liu, L.: A novel chaotic image encryption algorithm
based on improved baker map and logistic map. Multimed. Tools Appl. 78(15),
22023–22043 (2019). https://doi.org/10.1007/s11042-019-7453-3
7. Ni, Z., Kang, X., Wang, L.: A novel image encryption algorithm based on bit-level
improved Arnold transform and hyper chaotic map. In: 2016 IEEE International
Conference on Signal and Image Processing (ICSIP), pp. 156–160. IEEE (2016)
8. Ismail, S.M., Said, L.A., Radwan, A.G., Madian, A.H., Abu-Elyazeed, M.F.: Gen-
eralized double-humped logistic map-based medical image encryption. J. Adv. Res.
10, 85–98 (2018)
9. Zhou, S., Wang, X., Zhang, Y., Ge, B., Wang, M., Gao, S.: A novel image encryp-
tion cryptosystem based on true random numbers and chaotic systems. Multimed.
Syst. 28(1), 95–112 (2022). https://doi.org/10.1007/s00530-021-00803-8
10. Wang, X., Wang, M.: A hyperchaos generated from Lorenz system. Phys. A
387(14), 3751–3758 (2008)
11. Wu, J., Liao, X., Yang, B.: Image encryption using 2D Hénon-Sine map and DNA
approach. Sig. Process. 153, 11–23 (2018)
12. Wu, L., Zhang, J., Deng, W., He, D.: Arnold transformation algorithm and anti-
Arnold transformation algorithm. In: 2009 First International Conference on Infor-
mation Science and Engineering, pp. 1164–1167. IEEE (2009)
13. Mehra, I., Nishchal, N.K.: Optical asymmetric image encryption using gyrator
wavelet transform. Opt. Commun. 354, 344–352 (2015)
14. Kaur, M., Kumar, V.: A comprehensive review on image encryption techniques.
Arch. Comput. Methods Eng. 27(1), 15–43 (2020). https://doi.org/10.1007/
s11831-018-9298-8
15. Ghebleh, M., Kanso, A., Noura, H.: An image encryption scheme based on irregu-
larly decimated chaotic maps. Sig. Process. Image Commun. 29(5), 618–627 (2014)
16. Wu, Y., Noonan, J.P., Agaian, S., et al.: NPCR and UACI randomness tests for
image encryption. Cyber J. Multidisc. J. Sci. Technol., J. Sel. Areas Telecommun.
(JSAT) 1(2), 31–38 (2011)
17. Alghafis, A., Munir, N., Khan, M., Hussain, I.: An encryption scheme based on
discrete quantum map and continuous chaotic system. Int. J. Theor. Phys. 59(4),
1227–1240 (2020). https://doi.org/10.1007/s10773-020-04402-7
Rice Plant Disease Detection
and Diagnosis Using Deep Convolutional
Neural Networks and Multispectral
Imaging
1 Introduction
Rice is important to the Egyptian agriculture sector, as Egypt is the largest rice producer in Africa. The total area used for rice cultivation in Egypt is about 600 thousand ha, or approximately 22% of all cultivated area in Egypt during the summer. As a result, it is critical to address the causes of rice production loss to minimize the gap between supply and consumption. Rice plant diseases contribute greatly to this loss, especially rice blast disease. According to [9], rice blast disease causes 30% of the total worldwide loss of rice production. Thus, detecting rice crop diseases, mainly rice blast disease, in the early stages can play a great role in restraining rice production loss.
Early detection of rice crop diseases is a challenging task. One of the main challenges of early detection of such disease is that it can be misclassified as the
brown spot disease by less experienced agricultural extension officers (as both are fungal diseases and have similar appearances in their early stage), which can lead to wrong treatment. Given the current scarcity of experienced extension officers in the country, there is a pressing need and opportunity for utilising recent technological advances in imaging modalities and computer vision/artificial intelligence to help in the early diagnosis of rice blast disease. Recently, multispectral photography has been deployed in agricultural tasks such as precision agriculture [3] and food safety evaluation [11]. Multispectral cameras can capture images in the Red, Red-Edge, Green and Near-Infrared wavebands, capturing what the naked eye cannot see. Integrating multispectral technology with deep learning approaches would improve crop disease identification capability; however, this requires collecting multispectral images in large numbers.
In this paper, we present a public multispectral and RGB image dataset and a deep learning pipeline for rice plant disease detection. First, the dataset contains 3815 pairs of multispectral and RGB images of rice crop blast, brown spot and healthy leaves. Second, we developed a deep learning pipeline trained on our dataset which calculates the Normalised Difference Vegetation Index (NDVI) channel from the multispectral image channels and concatenates it along its RGB image channels. We show that using NDVI+RGB as input achieves a higher F1 score by 1% compared to using RGB input only.
2 Literature Review
Deep learning has emerged to tackle problems in different tasks and fields. Nowadays, it is being adopted to solve the challenge of crop disease identification. For example, Mohanty et al. [8] trained a deep learning model to classify plant crop type and its disease based on images. Furthermore, [1] proposed a deep learning-based approach for banana leaf disease classification.
Furthermore, multispectral sensors have proven their capability as a new modality to detect crop field issues and diseases. Some approaches use multispectral images for disease detection and quantification. Cui et al. [4] developed an image processing-based method for quantitatively detecting soybean rust severity using multispectral images. Also, [12] utilizes digital and multispectral images captured using quadrotor unmanned aerial vehicles (UAVs) to collect high-spatial-resolution imagery data to detect the ShB disease in rice.
After the reliable and outstanding results deep learning models could achieve on RGB images, some approaches were developed to use deep learning on multispectral images, especially of crops and plants. [10] proposed a deep learning-based approach for weed detection in lettuce crops trained on multispectral images. In addition, Ampatzidis et al. [2] collect multispectral images of citrus fields using UAVs for crop phenotyping and deploy a deep learning detection model to identify trees.
3 Methodology
3.1 Hardware Components
An Android frontend application was also developed to enable the officers who collect the dataset to control the multispectral and smartphone cameras, capturing dual R-G-NIR/RGB images simultaneously while providing features such as image labelling, imaging session management, and geo-tagging. The mobile application is developed with Flutter and uses the Firebase real-time database to store and synchronise the captured data, including photos and metadata. Furthermore, a Hive local storage database is used within the application to maintain a local backup of the data.
Our engine is based on the ResNet18 [6] architecture, which consists of 18 layers and utilizes the power of residual networks (see Fig. 3); residual networks help us avoid the vanishing gradient problem.
We can see how layers are configured in the ResNet-18 architecture. The architecture starts with a convolution layer with a 7 × 7 kernel size and a stride of 2. Next we begin with the skip connections. The input from here is added to the output that is achieved by a 3 × 3 max pool layer and two convolution layers with kernel size 3 × 3, 64 kernels each. This is the first residual block.
The output of this residual block is added to the output of two convolution layers with kernel size 3 × 3 and 128 such filters. This constitutes the second residual block. Then the third residual block involves the output of the second block through a skip connection and the output of two convolution layers with filter size 3 × 3 and 256 such filters. The fourth and final residual block involves the output of the third block through skip connections and the output of two convolution layers with the same filter size of 3 × 3 and 512 such filters.
Finally, average pooling is applied on the output of the final residual block and the received feature map is given to the fully connected layer followed by a softmax function to receive the final output.
The vanishing gradient is a problem which happens when training artificial neural networks with gradient-based learning and backpropagation. We use gradients to update the weights in a network, but sometimes the gradient becomes very small, effectively preventing the weights from being updated. This causes the network to stop training. To solve this problem, residual neural networks are used.
Residual neural networks are a type of neural network that applies identity mapping: the input to some layer is passed directly, as a shortcut, to some other layer. If x is the input, in our case an image or a feature map, and F(x) is the output from the layer, then the output of the residual block can be given as F(x) + x, as shown in Fig. 4.
We changed the input shape to be 256 × 256 instead of 224 × 224, and we replaced the last layer in the original architecture with a fully connected layer whose output size was modified to three to accommodate our task labels.
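A minimal PyTorch sketch of these modifications is shown below. Replacing the final fully connected layer with a three-class head follows the text; widening the first convolution to accept a fourth (NDVI) channel is an assumption about how the RGB+NDVI input described later could be fed to the network.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_model(in_channels=4, num_classes=3):
    model = models.resnet18(weights=None)
    if in_channels != 3:
        # Accept RGB+NDVI input instead of plain RGB (assumed adaptation).
        model.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                                padding=3, bias=False)
    # Three output classes: blast, brown spot, healthy.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_model()
logits = model(torch.randn(1, 4, 256, 256))   # 256 x 256 inputs as in the paper
print(logits.shape)                           # torch.Size([1, 3])
```

Note that PyTorch's cross-entropy loss applies the softmax mentioned above internally, so the head outputs raw logits during training.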
4 Experimental Evaluation
4.1 Dataset
We have collected 3815 samples of rice crops with three labels: blast disease, brown spot disease and healthy leaves, distributed as shown in Fig. 5 with 2135, 1095 and 585 samples, respectively. Each sample is composed of a pair of RGB and R-G-NIR images, as seen in Fig. 6, which were captured simultaneously. Figure 7 shows samples of the three classes in our dataset.
Fig. 6. On the left is the RGB image and on the right is its R-G-NIR pair.
Fig. 7. (a) Blast class sample. (b) Brown spot class sample. (c) Healthy class sample.
In this section, we explain our pipeline for training data preparation and preprocessing. We also describe our deep learning model's training configuration in terms of loss functions and hyperparameters.
Data Preparation
RGB Image Registration. Since each image sample of our collected dataset consists of a pair of RGB and R-G-NIR images, the two images are expected to have a similar field of view. However, the phone and MAPIR cameras have different field-of-view parameters: the MAPIR camera has a 41° FOV compared to the phone camera's 123° FOV. As a result, we register the RGB image to the R-G-NIR image using the OpenCV library. The registration task starts by applying an ORB detector over the two images to extract 10K features. Next, we use a brute-force matcher with Hamming distance between the two images' extracted features. Based on the calculated distances of the matches, we sort them and drop the worst 10%. Finally, the homography matrix is calculated using the matched points in the two images and applied to the RGB image. Figure 8 shows an RGB image before and after registration.
Fig. 8. On the left is an RGB image before registration and on the right is the same image after registration.
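The registration steps described above map closely onto OpenCV primitives. The sketch below is an illustrative reimplementation, not the authors' code; it keeps the best 90% of the Hamming-distance matches and warps the RGB image onto the R-G-NIR frame.

```python
import cv2
import numpy as np

def register_rgb_to_rgnir(rgb, rgnir, n_features=10000, keep_ratio=0.9):
    """Register the wide-FOV RGB image onto the R-G-NIR image using ORB
    features, brute-force Hamming matching and a RANSAC homography."""
    orb = cv2.ORB_create(nfeatures=n_features)
    g1 = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(rgnir, cv2.COLOR_BGR2GRAY)
    k1, d1 = orb.detectAndCompute(g1, None)
    k2, d2 = orb.detectAndCompute(g2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)
    matches = matches[:int(len(matches) * keep_ratio)]   # drop the worst 10%

    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    h, w = rgnir.shape[:2]
    return cv2.warpPerspective(rgb, H, (w, h))
```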
MAPIR Camera Calibration. The MAPIR camera sensor captures the reflected light in the visible and near-infrared spectrum, with wavelengths from about 400–1100 nm, and saves the percentage of reflectance. After this step, calibration of each pixel is applied to ensure that it is correct. This calibration is performed before every round of image capture using the MAPIR Camera Reflectance Calibration Ground Target board, which consists of 4 targets with known reflectance values, as shown in Fig. 9.
Results. For training the deep learning model using RGB and R-G-NIR pairs, we generate an NDVI channel, using Eq. 1, and concatenate it to the RGB image. Our study shows that incorporating the NDVI channel improves the model's capability to classify rice crop diseases. Our model achieves an F1 score with 5-fold cross-validation of 84.9% when using RGB+NDVI as input, compared to an F1 score of 83.9% when using only the RGB image. Detailed results are presented in Table 1.

NDVI = (NIR − Red) / (NIR + Red)   (1)
Table 1. F1 score over our collected dataset achieved by using RGB as input versus
RGB+NDVI.
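A sketch of the NDVI computation and channel concatenation is given below; the R-G-NIR channel order and the rescaling of NDVI into [0, 1] are assumptions, since the paper does not spell them out.

```python
import numpy as np

def rgb_plus_ndvi(rgb, rgnir, eps=1e-6):
    """Compute the NDVI channel (Eq. 1) from the registered R-G-NIR image and
    stack it onto the RGB channels to form a 4-channel network input."""
    red = rgnir[..., 0].astype(np.float32)   # assumed channel order: R, G, NIR
    nir = rgnir[..., 2].astype(np.float32)
    ndvi = (nir - red) / (nir + red + eps)   # values in [-1, 1]
    ndvi = (ndvi + 1.0) / 2.0                # rescaled to [0, 1]
    rgb = rgb.astype(np.float32) / 255.0
    return np.dstack([rgb, ndvi])            # H x W x 4
```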
5 Conclusion
We presented our public dataset and deep learning pipeline for rice plant disease detection. We showed that employing multispectral imagery with RGB improves the model's disease identification capability by 1% compared to using solely RGB imagery. We believe that training on a larger number of images, and combining a larger dataset with a deeper model, would further improve the current results. In addition, further investigation into how to fuse multispectral imagery with RGB for training could be carried out; for example, calculating the NDVI from the blue channel instead of the red channel may also boost the model's performance.
Acknowledgements. This work has been done with the support of Data Science Africa.
References
1. Amara, J., Bouaziz, B., Algergawy, A.: A deep learning-based approach for banana
leaf diseases classification. Datenbanksysteme für Business, Technologie und Web
(BTW 2017)-Workshopband (2017)
2. Ampatzidis, Y., Partel, V.: UAV-based high throughput phenotyping in citrus
utilizing multispectral imaging and artificial intelligence. Remote Sens. 11(4), 410
(2019)
3. Candiago, S., Remondino, F., De Giglio, M., Dubbini, M., Gattelli, M.: Evaluating
multispectral images and vegetation indices for precision farming applications from
UAV images. Remote Sens. 7(4), 4026–4047 (2015)
4. Cui, D., Zhang, Q., Li, M., Hartman, G.L., Zhao, Y.: Image processing methods
for quantitatively detecting soybean rust from multispectral images. Biosys. Eng.
107(3), 186–193 (2010)
5. Elbasiouny, H., Elbehiry, F.: Rice production in Egypt: the challenges of climate
change and water deficiency. In: Ewis Omran, E.-S., Negm, A.M. (eds.) Climate
Change Impacts on Agriculture and Food Security in Egypt. SW, pp. 295–319.
Springer, Cham (2020). https://doi.org/10.1007/978-3-030-41629-4 14
6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
7. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts.
arXiv preprint arXiv:1608.03983 (2016)
8. Mohanty, S.P., Hughes, D.P., Salathé, M.: Using deep learning for image-based
plant disease detection. Front. Plant Sci. 7, 1419 (2016)
9. Nalley, L., Tsiboe, F., Durand-Morat, A., Shew, A., Thoma, G.: Economic and
environmental impact of rice blast pathogen (Magnaporthe oryzae) alleviation in
the United States. PLoS ONE 11(12), e0167295 (2016)
10. Osorio, K., Puerto, A., Pedraza, C., Jamaica, D., Rodrı́guez, L.: A deep learning
approach for weed detection in lettuce crops using multispectral images. AgriEngi-
neering 2(3), 471–488 (2020)
11. Qin, J., Chao, K., Kim, M.S., Lu, R., Burks, T.F.: Hyperspectral and
multispectral imaging for evaluating food safety and quality. J. Food Eng.
118(2), 157–171 (2013). https://doi.org/10.1016/j.jfoodeng.2013.04.001, https://
www.sciencedirect.com/science/article/pii/S0260877413001659
12. Zhang, D., Zhou, X., Zhang, J., Lan, Y., Xu, C., Liang, D.: Detection of rice
sheath blight using an unmanned aerial system with high-resolution color and
multispectral imaging. PLoS ONE 13(5), e0187470 (2018)
A Novel Diagnostic Model for Early Detection
of Alzheimer’s Disease Based on Clinical
and Neuroimaging Features
1 Introduction
Alzheimer's disease (AD) is an extremely dangerous disease known for its erosion of memory and destruction of the brain [1]. The disease itself is the degradation of grey matter and memory function in the brain, causing a person with Alzheimer's disease to exhibit "abnormal behavior" compared to their normal self. Grey matter degradation leads to mild cognitive impairment (MCI) on several levels, ranging from the inability to concentrate and short-term memory loss, if the disease is in its early stages, to complete personality distortion and the halting of response to external stimuli. Alzheimer's disease has existed throughout human history, mainly known for its effect on the elderly in terms of cognitive impairment, dementia, and other symptoms such as violent outbreaks and severe short- and long-term memory loss [2]. Many of the disease's causes, such as head injuries from contact sports and ageing, lead to the deterioration of gray matter in the brain. AD occurs in old age due to a lack of neurogenesis, or the creation of new brain cells. After a certain age, the brain stops producing new cells, as a result of which all brain cells remain active for most of their lives, which can lead to deterioration by virtue of time. The disease manifests in 20 million people per year, making it dangerous, as it cannot be contracted nor easily predicted at an early age [2].
Anticipating this disease before it causes any harm is a necessity, as without predicting the possibility of developing the disease, countermeasures to reduce symptoms will be greatly delayed. The classification of AD has been a very active topic in the past decade, based on various methodologies mainly using ML and DL approaches [3]. This is due to the importance of the research: with newer experiments and research results, the possibility of understanding the disease increases, decreasing the harm to mankind. These studies are based on data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), which provides data from multiple modalities. Intuitively, models that integrate data from different modalities outperform their monomodal counterparts.
Gonzalez et al. [1] presented a multimodal ML approach for early diagnosis of AD, which allows an objective comparison of the models used since the dataset and pipeline are the same for all models. Their proposed approach is to use a support vector machine (SVM) and RF on a combination of clinical and neuroimaging data, which allows a high degree of data diversity while maintaining a suitable degree of bias and variance. For accurate measurements of the performance of the models, the researchers constructed two SVM rating scores for each subject. These scores are added along with the clinical data features into the RF classifier and evaluated using 10-fold cross-validation. The results of the paper showed that using only demographic and clinical data results in a balanced accuracy of 76% with an AUC reaching 85%. Ultimately, by combining clinical and neuroimaging features, prediction results improved to a balanced accuracy of 79% and an AUC of 89%.
Venugopalan et al. presented a research paper [4] on a DL approach to predict AD by using integrated multimodal systems relying on data from Magnetic Resonance Imaging (MRI), genetics (focusing on single nucleotide polymorphisms) and electronic health records, to classify patients as suffering from AD, MCI or Cognitively Normal (CN), where the average healthy individual is CN. The researchers proposed an algorithm where stacked denoising auto-encoders are used to extract features from clinical and genetic data and a 3D CNN is applied to MRI data to aid the prediction. The extracted features are then concatenated into a fully connected layer followed by RF for classification. The results of the internal 10-fold cross-validation showed an accuracy of 88% and a recall of 85%.
El-Sappagh et al. presented an interesting new concept in [5], where a multilayered multimodal system for the detection and prediction of AD was used. The model integrates 11 modalities of the ADNI dataset, making precise decisions along with a set of interpretations for each decision to make the model more robust and accurate. The model has two layers to classify and predict the target class with minimal errors: the first layer performs multi-class classification for early diagnosis of AD, while the second layer performs binary classification to detect potential progression from MCI to AD within three years of baseline diagnosis. The designed model achieves a cross-validation accuracy of 93.95% in the first layer and 87.08% in the second layer.
The prediction and progression of AD have been extensively studied; however, research studies on early diagnosis using DL as a feature extractor for single-modality systems, especially neuroimaging, are less efficient for predicting the progression of AD [6]. Building on [1], we address the challenges of the dataset, which has an unequal distribution of classes in the training data as well as outliers, both of which affect the accuracy of the classifier. Therefore, in this paper, we optimize the performance of early-diagnosis prediction from the multimodal dataset of clinical data and neuroimaging scores used in [1] by balancing the training data with SMOTE and using a deep neural network to extract features before classification with several ML classifiers.
The goal of this paper is to predict the possibility of developing AD given that a person is classified as MCI, as opposed to CN, since people in the CN state are hard to predict unless a family history is available. The model will be used to rationalize the state of those suffering from MCI and distinguish whether they are closer to CN or AD. If the patient is closer to CN, then the patient is stable (sMCI); if the patient is closer to AD, then the patient is progressive (pMCI).
In this section, we first explain the proposed approach as a model, then define the materials to be used, and finally explain our work in steps.
The experimental analysis in this work contains four steps (see Fig. 1). The first is to split the dataset into test and training data, then balance the training data with SMOTE and create different feature sets. The second step is to extract the main features of each feature set using the DNN and experiment with the extracted features using different ML classifiers. The third step is to calculate the different performance metrics for the feature sets and compare the sets using these metrics to choose the most appropriate model. Finally, cross-validation is applied to estimate the performance of the ML classifiers and to optimize the hyperparameters of each, thus obtaining the model that achieves the best prediction accuracy.
2.2 Dataset
In previous studies, the data were obtained from ADNI, which is a public dataset. This dataset was released in 2003, with the main goal of measuring the progression of MCI and early AD using a combination of imaging, biological markers, and clinical and neuropsychological assessments. The three main subject classes, which are CN (normal), AD and MCI, had to have two fundamental tests composed of a mini-mental state examination (MMSE) and clinical dementia rating (CDR) with a range of values that define each class, or else they would be excluded from the data [3, 7]. In this study, we evaluate the performance of predicting progression to AD at 36 months. All the data used in the preparation were obtained from ADNI, and the same group of features as in [1] was used.
We include 15 different models of the dataset (see Table 1), each of which consists of a base-level model which holds demographic details (sex, education_level) plus the MMSE, a questionnaire used to measure cognitive impairment, and the CDR sum of boxes, used to accurately stage the severity of Alzheimer's dementia and MCI. This data is used as the baseline of our model, to classify those who are MCI as either progressive or stable, where pMCI will progress to AD while sMCI will not. In each of the upcoming feature sets, we add a different mix of features to the defined base model. The logical memory test, a standardized assessment of narrative episodic memory, is used in the next model. The Rey Auditory Verbal Learning Test (RAVLT) is a neuropsychological tool used to assess functions such as attention and memory [1]. We also have a series of AD assessment scale-cognitive (ADAS) breakdown features that help in the assessment of memory, language, concentration, and praxis at its core (adas_memory, adas_language, adas_concentration, adas_praxis), providing thorough information regarding the patient's condition. Another critical feature that is taken into consideration is Apolipoprotein ε4 (APOE4), which increases the risk of AD and is also associated with an earlier age of disease onset. Having one or two APOE ε4 alleles increases the risk of developing AD. The final sets include imaging feature scores, MRI-T1 and Fluorodeoxyglucose scores (T1_scores, fdg_score), combined with the previously mentioned features to obtain different perspectives that help in gaining increased accuracy. These scores were extracted from the imaging data using SVM in the study [1], and we reuse them in this study.
SMOTE is a method used when the dataset is imbalanced, i.e., when the class or target is imbalanced with a severely underrepresented class of data [8]. The algorithm can be separated into multiple steps that together produce the effect of data multiplication. The first step is under-sampling of the data to trim the outliers and any possible noise in the minority class, because the algorithm is more complex than merely reiterating or copying the data back into the dataset: SMOTE works in the feature space, i.e., the local area of the minority class if graphed, meaning the feature space is constructed for the minority target class. After the feature space is constructed, the SMOTE algorithm adds new points within the area, mainly near other points in the area, using the same protocol as K-nearest neighbors (KNN) [9]. The algorithm can use the Euclidean or the Manhattan distance in constructing the feature space. In our study, we used the Euclidean distance, as it can be used in any space to calculate distance; since the data points can be represented in any dimension, it is the more viable option. The Euclidean distance is depicted in (1), taking two points to find the square root of the sum of their squared differences:

d(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)^2 ).   (1)

Although creating virtual data points may seem a source of severe errors, the samples become more general with the increase in data points, preventing bias and variance from occurring. The most crucial step is the increase of data, where the amount of data increase required is determined through the parameters of the SMOTE algorithm. The first parameter is the sampling strategy (ss), which identifies which class is iterated, whether it is the minority or otherwise. The parameter k is the number of nearest neighbors used to synthesize new data. Furthermore, the out_step parameter determines the step size during calculations in the designated SMOTE algorithm. After setting the parameters (ss = minority, k = 5, out_step = 0.6), the data is inputted into the SMOTE algorithm to synthesize the new data points.
Fig. 2. Scatter plots of two random features of the dataset. (a) and (b) plots illustrate the data
points before and after SMOTE is applied respectively.
After splitting the dataset, the training data has 310 pMCI and 107 sMCI data points, represented in purple and yellow respectively (see Fig. 2); the minority class in the dataset is sMCI. As a result of SMOTE, the training dataset is balanced to have 310 data points of sMCI as well.
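A minimal sketch of this balancing step using the imbalanced-learn implementation of SMOTE is shown below; X_train and y_train stand for the training split described above, and the out_step parameter mentioned in the text belongs to the SVM-based SMOTE variant, so it is omitted here.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample the sMCI minority class using 5 nearest neighbours.
smote = SMOTE(sampling_strategy='minority', k_neighbors=5, random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print(Counter(y_train))      # e.g. {'pMCI': 310, 'sMCI': 107}
print(Counter(y_train_bal))  # e.g. {'pMCI': 310, 'sMCI': 310}
```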
The input layer of the DNN receives the set of characteristics we already obtained from the dataset, and random weights are assigned to the synapses that are vital for attaining the proper outcome in the training phase; weights with a significant positive or negative value will have a substantial influence on the output of the subsequent neuron. These synapses are then linked to neurons in the hidden layer. The hidden layers interpret significant elements of the input data that are predictive of the outputs. The DNN architecture has 3 hidden layers, with 1,000 neurons, 500 neurons, and 200 neurons respectively, and a drop-out layer, while the output layer has half the number of input features or attributes of the model. As for the optimization algorithm, the Adam optimizer is used, as it is a stochastic gradient descent replacement optimization technique for DL model training whose technique is based on the adaptive approximation of first- and second-order moments [9]. Adam combines the strongest features of the Adaptive Gradient Algorithm and Root Mean Squared Propagation to create an optimization algorithm.
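A Keras sketch of such a feature extractor is shown below. The hidden-layer sizes, the drop-out layer and the feature layer of n_features // 2 units follow the text; the sigmoid classification head used to train the network (and the drop-out rate) are assumptions, since the paper does not state how the DNN is supervised.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_feature_extractor(n_features, dropout_rate=0.5):
    inputs = tf.keras.Input(shape=(n_features,))
    x = layers.Dense(1000, activation='relu')(inputs)
    x = layers.Dense(500, activation='relu')(x)
    x = layers.Dense(200, activation='relu')(x)
    x = layers.Dropout(dropout_rate)(x)
    # Feature layer: half the number of input features, as described above.
    features = layers.Dense(n_features // 2, activation='relu', name='features')(x)
    # Assumed binary head (pMCI vs. sMCI) used only for training with Adam.
    outputs = layers.Dense(1, activation='sigmoid')(features)

    classifier = models.Model(inputs, outputs)
    classifier.compile(optimizer='adam', loss='binary_crossentropy',
                       metrics=['accuracy'])
    extractor = models.Model(inputs, features)   # yields the extracted features
    return classifier, extractor
```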
Like RFs, XGBoost uses decision trees as base learners. Individual decision trees are high-variance, low-bias models. They are extremely effective at detecting associations in any form of training data, but they struggle to extrapolate well to new data. Furthermore, the trees employed by XGBoost are not standard decision trees [11]: CARTs hold real-valued scores of whether an instance belongs to a group, rather than a single judgement, in each leaf node. When the tree has reached its maximum depth, the decision may be made by transforming the scores into categories based on a specific threshold.
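An illustrative training call with the xgboost package is shown below; X_train_feat and X_test_feat denote the DNN-extracted features, y_test the held-out labels, and the hyperparameter values are placeholders rather than the tuned values of the paper.

```python
from xgboost import XGBClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                    eval_metric='logloss', random_state=42)
clf.fit(X_train_feat, y_train_bal)          # labels encoded as 0 (sMCI) / 1 (pMCI)

pred = clf.predict(X_test_feat)
proba = clf.predict_proba(X_test_feat)[:, 1]
print(balanced_accuracy_score(y_test, pred), roc_auc_score(y_test, proba))
```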
Sensitivity measures the proportion of individuals with a disease who are correctly classified as positive; it measures the true positive rate of classification, making it a more informative value than accuracy (2). In this case, a highly sensitive test means that fewer cases of a disease are missed because there are fewer false negatives.

sensitivity = TruePositive / (TruePositive + FalseNegative)   (2)
Moreover, another measurement used is specificity, which counts the individuals who do not have the disease and are classified as negative; as such, it measures the true negative rate of classification (3). A test's high specificity means that there are few false positives. The use of a measurement method with low screening specificity may not be viable, since many people without the condition may test positive and undergo unnecessary diagnostic procedures.

specificity = TrueNegative / (TrueNegative + FalsePositive)   (3)
Lastly, balanced accuracy is used to measure the efficiency of binary and multi-category classification and is particularly useful when there is an imbalance between classes, meaning that one of the classes appears more frequently than the other. This often occurs in many settings where abnormalities and diseases are detected. Balanced accuracy is the arithmetic mean of sensitivity and specificity (4).

BalancedAccuracy = (sensitivity + specificity) / 2   (4)
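These three metrics can be computed from the binary confusion matrix as in the sketch below, treating pMCI as the positive class (an assumption about the label encoding):

```python
from sklearn.metrics import confusion_matrix

def sensitivity_specificity_balanced(y_true, y_pred):
    """Eqs. (2)-(4) from the confusion matrix, with pMCI encoded as label 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy
```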
3 Results
For each model of the feature set, the DNN is applied to perform feature engineering and extract the key features. The number of extracted features for each model is the same as the number of neurons in the output layer, thus reducing the dimensionality of the input data by removing redundant data. Taking the first model (base) as an example, its dimensions are (620, 4), which means 620 data points with 4 features; thus, the dimensions of the extracted model should be (620, 2). 60% of each model is set as a training set, 20% as a validation set, and the last 20% as a test set. Table 2 shows the validation accuracy of the best epoch of the DNN.
Table 2. Validation accuracy (%) of the best DNN epoch and the number of extracted features across the 15 feature-set models.
Val. Acc.: 63.4, 72, 83.5, 72.8, 76.2, 86.7, 77.1, 93.7, 89.5, 80.4, 83.6, 90.8, 87.0, 91.6, 90.6
Extracted features: 2, 3, 2, 3, 4, 4, 3, 4, 5, 2, 2, 3, 3, 4, 5
Table 3. Specificity, sensitivity, balanced accuracy, and AUC (%) of the first experiment across the 15 feature-set models.
Specificity: 65.1, 78.7, 70, 74, 79.2, 72.6, 74.9, 83.9, 79.7, 76.4, 85.3, 80.5, 74.1, 87.2, 73.4
Sensitivity: 69, 70.2, 87.5, 80.7, 82.6, 89.3, 84.4, 91.6, 88.3, 71.1, 73.8, 76.6, 86, 89.4, 93.7
Bal. Acc.: 67.1, 74.4, 78.7, 77.3, 80.9, 80.9, 79.6, 87.7, 84, 73.7, 79.6, 78.5, 80, 88.3, 83.6
AUC: 75, 81.8, 88.6, 86.7, 89.3, 90.9, 88.7, 94.3, 90.8, 83.2, 87.7, 87.2, 90.6, 94.5, 91.4
For the second experiment, XGBoost is used. The results look similar to those of the first experiment (see Table 4). XGBoost showed a balanced accuracy of 78% and an AUC of 95% for the same model (base_adas_scores) as in the first experiment. In further analysis, the sensitivity was observed to be higher for each model. A negative result on a test with high sensitivity is useful for ruling out the disease; a high-sensitivity test is reliable when its result is negative, since it rarely misdiagnoses those who have the disease.
Table 4. Specificity, sensitivity, balanced accuracy, and AUC (%) of XGBoost across the 15 feature-set models.
Specificity: 55.5, 69.1, 69.7, 70.7, 76.8, 73.3, 72, 79.9, 73.1, 67, 74.3, 71, 72.7, 81.9, 79.1
Sensitivity: 88, 90.5, 93.8, 94.8, 93.9, 93.9, 93.9, 95, 93, 89.7, 92.3, 92.3, 93.2, 92.8, 95.5
Bal. Acc.: 51.6, 62.5, 62.9, 67.8, 67.5, 69.7, 68.6, 69.2, 71.5, 61.6, 70.8, 75.3, 76.9, 78.4, 78.4
AUC: 81.2, 89.1, 91.1, 91.2, 92.3, 91.6, 92.4, 94.4, 92.2, 87, 91.3, 89.9, 91.9, 94.9, 92.9
Finally, for the last experiment, RF is used. Random Forest achieved higher results on all the feature sets (see Table 5). The RF algorithm avoids overfitting by using multiple trees, which gives accurate and precise results. As a result, the experiment showed a balanced accuracy of 92% and an AUC of 97% for a different model (base_ravlt_scores), which has both clinical and neuroimaging features.
Table 5. Specificity, sensitivity, balanced accuracy, and AUC (%) of RF across the 15 feature-set models.
Specificity: 74.6, 87.1, 88.6, 90.8, 90.2, 90.4, 90.2, 92.6, 93.5, 76.4, 86.3, 84.7, 96.5, 84.8, 90.5
Sensitivity: 68.6, 89.4, 71.1, 79.4, 82.7, 74.1, 76.9, 79.8, 74.1, 74.8, 83.6, 79.7, 87.2, 85.4, 81.5
Bal. Acc.: 71.6, 88.2, 79.9, 85.1, 86.5, 82.2, 83.6, 86.2, 83.8, 75.6, 85.0, 82.2, 91.9, 85.1, 86.0
AUC: 79.8, 94.6, 88.8, 93.9, 93.3, 90.3, 91.3, 92.4, 90.8, 84.7, 92.2, 89.3, 97.3, 93.2, 92.9
Comparing the experiments, RF has the highest results among the three classifiers (see Table 6). It is the most appropriate classifier, as it is robust to outliers and generalizes the data in an efficient way. As a result, we proved the reliability of the proposed approach and optimized the performance of early-diagnosis prediction by proposing a novel approach that uses SMOTE and DL to extract features from multimodal data.
4 Conclusion
In this paper, we outperformed the findings of [1] for both clinical data and the combination of clinical and neuroimaging data, as Table 7 shows.
Table 7. AUC comparison between paper [1] and the proposed approach
References
1. Samper-Gonzalez, J., et al.: Reproducible evaluation of methods for predicting progression
to Alzheimer’s disease from clinical and neuroimaging data. In: SPIE Medical Imaging 2019,
San Diego, USA (2019)
2. Alzheimer’s Association.: 2016 Alzheimer’s disease facts and figures. Alzheimer’s Dement.
12(4), 459–509 (2016)
3. Afzal, S., et al.: Alzheimer disease detection techniques and methods: a review. Int. J. Interact.
Multim. Artif. Intell. (In press, 2021)
4. Venugopalan, J., Ton, L., Hassanzadeh, H.R.D., Wang, M.: Multimodal deep learning models
for early detection of Alzheimer’s disease stage. Sci. Rep. 11, 3254 (2021)
5. El-Sappagh, S., et al.: A multilayer multimodal detection and prediction model based on
explainable artificial intelligence for Alzheimer’s disease. Sci. Rep. 11, 2660 (2021)
6. Sheng, J., Xin, Y., Zhang, Q., et al.: Predictive classification of Alzheimer’s disease using
brain imaging and genetic data. Sci. Rep. 12, 2405 (2022)
7. Zhu, Q., et al.: Classification of Alzheimer’s disease based on abnormal hippocampal
functional connectivity and machine learning. Front. Aging Neurosci. (2022)
8. Fujiwara, K., et al.: Over- and under-sampling approach for extremely imbalanced and small
minority data problem in health record analysis. Front. Public Health 8, 178 (2020)
9. Notley, S., Magdon-Ismail, M.: Examining the use of neural networks for feature extraction:
a comparative analysis using deep learning, support vector machines, and K-nearest neighbor
classifiers (2018)
10. Alam, M., Rahman, M., Rahman, M.: A Random Forest based predictor for medical data
classification using feature ranking. Inform. Med. Unlocked 15, 100180 (2019)
11. Budholiya, K., Shrivastava, S., Sharma, V.: An optimized XGBoost based diagnostic system
for effective prediction of heart disease. J. King Saud Univ. 34(7), 4514–4523 (2022)
Machine Learning and Optimization
Benchmarking Concept Drift Detectors
for Online Machine Learning
1 Introduction
Nowadays, machine learning is considered essential for almost every industry in the world, e.g. healthcare, finance, and manufacturing, to name just a few. Classical machine learning models are designed to act in static environments where data distributions are constant over time. However, with the recent complex systems in real life, this is not valid anymore. In real applications, the generated data distribution changes over time, and we need new techniques to deal with this fact. In addition, the massive explosion in the generated data volumes
due to IoT technology shifts machine learning from being an offline task to an online and continuous learning task. Thus, rather than learning from static data, classifiers need to learn from data streams. In such a learning mode, training and prediction are interleaved [5]. In the meantime, the learning approach cannot keep all the data due to the infinite nature of data streams. Due to the dynamic nature of the data, the latent data distribution learned by online ML algorithms might change over time. If this change goes unnoticed by the ML algorithm, its prediction accuracy degrades over time.
The change of the underlying data distribution is known as concept drift [30,
36]. To maintain the prediction accuracy of online ML algorithms, drift detection
techniques have been developed [1,2,4,11,18,21,22]. These techniques vary in
their underlying detection approach (more details in Sect. 2.1). However, they
agree on the input they receive, the classification prediction, and the feedback on
that prediction. Upon detection of a drift in the data distribution, the detector
raises a flag that is received by the online ML system to take corrective action
to restore the prediction accuracy. The latter is out of the scope of the drift
detector task.
As the number of drift detectors has grown, several studies over the past decade
have compared such drift detectors [3,12,13,15,20]. These studies evaluate drift
detection algorithms implemented within MOA [6], the state-of-the-art frame-
work for online ML. The evaluation is mostly focused on drift detection accuracy
on synthetic data. Other metrics such as runtime and memory consumption are
not addressed. Moreover, detection algorithms not integrated in MOA are not
included.
In this paper, we fill this gap in the evaluation and comparison of drift
detection algorithms. In addition to the fourteen detection algorithms in MOA,
we integrate two more, namely ADWIN++ [22] and SDDM [21].
Moreover, we cover detection accuracy, detection latency, runtime, and memory
consumption. The latter two metrics are of utmost importance from a data
engineering and operational point of view.
The rest of this paper is organized as follows. Section 2 briefly describes the
background concepts and definitions related to concept drift and discusses the
related work. The contribution of the paper is split into two sections. Section 3
describes the benchmark setup. Results and the comparison of the drift detection
algorithms are detailed in Sect. 4. Finally, Sect. 5 concludes the paper.
We start with a background about the different techniques for concept drift
detection on data streams. Next, we discuss related work on the comparative
evaluation of such techniques.
2.1 Background
Data Streams, Concept Drifts, and Their Types. A data stream can
be defined as an unbounded sequence in which the instances have timestamps
with various granularity [12,36]. The process that generates the stream can be
considered as a random variable X from which the objects x ∈ domain(X)
are drawn. In a classification learning context, a target (or class) variable y ∈
domain(Y ) is available, where Y denotes a random variable over the targets.
Thus, the data stream comprises instances (x1, y1), (x2, y2), ..., (xt, yt),
where (xt, yt) represents an instance at time t, xt is the vector of feature
values, and yt is the target for that particular instance. In practice, yt is
not necessarily known at time t, when the features xt were observed; yt is usually
known at a later time t + n.
As stated in the Bayesian Decision theory [9], the process of classification
can be described by the prior probabilities of the targets P (Y), and the target
conditional probability density function P (X|Y). The classification decision is
made based on the posterior probabilities of the targets, which can be obtained
from:
P(Y|X) = P(Y) · P(X|Y) / P(X)    (1)
Since P (Y) and P (X|Y) uniquely determine the joint distribution P (X, Y),
concepts can be defined as the joint distribution P (X, Y) [12]. A concept at
point of time t will be denoted as Pt(X, Y). In practice, concept drifts occur due
to changes in user tastes, for example, changes in trends on Twitter that might
make a tweet recommendation obsolete for the user. Mathematically, a concept
drift occurs due to a change in the generating random variable X that leads to
a change in the data distribution. Based on the definition of a concept, a concept
drift between the data at time t and the data at time u can formally be defined as
a difference in the data distributions at these two points in time, i.e., Pt(X, Y) ≠ Pu(X, Y).
According to [12], the popular patterns of concept drift are the following:
– Abrupt drift: when a learned concept is suddenly replaced by a new con-
cept (Fig. 1a).
– Gradual drift: when the change is not abrupt, but goes back and forth between
the original and the new concept (Fig. 1b).
– Incremental drift: when, as time passes, the probability of sampling from the
original concept distribution decreases and the probability of sampling from
the new concept increases (Fig. 1c).
Fig. 1. Patterns of concept drift, X-axis: time, Y-axis: the mean of the data
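To make the three patterns above concrete, the sketch below generates one-dimensional streams exhibiting an abrupt, a gradual, and an incremental change of the mean; the stream length, means, and change points are arbitrary illustrative choices, not the settings of the benchmark datasets used later.

# Sketch: generating streams that exhibit the three drift patterns.
# Means, lengths, and change points are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(42)
N, CHANGE = 10_000, 5_000            # stream length and change point


def abrupt(n=N, change=CHANGE):
    """The old concept (mean 0) is suddenly replaced by the new one (mean 1)."""
    means = (np.arange(n) >= change).astype(float)
    return rng.normal(means, 1)


def gradual(n=N, change=CHANGE, block=250):
    """Back-and-forth switching between the old and the new concept."""
    means = np.zeros(n)
    after = np.arange(n) >= change
    means[after] = (((np.arange(n)[after] - change) // block) % 2).astype(float)
    return rng.normal(means, 1)


def incremental(n=N, change=CHANGE):
    """The probability of sampling from the new concept grows over time."""
    p_new = np.clip((np.arange(n) - change) / (n - change), 0, 1)
    means = (rng.random(n) < p_new).astype(float)
    return rng.normal(means, 1)


for name, stream in [("abrupt", abrupt()), ("gradual", gradual()), ("incremental", incremental())]:
    print(name, stream[:CHANGE].mean().round(2), stream[-1000:].mean().round(2))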
Statistical detectors are categorized based on the underlying statistical test. The
Sequential Probability Ratio Test (SPRT) [33] detects and monitors the change
in data using a sequential hypothesis test. Page [24] developed two memory-less
models based on SPRT, the Cumulative Sum (CUSUM) and the Page-Hinckley
(PH) test. A drawback of the SPRT-based models is that they only depend on
two metrics when deciding on a concept drift: the false alarm and missed detection
rates [12]. Another statistical test is Fisher's Exact test, which is employed
when the number of errors or correct predictions is small. A model that uses
this test is the Fisher Proportions Drift Detector (FPDD) presented by de Lima
Cabral and de Barros [18], who extend it with the Fisher-based Statistical Drift
Detector (FSDD) and the Fisher Test Drift Detector (FTDD). Some models use
McDiarmid's inequality for detecting concept drifts, such as the McDiarmid Drift
Detection Methods (MDDMs) proposed by Pesaranghader et al. [26]. This model
uses a weighting scheme over a sliding window of prediction results, assigning
weights to stream elements with higher weights for the most recent ones. A concept
drift occurs when there is a significant difference between two weighted means.
Three drift detectors are derived from this model depending on the weighting
scheme type: MDDM-A (arithmetic), MDDM-G (geometric), and MDDM-E (Euler).
Some detectors use a base learner (classifier) to classify the future stream ele-
ments. The Drift Detection Method (DDM) [11] is the first algorithm to use this
concept. Several methods are extended from DDM such as the Early Drift Detec-
tion Method (EDDM) [1], and the Reactive Drift Detection Method (RDDM) [2].
The Statistical Drift Detection Method (SDDM) [21], however, does not require
feedback from the learner (classifier) to decide about drifts.
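As a concrete illustration of this family, the following is a simplified re-implementation of the DDM decision rule, not the MOA code: it tracks the running error rate p of the base learner and its standard deviation s, and signals a warning or a drift when p + s exceeds the best value observed so far by two or three standard deviations, respectively.

# Minimal sketch of a DDM-style detector: monitor the online error rate of a
# base learner and flag a drift when it degrades significantly.
import math


class SimpleDDM:
    def __init__(self, warm_up=30):
        self.warm_up = warm_up
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 1.0                    # running error rate
        self.s = 0.0                    # its standard deviation
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, error):
        """error is 1 if the base learner misclassified the element, else 0."""
        self.n += 1
        self.p += (error - self.p) / self.n
        self.s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.warm_up:
            return "stable"
        if self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s >= self.p_min + 3 * self.s_min:
            self.reset()                # drift confirmed: start a new concept
            return "drift"
        if self.p + self.s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"


# Usage: feed the 0/1 prediction errors of the online classifier one by one.
detector = SimpleDDM()
stream_errors = [1 if i % 50 == 0 else 0 for i in range(500)] + [1] * 60
flags = [detector.update(e) for e in stream_errors]
print(flags.index("drift") if "drift" in flags else "no drift")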
ADWIN [4] maintains an internal adaptive window that grows and shrinks depending
on the stream state. Upon the arrival of a stream element, the internal window is
split into two sub-windows covering all stored stream tuples, and a drift is decided
when a significant difference between the means of the two sub-windows occurs,
leading to dropping the oldest elements until no further change is detected.
A number of efficient implementations and enhancements have been presented in
the literature. Grulich et al. [14] extend ADWIN by presenting three variants,
Serial, HalfCut, and Optimistic, making ADWIN more scalable by optimizing its
throughput using parallel adaptive windowing techniques. Another optimization is
presented in [22] to account for the unbounded growth of the internal window size
with steady streams, i.e., streams that have a low frequency of drifts.
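A didactic sketch of the core adaptive-windowing idea follows: keep a window of recent values, test every split into two sub-windows, and drop the oldest elements whenever the difference of the sub-window means exceeds a Hoeffding-style threshold. This naive version is quadratic in the window size; the actual ADWIN algorithm uses exponential histograms to keep time and memory low, and ADWIN++ [22] additionally bounds the window growth.

# Didactic sketch of adaptive windowing: split the window into every possible
# pair of sub-windows and shrink it when their means differ significantly.
import math
from collections import deque


class NaiveAdaptiveWindow:
    def __init__(self, delta=0.002):
        self.delta = delta
        self.window = deque()

    def _cut(self, n0, n1):
        # Simplified Hoeffding-style threshold for values scaled to [0, 1].
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 * len(self.window) / self.delta))

    def update(self, value):
        """Add a value; return True if a drift was detected (window shrank)."""
        self.window.append(value)
        drift = False
        changed = True
        while changed and len(self.window) > 2:
            changed = False
            values = list(self.window)
            for split in range(1, len(values)):
                left, right = values[:split], values[split:]
                mean_l = sum(left) / len(left)
                mean_r = sum(right) / len(right)
                if abs(mean_l - mean_r) > self._cut(len(left), len(right)):
                    # Drop the oldest elements and re-check the smaller window.
                    for _ in range(split):
                        self.window.popleft()
                    drift = changed = True
                    break
        return drift


# Usage on a stream whose mean jumps from 0.2 to 0.8.
adw = NaiveAdaptiveWindow()
stream = [0.2] * 200 + [0.8] * 200
print([i for i, v in enumerate(stream) if adw.update(v)][:3])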
The comparative study in [13] covered the detectors ECDD, PHT, STEPD, PL, ADWIN, and DOF. Naive Bayes was used as the
base learner. The authors used both synthetic and real-world data sets. However,
the synthetic data sets covered only abrupt and gradual drifts. The evaluation of
the detectors was focused on accuracy-related aspects. A more recent benchmark
is presented by Barros et al. [3]. The evaluation setup largely follows the one
in [13]. It includes more detectors such as: HDDM, RDDM, WSTD and SeqDr.
Moreover, it uses VFDT as another base learner. The evaluation is concerned
with accuracy aspects only.
In this paper, we extend and complement the evaluation done in [3,13] by:
– including more recent drift detectors, namely SDDM [21] and ADWIN++ [22],
– using synthetic data sets that address all drift types in addition to real-world data sets,
– reporting the operational metrics of runtime, latency, and memory consumption in addition to drift detection accuracy measures.
3 Benchmark Setup
In this section, we describe the benchmark setup: the datasets, the algorithms
benchmarked, the parametrization used for the drift detectors, the metrics computed,
and the evaluation methodology.
3.1 Datasets
Both synthetic and real-world datasets are used for this experiment. The syn-
thetic data are generated using the generator from [14] and simulate
incremental, gradual, and abrupt concept drifts. Instances in these data represent
the predictions made by a Naïve Bayes base learner, so they can be applied directly
to the concept drift detection methods without the need for a base learner. This
helps in focusing on benchmarking the concept drift detection methods them-
selves without taking into consideration base learners and their possible delays
or effects on the experiments. Each dataset in this group consists of two million
instances. The incremental dataset (Inc1554) contains 1554 drifts. The gradual
dataset (Grad1738) contains 1738 drifts. The abrupt one (Abr1283) contains
1283 drifts.
The second group contains real-world datasets. There are three datasets:
Airlines (539383 records), Electricity (45312 records), and INSECTS-Abrupt
(balanced) (52848 records). The first two datasets are common in the literature
and are publicly available on the MOA [6] website. The last dataset is one of many
new datasets introduced in [31]. It can be used for benchmarking because its
characteristics and drift pattern are known, which avoids a challenge of the other
real-life datasets, namely the lack of ground truth about when drifts occurred;
therefore, we chose to include it in this paper.
3.3 Metrics
1 https://github.com/mahmoudmahgoub/moa
2 https://github.com/openjdk/jmh
To choose those values, we increased the warmup and measurement iteration counts
by one unit until the metric values stabilized, i.e., until they were no longer
affected by changes in the warmup and measurement iterations.
Although memory usage can be measured in different ways, for example using the
Java MemoryMXBean interface to record the memory before and after running the
drift detector methods, we preferred to use VisualVM3, an external profiling tool.
VisualVM gives accurate measurements of memory consumption that are unaffected
by garbage collector calls in the JVM.
4 Results
We report the metrics discussed in the previous section for the sixteen
algorithms, using the respective hyperparameter values reported in Table 1. We
have conducted the experiments on a computer with an Intel Core i5-8250U
processor and 12 GB of RAM, running the Windows 11 operating system and Java 8.
For measuring runtime and memory consumption, we report on the Inc1554 dataset
only, due to paper length limitations; however, each algorithm showed similar
behavior on the other datasets.
3 https://visualvm.github.io/
Table 2 presents the accuracy results for the sixteen methods. For the synthetic
data, ADWIN MOA and ADWIN++ have the highest accuracy as they detect
the exact number of the existing drifts in the three synthetic datasets.
RDDM, STEPD, HDDMA , and HDDMW have relatively good performance
when they are used on synthetic data. EDDM has modest accuracy for the
incremental and abrupt datasets whereas its accuracy is much higher for the
gradual dataset. SDDM's accuracy is close to EDDM's; yet, it does not seem to be
affected by the drift type. DDM gives many false positives for the gradual and
abrupt datasets. SeqDrift1, SeqDrift2, SEED, PH, GMA, EWMA, and CUSUM
have the worst accuracy. The reason can be attributed to their memory-less
nature. Thus, they learn the least about the data distribution. This is further
evidenced when we discuss memory consumption.
For the real-world datasets, we do not know exactly the number of drifts nor
their location in the data. As ADWIN has the highest accuracy in drift detection
on the synthetic datasets, we have used it as a reference to compare the accuracy
of the other methods on real-world data sets. The methods have very different
results compared to ADWIN. For instance, RDDM has poor accuracy on the real-world
data sets: 10% on Electricity, 30% on Airlines, and 0% on Insects. HDDMW
has a similar behavior to RDDM, except for the Insects dataset. STEPD has
a steadily huge rate of false positives. Memory-less methods, the last group in
Table 2, continue to perform poorly on the real-world data sets; the best accuracy
is 25%, reached by SEED on the Airlines dataset.
Fig. 2. Drift detection in the first 1000 instances of the synthetic datasets. The x-axis
represents the progress of the stream; the y-axis shows when a drift is detected.
Figure 2 shows the detected drifts in the first 1000 instances of the synthetic
datasets. We can notice that for the algorithms with a good or modest perfor-
mance, mentioned previously, only ADWIN MOA and ADWIN++ detect them
at the correct time, whereas others have detection delays in addition to false
positives. The DDM family, including SDDM, depends on some statistical test
to monitor the number of errors produced by a model learned on the previous
stream items. So, it may be too slow in responding to changes because it may
take many observations after the change to detect a drift. The same can be said
about memory-less detectors.
On the other hand, ADWIN reacts to drifts almost instantly. Upon the arrival
of a new stream element, a cut detection is made immediately by splitting the
main window into two sub-windows and looking for a possible drift. A drift is
called when there is a significant difference between the means of these two
windows.
4.2 Runtime
Figure 3 shows the average runtime of the different methods. SDDM and
ADWIN MOA are the slowest methods. SDDM has a high algorithmic complexity in
deciding whether a drift is detected, due to the internal bucketing and the
distance measure used [21]. For ADWIN MOA, we have investigated the reason for
the slowness and found that setting the hidden parameter mintClock to 1 is the
cause. By increasing mintClock we get more speed, but accuracy becomes lower. For
example, setting mintClock to 32 and using the Inc1554 dataset, the execution
time drops to 2712.246 ms; however, the number of detected drifts drops to 57.
ADWIN++ is better in that sense, as it is faster and preserves its accuracy. In
fact, that is the main improvement it brings [22].
Fig. 3. Average runtime in milliseconds using a logarithmic scale.
Figure 4 shows the consumed memory over time. All the algorithms need almost
the same small amount of memory, except for SDDM, which consumes about three
orders of magnitude more than the rest of the methods, almost 1.6 GB.
Apart from the ADWIN variants and SDDM, the other algorithms tend to consume an
almost constant amount of memory, as they decide on a drift based on statistical
calculations that monitor the number of errors counted after a previous drift;
they accumulate onto existing values and thus do not need to keep a window of
previous stream elements.
The memory consumption of ADWIN MOA is particularly interesting. It grows
over time if the number of drifts is small, because the size of the internal
window keeps increasing. The accuracy/latency trade-off of ADWIN MOA is affected
by the internal window size, i.e., the memory consumed. ADWIN++ keeps memory
consumption as low as the other methods, shown as "Others" in Fig. 4, while
preserving the accuracy and latency of ADWIN MOA. It achieves that by controlling
the window size [22] even when no drifts are detected.
SDDM memory consumption is very high because of the many internal parameters
of the algorithm that need to be optimized. For example, the bucket size, the
distance measure, and the number of features for which drifts are computed all
affect the accuracy, memory consumption, and thus the latency of drift detection.
The algorithm implementation needs to be optimized. These parameters are
orthogonal to those listed in Table 1 for SDDM. The optimization of such
parameters is out of the scope of the current paper.
5 Conclusion
ADWIN MOA and ADWIN++ achieved the highest accuracy on the synthetic data sets.
In particular, ADWIN++ maintains the accuracy of ADWIN while improving runtime
and memory consumption. Memory-less detectors are not useful and are not
recommended for use in real-life scenarios. Taking ADWIN as a reference point,
we notice a considerable difference in the performance of the other algorithms
on real-life data sets. This calls for further investigation of which of those
methods delivers the best accuracy, which is a target for future studies. The
memory consumption results show that further improvements to the SDDM
implementation are needed.
References
1. Baena-García, M., del Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavaldà, R.,
Morales-Bueno, R.: Early drift detection method. In: Fourth International Work-
shop on Knowledge Discovery from Data Streams, vol. 6, pp. 77–86 (2006)
2. de Barros, R.S.M., de Lima Cabral, D.R., Gonçalves Jr, P.M.G., de Carvalho San-
tos, S.G.T.: RDDM: reactive drift detection method. Expert Syst. Appl. 90, 344–
355 (2017)
3. Barros, R.S.M., Santos, S.G.T.C.: A large-scale comparison of concept drift detec-
tors. Inf. Sci. 451–452, 348–370 (2018)
4. Bifet, A., Gavaldà, R.: Learning from time-changing data with adaptive windowing.
In: ICDM, pp. 443–448. SIAM (2007)
5. Bifet, A., Gavaldà, R., Holmes, G., Pfahringer, B.: Machine Learning for Data
Streams with Practical Examples in MOA. MIT Press, Cambridge (2018)
6. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: massive online analysis.
J. Mach. Learn. Res. 11, 1601–1604 (2010)
7. Brzeziński, D., Stefanowski, J.: Accuracy updated ensemble for data streams with
concept drift. In: Corchado, E., Kurzyński, M., Woźniak, M. (eds.) HAIS 2011.
LNCS (LNAI), vol. 6679, pp. 155–163. Springer, Heidelberg (2011). https://doi.
org/10.1007/978-3-642-21222-2 19
8. Domingos, P.M., Hulten, G.: Mining high-speed data streams. In: SIGKDD, pp.
71–80. ACM (2000)
9. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley (2001)
10. Frías-Blanco, I., del Campo-Ávila, J., Ramos-Jiménez, G., Morales-Bueno, R.,
Ortiz-Díaz, A., Caballero-Mota, Y.: Online and non-parametric drift detection
methods based on Hoeffding’s bounds. IEEE Trans. Knowl. Data Eng. 27(3), 810–
823 (2015)
11. Gama, J., Medas, P., Castillo, G., Rodrigues, P.: Learning with drift detection. In:
Bazzan, A.L.C., Labidi, S. (eds.) SBIA 2004. LNCS (LNAI), vol. 3171, pp. 286–295.
Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-28645-5 29
12. Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on
concept drift adaptation. ACM Comput. Surv. 46(4), 44:1–44:37 (2014)
13. Gonçalves, P.M., de Carvalho Santos, S.G., Barros, R.S., Vieira, D.C.: A com-
parative study on concept drift detectors. Expert Syst. Appl. 41(18), 8144–8156
(2014)
56 M. Mahgoub et al.
14. Grulich, P.M., Saitenmacher, R., Traub, J., Breß, S., Rabl, T., Markl, V.: Scalable
detection of concept drifts on data streams with parallel adaptive windowing. In:
EDBT, pp. 477–480. OpenProceedings.org (2018)
15. Han, M., Chen, Z., Li, M., Wu, H., Zhang, X.: A survey of active and passive
concept drift handling methods. Comput. Intell. 38(4), 1492–1535 (2022)
16. Huang, D.T.J., Koh, Y.S., Dobbie, G., Pears, R.: Detecting volatility shift in data
streams, pp. 863–868 (2014)
17. Kolter, J.Z., Maloof, M.A.: Dynamic weighted majority: a new ensemble method
for tracking concept drift. In: ICDM, pp. 123–130. IEEE (2003)
18. de Lima Cabral, D.R., de Barros, R.S.M.: Concept drift detection based on Fisher’s
Exact test. Inf. Sci. 442, 220–234 (2018)
19. Liu, G., Cheng, H.R., Qin, Z.G., Liu, Q., Liu, C.X.: E-CVFDT: an improving
CVFDT method for concept drift data stream. In: ICCCAS, vol. 1, pp. 315–318.
IEEE (2013)
20. Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., Zhang, G.: Learning under concept
drift: a review. IEEE TKDE 31(12), 2346–2363 (2019)
21. Micevska, S., Awad, A., Sakr, S.: SDDM: an interpretable statistical concept drift
detection method for data streams. J. Intell. Inf. Syst. 56(3), 459–484 (2021).
https://doi.org/10.1007/s10844-020-00634-5
22. Moharram, H., Awad, A., El-Kafrawy, P.M.: Optimizing ADWIN for steady
streams. In: ACM/SIGAPP SAC, pp. 450–459. ACM (2022)
23. Nishida, K., Yamauchi, K.: Detecting concept drift using statistical testing. In:
Corruble, V., Takeda, M., Suzuki, E. (eds.) DS 2007. LNCS (LNAI), vol. 4755, pp.
264–269. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75488-
6 27
24. Page, E.S.: Continuous inspection schemes. Biometrika 41(1/2), 100–115 (1954).
https://doi.org/10.1093/biomet/41.1-2.100
25. Pears, R., Sripirakas, S., Koh, Y.S.: Detecting concept change in dynamic data
streams. Mach. Learn. 97, 259–293 (2014). https://doi.org/10.1007/s10994-013-
5433-9
26. Pesaranghader, A., Viktor, H.L., Paquet, E.: McDiarmid drift detection methods
for evolving data streams. In: IJCNN, pp. 1–9. IEEE (2018)
27. Roberts, S.W.: Control chart tests based on geometric moving averages. Techno-
metrics 1(3), 239–250 (1959). http://www.jstor.org/stable/1266443
28. Ross, G.J., Adams, N.M., Tasoulis, D.K., Hand, D.J.: Exponentially weighted mov-
ing average charts for detecting concept drift. Pattern Recogn. Lett. 33(2), 191–198
(2012). https://www.sciencedirect.com/science/article/pii/S0167865511002704
29. Sakthithasan, S., Pears, R., Koh, Y.S.: One pass concept change detection for data
streams. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013.
LNCS (LNAI), vol. 7819, pp. 461–472. Springer, Heidelberg (2013). https://doi.
org/10.1007/978-3-642-37456-2 39
30. Sobolewski, P., Wozniak, M.: Enhancing concept drift detection with simulated
recurrence. In: Pechenizkiy, M., Wojciechowski, M. (eds.) New Trends in Databases
and Information Systems. AISC, vol. 185, pp. 153–162. Springer, Heidelberg (2012).
https://doi.org/10.1007/978-3-642-32518-2 15
31. Souza, V.M.A., dos Reis, D.M., Maletzke, A.G., Batista, G.E.A.P.A.: Challenges in
benchmarking stream learning algorithms with real-world data. Data Min. Knowl.
Discov. 34(6), 1805–1858 (2020). https://doi.org/10.1007/s10618-020-00698-5
32. Street, W.N., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale clas-
sification. In: SIGKDD, pp. 377–382. ACM (2001)
1 Introduction
Affymetrix microarray datasets are a powerful analysis tool for determining
disease-relevant genes through the analysis of the mRNA expression profiles of thousands
of genes. Unfortunately, not all microarray genes are expressed in all tissues; genes
unrelated to the disease state are irrelevant and redundant, mislead machine learning
algorithms, and need to be removed [1, 2]. Therefore, selecting relevant, informative
genes is essential for enhancing the classification performance of disease
trajectories [3, 4]. Classification of imbalanced microarray datasets poses a challenge
for machine learning predictive modeling, as the distribution of samples across the
classes is biased or skewed [5]. Therefore, considerable attention has recently been
paid to tackling imbalanced datasets. Tang
et al. [5] proposed the granular Support Vector Machines repetitive undersampling
algorithm (GSVM-RU), a cost-sensitive learning approach that minimizes the negative
impact of information loss while maximizing the positive effect of data filtering
within the undersampling process, using a smaller number of support vectors.
Krawczyk et al. [6] introduced an effective ensemble of cost-sensitive Decision
Trees (DT) for imbalanced classification. Sáez et al. [7] suggested the Synthetic
Minority Oversampling Technique (SMOTE) combined with an Iterative-Partitioning
Filter (IPF) for balancing biased datasets; the study was based on synthesizing
new samples that are not correlated with noisy and borderline samples. Xiao et al. [8]
analyzed the effectiveness of a novel class-specific cost regulation extreme learning
machine (CCR-ELM), which determines a class-specific regulation cost for the
misclassification of each class in the performance index so that it is not sensitive
to the dispersion degree of the utilized dataset. Lopez-Garcia
et al. [9] suggested a hybrid metaheuristic feature selection approach combining a
Genetic Algorithm (GA) with cross entropy (CE), based on an ensemble called Adaptive
Splitting and Selection (AdaSS) that partitions the feature space into clusters. AdaSS
establishes a different classifier for each partition by adjusting the weights of the
different base classifiers using the discriminant function of the collective
decision-making method. Krawczyk et al. [10] confirmed the robustness of a boosting
strategy combined with evolutionary undersampling for handling imbalanced datasets
using an enhanced ensemble classifier named EUSBoost. Aljarah et al. [11] introduced
the whale optimization algorithm (WOA), a novel metaheuristic, to train a multilayer
perceptron (MLP) neural network classifier; WOA was used to determine the optimal
weights and biases that minimize the mean square error (MSE) fitness function of the
MLP in order to overcome the problem of imbalanced datasets. Aljarah et al. [12] also
implemented a machine learning algorithm based on radial basis function (RBF) neural
networks trained with the Biogeography-Based Optimizer. Likewise, Roshan and Asadi [13]
implemented an ensemble of bagging classifiers with evolutionary undersampling
techniques. In addition, some chaotic metaheuristic approaches and machine learning
algorithms have been applied to these problems [14–17], but their classification
performance remains a matter of concern.
Despite the many studies on handling imbalanced dataset classification, most of them
are restricted to undersampling the datasets, which results in information loss.
Therefore, the present study deploys a modified version of a novel metaheuristic
algorithm known as Emperor Penguin Optimization (EPO) for training a supervised
Random Forest (RF) classifier with bagging and boosting ensembles to overcome the
problem of imbalanced dataset classification. The results were assessed on discovery
imbalanced sepsis microarray datasets from the same platform (GPL570) by conducting
a statistical analysis of the experimental results through the calculation of the
classification accuracy.
2 Methodology
The preprocessed microarray dataset, which encompasses the average gene expression
values, undergoes binary conversion to be adaptable to the EPO algorithm. Since an
imbalanced microarray dataset can degrade classification performance through improper
gene selection, an oversampling strategy is applied to synthesize new samples and
adjust the class distribution of the utilized dataset. Thereafter, the most informative
genes are extracted from the preprocessed dataset using the Emperor Penguin Optimization
algorithm, based on the best classification performance achieved by the RF classifier
embedded within the EPO algorithm. The discriminative performance of the selected genes
is further evaluated using supervised machine learning algorithms to confirm the
robustness of the proposed modified EPO model. Figure 1 depicts the outline of the
proposed architecture of the modified EPO algorithm for gene selection from
imbalanced datasets.
Ψ = ∇Φ (1)
where Φ represents the wind velocity and Ψ determines the gradient of the wind
velocity.
The temperature profile of the huddle is mathematically modeled using the following
equation:
T′ = T − Maxiteration / (x − Maxiteration),   with T = 0 if R > 1 and T = 1 if R < 1    (2)
Fig. 1. System architecture of the modified version of EPO algorithm for imbalanced microarray
gene selection.
where, T represents the temperature, R is the radius of the polygon, x depicts the current
iteration, and Maxiteration defines the maximum number of iterations.
Thereafter, we calculate the distances between the assorted genes, which means that
genes update their positions according to the best emperor penguin position. This is
mathematically defined as follows:
Dep = Abs(S(A) · P(x) − C · Pep(x))    (3)
A = (M × (T′ + Pgrid(Accuracy)) × Rand()) − T′    (4)
Pgrid(Accuracy) = Abs(P − Pep)    (5)
S(A) = (f · e^(−x/l) − e^(−x))^2    (6)
C = Rand()    (7)
where Dep represents the distance between the current gene candidate and the best
selected gene whose fitness value is minimum, and x indicates the current iteration.
A and C are used to avoid collisions between gene candidates. P defines the best
selected gene, and Pep indicates the position vector of the current gene candidate.
S() defines the social forces responsible for moving towards the direction of the best
solution. M represents the movement parameter that preserves a gap between search
agents for collision avoidance, so as not to create a too tight or too loose huddle.
Pgrid(Accuracy) defines the polygon grid accuracy obtained by comparing the difference
between genes, while Rand() is a random value in the range [0, 1]. e denotes the
exponential function, and f and l are control parameters for better exploration and
exploitation.
Meanwhile, the positions of emperor penguins are updated according to the best
obtained optimal solution as follows:
Pep(x + 1) = P(x) − A · Dep    (8)
where Pep(x + 1) represents the updated position of the gene candidate.
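To show how Eqs. (2)–(8) fit together, the sketch below runs the position update for a small population of continuous candidate vectors. The fitness function, population size, and the control parameters f, l, and M are placeholders; in the proposed model the fitness is the classification performance of the embedded RF classifier.

# Illustrative sketch of the EPO-style update in Eqs. (2)-(8).
# Fitness function, population size, and control parameters are placeholders.
import numpy as np

rng = np.random.default_rng(1)
POP, DIM, MAX_ITER = 20, 10, 50
f, l, M = 2.0, 1.5, 2.0                      # exploration/exploitation controls


def fitness(candidate):
    return np.sum(candidate ** 2)            # toy objective (to be minimized)


positions = rng.uniform(-1, 1, size=(POP, DIM))
best_pos = min(positions, key=fitness).copy()
for x in range(1, MAX_ITER):                 # stop before x == MAX_ITER (Eq. 2 is singular there)
    R = rng.random()
    T = 0.0 if R > 1 else 1.0                # Eq. (2): R drawn in [0, 1), so T = 1
    T_prime = T - MAX_ITER / (x - MAX_ITER)  # temperature profile of the huddle
    S = (f * np.exp(-x / l) - np.exp(-x)) ** 2            # Eq. (6), social forces
    for i in range(POP):
        P_grid = np.abs(best_pos - positions[i])           # Eq. (5)
        A = M * (T_prime + P_grid) * rng.random() - T_prime  # Eq. (4)
        C = rng.random()                                    # Eq. (7)
        D = np.abs(S * best_pos - C * positions[i])         # Eq. (3)
        positions[i] = best_pos - A * D                     # Eq. (8)
        if fitness(positions[i]) < fitness(best_pos):       # keep the best candidate seen
            best_pos = positions[i].copy()

print("best fitness found:", round(float(fitness(best_pos)), 4))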
Finally, the synthetic data samples are generated using the following mathematical
model:
si = Xi + (Xzi − Xi ) × λ (14)
where Xzi represents one of the K-nearest-neighbour samples of Xi, and λ is a random
real number in [0, 1].
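Equation (14) can be read directly as code: a synthetic sample is placed at a random point on the line segment between a minority sample and one of its K nearest neighbours. The toy feature matrix and the value of K below are illustrative assumptions.

# Sketch of the interpolation in Eq. (14): a synthetic minority sample lies on the
# line between a minority sample and one of its K nearest neighbours.
import numpy as np

rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 5))   # toy minority-class samples
K = 5


def synthesize(X, k=K, n_new=10):
    new_samples = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        xi = X[i]
        # K nearest neighbours of xi within the minority class (excluding itself).
        dists = np.linalg.norm(X - xi, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        xzi = X[rng.choice(neighbours)]
        lam = rng.random()                         # λ in [0, 1]
        new_samples.append(xi + (xzi - xi) * lam)  # Eq. (14)
    return np.array(new_samples)


print(synthesize(X_minority).shape)  # (10, 5)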
The Random Forest (RF) classifier [21] is one of the most popular supervised machine
learning algorithms; it is built on an ensemble of DTs trained through bagging
(bootstrap aggregating). The classification performance of RF depends on the majority
voting of the ensemble trees; therefore, increasing the number of trees increases the
precision of the classification outcome and reduces overfitting on microarray datasets.
RF can utilize two ensemble techniques, bagging and boosting. Bagging generates
different training subsets from the training samples with replacement, and the final
output is based on majority voting, while boosting combines weak learners into a
strong learner by creating sequential models.
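For illustration, the sketch below trains a bagging-based Random Forest and a boosting ensemble on the same synthetic imbalanced data and compares their balanced accuracy; the dataset parameters are placeholders, not the microarray data used in this study.

# Sketch: bagging (Random Forest) vs. boosting on the same imbalanced data.
# Dataset parameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

bagging = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
boosting = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

for name, model in [("bagging (RF)", bagging), ("boosting", boosting)]:
    print(name, balanced_accuracy_score(y_te, model.predict(X_te)).round(3))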
The modified version of EPO for gene selection from the imbalanced microarray datasets
is assembled based on three successive stages:
Initialization
The initial population of microarray genes is represented in the suggested L-shaped
polygon, in which each data point represents the average gene expression value. Before
evaluating the fitness value of each gene candidate, we perform a binary conversion
using the sigmoidal transfer function of [22] (Eq. (15)), where V(x) represents the
velocity of the gene candidate at iteration x. Using Eq. (15), if Pep(x + 1) is 1, the
corresponding gene candidate is selected; otherwise it is not.
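The exact form of Eq. (15) is not reproduced above; as an assumption, the sketch below uses a common S-shaped (sigmoid) transfer function of the kind employed in binary optimizers such as BEPO [22] to map a continuous velocity to a 0/1 gene-selection decision.

# Hedged sketch of an S-shaped transfer function for binarizing a continuous
# velocity; the exact form of Eq. (15) in the paper may differ.
import numpy as np

rng = np.random.default_rng(0)


def binarize(velocity):
    """Map a continuous velocity V(x) to a 0/1 gene-selection decision."""
    prob = 1.0 / (1.0 + np.exp(-velocity))                    # sigmoid transfer
    return (rng.random(size=np.shape(velocity)) < prob).astype(int)


V = rng.normal(size=8)          # continuous candidate velocities
print(binarize(V))              # 1 means the corresponding gene is selected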
Updating Solutions
After the binary conversion of the gene candidate, the fitness value is evaluated to deter-
mine the best solution. This step is terminated after the maximum number of iterations
is reached.
Supervised Classification
The best selected gene candidates were randomly divided using a hold-out strategy in
which 80% are used for the training set and 20% for testing. An RF classifier trained
using bagging and boosting algorithms is used as the main supervised machine learning
algorithm for evaluating the effectiveness of the EPO algorithm in imbalanced
microarray gene selection. The classification performance was assessed by determining
the accuracy, sensitivity, and specificity of the classification phase [23].
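These three metrics follow directly from the binary confusion matrix; a small sketch with toy labels:

# Sketch: accuracy, sensitivity, and specificity from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # toy labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)               # recall of the positive class
specificity = tn / (tn + fp)
print(round(accuracy, 2), round(sensitivity, 2), round(specificity, 2))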
Table 1. List of the utilized microarray datasets with their clinical description.
To emphasize that the proposed model is adaptable to other microarray datasets, the
performance of gene selection was evaluated through the analysis of the aforementioned
microarrays listed in Table 4. From the table, it is evident that the proposed
algorithm achieved the highest classification performance, demonstrating its
exploitation capability.
In terms of average statistical evaluation metrics for the overall classification model,
Table 5 depicts the average accuracy performance of the competitive metaheuristic
optimization algorithms using SVM classifier implemented in the same experimental
environment. By analyzing the aforementioned table, we can conclude that the proposed
model outperforms the other optimizers in 24 out of 25 classification models.
To summarize, Fig. 3 and Fig. 4 depict the average confusion matrices for the EPO-RF
and the other compared algorithms; these figures allow us to determine the best
optimizer, i.e., the one that maximizes accuracy.
Table 4. Statistical evaluation results of the proposed algorithm using SVM classifier with 20
runs of fivefold cross-validation procedure. Acc., Accuracy; Spec., Specificity; Sens., Sensitivity;
Prec., Precision.
Table 5. Average accuracy performance of the competitive optimization algorithms using SVM
classifier with 20 runs of fivefold cross-validation procedure.
Fig. 3. (A−F) The average confusion matrix of the EPO-RF algorithm in comparison with other
metaheuristic algorithms for GSE66099 A) EPO, B) PSO, C) GWO, D) SSA, E) HHO, and F)
GA.
Computational Microarray Gene Selection Model 69
Fig. 4. (A−F) The average confusion matrix of the EPO-RF algorithm in comparison with other
metaheuristic algorithms for GSE13904 A) EPO, B) PSO, C) GWO, D) SSA, E) HHO, and F)
GA.
4 Conclusion
References
1. Mao, Z., Cai, W., Shao, X.: Selecting significant genes by randomization test for cancer
classification using gene expression data. J. Biomed. Inform. 46, 594–601 (2013)
2. Zhang, H.-J., Li, H., Li, X., Zhao, B., Ma, Z.-F., Yang, J.: Influence of pyrolyzing atmosphere
on the catalytic activity and structure of Co-based catalysts for oxygen reduction reaction.
Electrochim. Acta 115, 1–9 (2014)
3. Chen, Y., Wang, L., Li, L., Zhang, H., Yuan, Z.: Informative gene selection and the direct
classification of tumors based on relative simplicity. BMC Bioinform. 17, 44 (2016)
4. Liu, H., Liu, L., Zhang, H.: Ensemble gene selection for cancer classification. Pattern
Recognit. 43, 2763–2772 (2010)
5. Tang, Y., Zhang, Y., Chawla, N.V., Krasser, S.: SVMs modeling for highly imbalanced
classification. IEEE Trans. Syst. Man Cybern. 39, 281–288 (2009)
6. Krawczyk, B., Woźniak, M., Schaefer, G.: Cost-sensitive decision tree ensembles for effective
imbalanced classification. Appl. Soft Comput. 14, 554–562 (2014)
7. Sáez, J.A., Luengo, J., Stefanowski, J., Herrera, F.: SMOTE–IPF: addressing the noisy and
borderline examples problem in imbalanced classification by a re-sampling method with
filtering. Inf. Sci. 291, 184–203 (2015)
8. Xiao, W., Zhang, J., Li, Y., Zhang, S., Yang, W.: Class-specific cost regulation extreme learning
machine for imbalanced classification. Neurocomputing 261, 70–82 (2017)
9. Lopez-Garcia, P., Masegosa, A.D., Osaba, E., Onieva, E., Perallos, A.: Ensemble classification
for imbalanced data based on feature space partitioning and hybrid metaheuristics. Appl. Intell.
49(8), 2807–2822 (2019). https://doi.org/10.1007/s10489-019-01423-6
10. Krawczyk, B., Galar, M., Jeleń, Ł, Herrera, F.: Evolutionary undersampling boosting for
imbalanced classification of breast cancer malignancy. Appl. Soft Comput. 38, 714–726
(2016)
11. Aljarah, I., Faris, H., Mirjalili, S.: Optimizing connection weights in neural networks using
the whale optimization algorithm. Soft. Comput. 22(1), 1–15 (2016). https://doi.org/10.1007/
s00500-016-2442-1
12. Aljarah, I., Faris, H., Mirjalili, S., Al-Madi, N.: Training radial basis function networks using
biogeography-based optimizer. Neural Comput. Appl. 29(7), 529–553 (2016). https://doi.org/
10.1007/s00521-016-2559-2
13. Roshan, S.E., Asadi, S.: Improvement of bagging performance for classification of imbalanced
datasets using evolutionary multi-objective optimization. Eng. Appl. Artif. Intell. 87, 103319
(2020)
14. Hashim, F., Mabrouk, M.S., Al-Atabany, W.: GWOMF: Grey Wolf Optimization for motif
finding. In: 2017 13th International Computer Engineering Conference (ICENCO), pp. 141–
146 (2017)
15. Elden, R.H., Ghoneim, V.F., Al-Atabany, W.: A computer aided diagnosis system for the early
detection of neurodegenerative diseases using linear and non-linear analysis. In: 2018 IEEE
4th Middle East Conference on Biomedical Engineering (MECBME), pp. 116–121 (2018)
16. Elden, R.H., Ghoneim, V.F., Hadhoud, M.M.A., Al-Atabany, W.: Studying genes related to the
survival rate of pediatric septic shock. In: 2021 3rd Novel Intelligent and Leading Emerging
Sciences Conference (NILES), pp. 93–96 (2021)
Computational Microarray Gene Selection Model 71
17. Abdelnaby, M., Alfonse, M., Roushdy, M.: A hybrid mutual information-LASSO-genetic
algorithm selection approach for classifying breast cancer (2021)
18. Dhiman, G., Kumar, V.: Emperor penguin optimizer: a bio-inspired algorithm for engineering
problems. Knowl. Based Syst. 159, 20–50 (2018)
19. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling
technique. J. Artif. Intell. Res. 16, 321–357 (2002)
20. Haibo, H., Yang, B., Garcia, E.A., Shutao, L.: ADASYN: adaptive synthetic sampling app-
roach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural
Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328 (2008)
21. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
22. Dhiman, G., et al.: BEPO: a novel binary emperor penguin optimizer for automatic feature
selection. Knowl. Based Syst. 211, 106560 (2021)
23. Prince John, R., Lewall David, B.: Sensitivity, specificity, and predictive accuracy as measures
of efficacy of diagnostic tests. Ann. Saudi Med. 1, 13–18 (1981)
24. Heidari, A.A., Mirjalili, S., Faris, H., Aljarah, I., Mafarja, M., Chen, H.: Harris hawks
optimization: algorithm and applications. Future Gener. Comput. Syst. 97, 849–872 (2019)
25. Mirjalili, S., Mirjalili, S.M., Lewis, A.: Grey wolf optimizer. Adv. Eng. Softw. 69, 46–61
(2014)
26. Mirjalili, S., Gandomi, A.H., Mirjalili, S.Z., Saremi, S., Faris, H., Mirjalili, S.M.: Salp swarm
algorithm: a bio-inspired optimizer for engineering design problems. Adv. Eng. Softw. 114,
163–191 (2017)
27. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of ICNN 1995 -
International Conference on Neural Networks, vol. 1944, pp. 1942–1948 (1998)
28. Whitley, D.: A genetic algorithm tutorial. Stat. Comput. 4, 65–85 (1994)
Fuzzing-Based Grammar Inference
1 Introduction
Software testing is one of the most important phases of the software lifecy-
cle. This includes testing not only for functional correctness but also for safety
and security. Finding bugs and security vulnerabilities presents a difficult task
when facing complex software architectures. An integral part of software test-
ing is fuzzing software with more or less random input and tracking how the
software reacts. Having knowledge about the input structure of the software
under test enables the fuzzer to generate more targeted inputs which signif-
icantly increases the chance to uncover bugs and vulnerabilities by reaching
deeper program states.
As such the most successful fuzzers all come with some sort of model that
describes the input structure of the target program. One of the most promis-
ing methods is grammar-based fuzzing, where inputs are generated based
on a context-free grammar which fully covers the so-called input language of
a program. This makes it possible for the grammar-based fuzzer to produce
inputs that are valid or near-valid, considerably raising its success rate. Although
grammar-based fuzzing is a very successful method, in most cases such a precise
description of the input language is not available.
The automation of learning input languages for a program, in our instance
in the form of a synthesized context-free grammar, is still an issue in current
grammar-based fuzzing techniques and is not completely resolved yet. With these
2 Preliminaries
In this section we introduce the necessary notation and theory for the rest of the
paper. We assume the reader is familiar with basic concepts of language theory.
An excellent reference for that is the classical book by Hopcroft and Ullman [9].
where the productions in P are of the form (Σ ∪ N )∗ N (Σ ∪ N )∗ → (Σ ∪ N )∗ .
We say G derives (or equivalently produces) a string y from a string x in
one step, denoted x ⇒ y, iff there are u, v, p, q ∈ (Σ ∪ N )∗ such that x = upv,
p → q ∈ P and y = uqv. We write x ⇒∗ y if y can be derived in zero or
more steps from x, i.e., ⇒∗ denotes the reflexive and transitive closure of the
relation ⇒.
The language of G, denoted as L(G), is the set of all strings in Σ ∗ that can
be derived in a finite number of steps from the start symbol S. In symbols,
L(G) = {w ∈ Σ ∗ | S ⇒∗ w}
In this work, Σ always denotes the “input” alphabet (e.g., the set of ASCII
characters) of a given executable (binary) program p. The set of valid inputs of p
is defined as the subset of Σ ∗ formed by all well formed inputs for p. In symbols:
validInputs(p) = {w ∈ Σ ∗ | w is a well formed input for p}.
The definition of a well formed input for a given program p depends on the
application at hand. In our setting we only need to assume that it is possible to
determine whether a given input string w is well formed or not for a program p
by simply running p with input w.
As usual, we assume that validInputs(p) is a context-free language. Conse-
quently, there is a context-free grammar Gp such that L(Gp ) = validInputs(p).
Recall that a grammar is context-free if its production rules are of the form
A → α with A a single nonterminal symbol and α a possibly empty string of
terminals and/or nonterminals.
Our main contribution in this paper is a novel algorithm that takes as input
a program p and a finite (small) subset I of validInputs(p), and infers a gram-
mar Gp such that L(Gp ) approximates validInputs(p). I is usually called seed
input. We say “approximates” since it is not decidable in our setting (see [5])
whether L(Gp ) = validInputs(p). To evaluate how well L(Gp ) approximates
validInputs(p), we measure the precision and recall of L(Gp ) w.r.t. validInputs(p)
as in [2], among others.
In our setting we first fix a procedure to calculate the probability distribution
of a language, starting from its corresponding grammar. Following [2] we use
random sampling of strings. Let G = (N, Σ, P, S) be a context-free grammar.
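Under the assumption of two helper routines, a random-derivation sampler and a membership test (e.g., an Earley-style parser), the sampling-based estimate of precision and recall can be sketched as follows; both helper names are hypothetical.

# Sketch of the sampling-based evaluation: precision is the share of strings sampled
# from the learned grammar that the target language accepts, recall the share of
# strings sampled from the target grammar that the learned grammar accepts.
def precision_recall(learned, target, sample_string, is_member, n=1000):
    # sample_string(G): randomly derive a word from grammar G (hypothetical helper)
    # is_member(G, w): True iff w is in L(G), e.g. via an Earley parser (hypothetical helper)
    from_learned = [sample_string(learned) for _ in range(n)]
    from_target = [sample_string(target) for _ in range(n)]
    precision = sum(is_member(target, w) for w in from_learned) / n
    recall = sum(is_member(learned, w) for w in from_target) / n
    return precision, recall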
3 Algorithm
The goal of our approach is to infer a context-free grammar Gp given a program p,
a set of terminal symbols Σ as well as some valid seed inputs I. Ideally the
language L(Gp ) produced by our inferred grammar Gp should be able to produce
the input language validInputs(p) of p such that L(Gp ) = validInputs(p). To
achieve our goal, we apply the following steps:
strategy is based on extracting a seed grammar from p using I, and then expand-
ing it using grammar-based fuzzing of p until we obtain a good approximation of
validInputs(p). Grammar-based fuzzing of p means that we execute p with ran-
domly sampled inputs produced from a given grammar. The concrete strategy
is described in Algorithm 1.
The seed grammar extraction is done by the function findSeedGrammar(I, p)
(line 1 in Algorithm 1). The set of non-terminals Ns of the seed grammar Gs is
formed by the names of all functions called by executing p with inputs from I
(line 2 in Algorithm 1). The set of productions Ps of Gs is obtained as follows.
For each u ∈ I,
1. Extract from p and u a parse tree Tu for u, where the internal nodes of Tu are
labelled by non-terminals in Ns and the leaves are labelled by terminals in Σ.
parse → expr
expr → term | term+term | term−term
term → factor | factor ∗ factor | factor / factor
factor → 1 | 2 | 3 | ( expr )
Listing 1. Target Grammar
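For illustration, the target grammar of Listing 1 can be written as a plain mapping from non-terminals to alternative right-hand sides, from which a simple random-derivation fuzzer produces valid inputs; the depth limit and seed are arbitrary choices.

# Sketch: the grammar of Listing 1 as a Python mapping, plus a random-derivation
# fuzzer that produces valid inputs from it.
import random

GRAMMAR = {
    "parse":  [["expr"]],
    "expr":   [["term"], ["term", "+", "term"], ["term", "-", "term"]],
    "term":   [["factor"], ["factor", "*", "factor"], ["factor", "/", "factor"]],
    "factor": [["1"], ["2"], ["3"], ["(", "expr", ")"]],
}


def derive(symbol="parse", depth=0, max_depth=8):
    """Expand a symbol by repeatedly choosing random productions."""
    if symbol not in GRAMMAR:                 # terminal: emit as-is
        return symbol
    rules = GRAMMAR[symbol]
    if depth >= max_depth:                    # prefer short rules to terminate
        rules = [min(rules, key=len)]
    return "".join(derive(s, depth + 1, max_depth) for s in random.choice(rules))


random.seed(7)
print([derive() for _ in range(5)])   # five randomly derived arithmetic expressions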
3.2 Example
Fig. 1. From parse tree to grammar.
Fig. 2. Parse tree evolution of query 1.
Fig. 3. Parse tree evolution of query 2.
Fig. 4. Parse tree evolution of query 3.
parse → expr
expr → term | term ( −term | +term | )
term → factor | factor ( / factor | ∗ factor | )
factor → 1 | 2 | 3 | ( expr )
Listing 2. Learned Grammar
Once we have found the correct α, we will not be able to find another counterexample,
and we stop fuzzing when the maximum specified number of tries is reached. Finally, we add
expr → term | term(−term | +term | ) to Gp . We repeat the process described
above for every non-terminal in Gs . The resulting grammar Gp is shown in
Listing 2. This grammar is equivalent to Gp given in Fig. 1.
Fig. 5. Precision and recall of the learned grammar over the course of the learning process.
We generate a parser p that accepts our target grammar Gp using the compiler-
compiler Coco/R1. We apply our fuzzing-based grammar inference algorithm
on this setup. If the specified maximum number of allowed membership queries
has been reached without finding a counterexample, execution is stopped and
the current state of the inferred grammar Gp is returned. Finally, we calculate
precision(L(Gp ), L(Gp )) and recall (L(Gp ), L(Gp )) by randomly sampling 1 000
words each.
Table 1 displays the results, with the first column indicating the targeted
parser. The second and third columns show the resulting precision and recall of
our extracted grammar, followed by the total number of unique membership
queries (MQ) performed, the total number of equivalence queries (EQ) performed,
and the time elapsed. In contrast to previous approaches that rely on a
good number of representative seed inputs to perform appropriately (cf. Sect. 5),
we have used sets I with no more than two seed inputs each. Our results in
Table 1 show that we are able to recover a grammar that perfectly matches the
target grammar for the 1 000 inputs examined. This shows that our approach is
indeed robust to recover context-free grammars from recursive top-down parsers.
In the following we provide more details for the experiment “JsonParser”.
The target grammar is given in Listing 3. Listing 4 shows our learned grammar.
Additionally Fig. 5 shows a more detailed analysis of the learning process for
“JsonParser”. We calculate precision(L(Gp ), L(Gp )) and recall (L(Gp ), L(Gp ))
every time an equivalence query is performed, where Gp is the current state of
the inferred grammar. Again, we use 1 000 randomly sampled words each. As
can be seen in Fig. 5, precision stays at 1 most of the time during the learning
process, due to the fact that L(Gs ) ⊆ L(Gp ) (see Sect. 3.1).
1 https://ssw.jku.at/Research/Projects/Coco/
Json → Element .
Element → Ws Value Ws .
Ws → " " .
Value → Object | Array | String | Number | "true" | "false" | "null" .
Object → "{" Ws [ String Ws ":" Element [ "," Members ] ] "}" .
Members → Member [ "," Members ] .
Member → Ws String Ws ":" Element .
Array → "[" Ws [ Value Ws [ "," Elements ] ] "]" .
Elements → Element [ "," Elements ] .
String → "'" Characters "'" .
Characters → | Character Characters .
Character → "a" | "b" | "c" .
Number → Integer Fraction Exponent .
Integer → ["−"] ( "0" | Onenine [ Digits ] ) .
Digits → Digit [ Digits ] .
Digit → "0" | Onenine .
Onenine → "1" | "2" | "3" .
Fraction → | "." Digits .
Exponent → | "E" Sign Digits | "e" Sign Digits .
Sign → | "+" | "−" .
Listing 3. Target Grammar
Json → Element .
Element → Ws Value Ws .
Ws → " " .
Value → Object | Array | String | Number | "true" | "false" | "null" .
Object → "{" ( Ws "}" | Ws ( "}" | String Ws ":" ( Element "}" | Element ( "}" | "," Members "}" ) ) ) ) .
Members → Member | Member ( "," Members | ) .
Member → Ws String Ws ":" Element .
Array → "[" ( Ws "]" | Ws ( "]" | Value ( Ws "]" | Ws ( "]" | "," Elements "]" ) ) ) ) .
Elements → Element | Element ( "," Elements | ) .
String → "'" Characters "'" .
Characters → | Character Characters .
Character → "a" | "b" | "c" .
Number → Integer Fraction Exponent .
Integer → "0" | Onenine | Onenine ( Digits | ) | "−" ( Onenine | Onenine ( Digits | ) | "0" ) .
Digits → Digit | Digit ( Digits | ) .
Digit → "0" | Onenine .
Onenine → "1" | "2" | "3" .
Fraction → | "." Digits .
Exponent → | "E" Sign Digits | "e" Sign Digits .
Sign → | "+" | "−" .
Listing 4. Learned Grammar
If only a portion of the desired language is accepted by the rule at the mea-
surement point, precision remains at 1. Precision may gradually drop as you
learn more rules over time. This could occur when attempting to identify the
correct body of a rule, in particular when the rule accepts a superset of the
wanted language. These inaccuracies are automatically fixed when a counterex-
ample is found. For example the drop in precision in Fig. 5 occurs while learning
a rule which consumes integers. The learned automaton accepts words containing
preceding zeros as well as words containing more than one “-” at the beginning.
Both are not accepted by the parser. As such, the precision of the learned gram-
mar was lowered to 0.5. After the drop in precision, first the issue with multiple
“-” symbols is fixed by providing a counterexample. This raises precision to 0.8.
Finally, after providing another counterexample and consequently disallowing
preceding zeros, the rule is learned correctly and precision increases back to 1.
In terms of recall, we see a consistent increase over time as the learnt grammar
is expanded, and as a result, the learned language grows significantly. When a
rule that consumes terminals is learned, the boost in recall is often greater. For
example, the final spike in recall occurs while learning the rules for parsing digits
and mathematical symbols.
We must remark that we have hardly applied any optimizations in our implementation,
which leaves a lot of room for improvement. Possible performance improvements
include (i) using hash tables to store previously seen membership queries instead of
a plain-text list, (ii) replacing the Earley parser used to determine whether a
grammar produces a given word with something more efficient, (iii) using
parallelization to speed up fuzzing, and (iv) learning the different rules of the
seed grammar simultaneously.
5 Related Work
Extracting context-free grammars for grammar-based fuzzing is not a new idea.
Several methods exist for grammar learning which try to recover a context-free
grammar by means of membership-queries from a black-box, such as by begin-
ning with a modestly sized input language and then generalizing it to better fit
a target language [2,15]. Another approach synthesizes a grammar-like structure
during fuzzing [3]. However, this grammar-like structure has a few shortcomings,
e.g., multiple nestings that are typical in real-world systems are not represented
accurately [8]. Other methods use advanced learning techniques to derive the
input language like neural networks [7] or Markov models [6]. Although black-box
learning is generally promising, it suffers from inaccuracies and incompleteness
of learned grammars. It is shown in [1] that learning a context-free language from
a black box cannot be done in polynomial time. As a result, all pure black-box
methods must give up part of the accuracy and precision of the learnt grammars.
Due to limitations with black-box approaches there exist several white-box
methods to recover a grammar. If full access to the source code of a program is
given, described methods fall under the category of grammar inference. Known
methods for inferring a context-free language using program analysis include
autogram [10] and mimid [8]. Unlike its predecessor autogram, which relies
on data flow, mimid uses dynamic control flow to extract a human readable gram-
mar. Finally, [11] describes how a grammar can be recovered using parse-trees
of inputs, which is then improved with metrics-guided grammar refactoring. All
of the aforementioned grammar inference methods share the same flaw: They all
primarily rely on a predetermined set of inputs from which a grammar is derived
that corresponds to this precise set of inputs. If some parts of a program are not
covered by the initial set of inputs, the resulting grammar will also not cover
these parts. However there exist some methods that attempt to automatically
generate valid input for a given program, such as symbolic execution [14].
6 Conclusion
Our main contribution is a novel approach for grammar inference that com-
bines machine learning, grammar-based fuzzing and program analysis. Our app-
roach, in contrast to other efforts, reduces reliance on a good set of seed inputs
References
1. Angluin, D., Kharitonov, M.: When won’t membership queries help? J. Comput.
Syst. Sci. 50(2), 336–355 (1995)
2. Bastani, O., Sharma, R., Aiken, A., Liang, P.: Synthesizing program input gram-
mars. In: PLDI, pp. 95–110. ACM (2017)
3. Blazytko, T., et al.: GRIMOIRE: synthesizing structure while fuzzing. In: USENIX
Security Symposium, pp. 1985–2002. USENIX Association (2019)
4. Bollig, B., Habermehl, P., Kern, C., Leucker, M.: Angluin-style learning of NFA.
In: IJCAI, pp. 1004–1009 (2009)
5. Gold, E.: Language identification in the limit. Inf. Control 10(5), 447–474 (1967)
6. Gascon, H., Wressnegger, C., Yamaguchi, F., Arp, D., Rieck, K.: PULSAR: stateful
black-box fuzzing of proprietary network protocols. In: Thuraisingham, B., Wang,
X.F., Yegneswaran, V. (eds.) SecureComm 2015. LNICST, vol. 164, pp. 330–347.
Springer, Cham (2015). https://doi.org/10.1007/978-3-319-28865-9 18
7. Godefroid, P., Peleg, H., Singh, R.: Learn& fuzz: machine learning for input fuzzing.
In: ASE, pp. 50–59. IEEE Computer Society (2017)
8. Gopinath, R., Mathis, B., Zeller, A.: Inferring input grammars from dynamic con-
trol flow. CoRR abs/1912.05937 (2019)
9. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory. Languages and
Computation. Addison-Wesley, Boston (1979)
10. Höschele, M., Zeller, A.: Mining input grammars with AUTOGRAM. In: ICSE
(Companion Volume), pp. 31–34. IEEE Computer Society (2017)
11. Kraft, N., Duffy, E., Malloy, B.: Grammar recovery from parse trees and metrics-
guided grammar refactoring. IEEE Trans. Softw. Eng. 35(6), 780–794 (2009)
12. Mathis, B., Gopinath, R., Mera, M., Kampmann, A., Höschele, M., Zeller, A.:
Parser-directed fuzzing. In: PLDI, pp. 548–560. ACM (2019)
86 H. Sochor et al.
13. Moser, M., Pichler, J.: eknows: platform for multi-language reverse engineering and
documentation generation. In: 2021 IEEE International Conference on Software
Maintenance and Evolution (ICSME), pp. 559–568 (2021)
14. Moser, M., Pichler, J., Pointner, A.: Towards attribute grammar mining by sym-
bolic execution. In: SANER, pp. 811–815. IEEE (2022)
15. Wu, Z., et al.: REINAM: reinforcement learning for input-grammar inference. In:
ESEC/SIGSOFT FSE, pp. 488–498. ACM (2019)
Natural Language Processing
In the Identification of Arabic Dialects:
A Loss Function Ensemble Learning
Based-Approach
1 Introduction
More than 2 billion people use Arabic as their liturgical language, and it is the
sixth or seventh most widely spoken language in the world. It is a member of the
Semitic language family. Arabic is typically considered one of the hardest languages
to learn. First and foremost, Arabic has a 28-symbol alphabet whose letters change
form depending on where they appear in a word. Additionally, Arabic is read from right
to left, which is completely counter-intuitive to how most Westerners read. The
letters and the diacritics, which alter the sound values of the letters to which they
techniques between two different models with the loss function that yielded the
best results on each model, which enhanced the model performance.
The rest of the paper is organized as follows. A review of earlier Arabic
dialect identification literature is in Sect. 2. The proposed dataset is described
in Sect. 3. The methods and pipeline are in Sect. 4. The results and evaluation
are discussed in Sect. 5. Finally, we conclude in Sect. 6.
2 Related Work
A lot of work has gone into developing methods for reliably identifying Arabic
dialects, although this is more difficult than simply identifying a particular language
because Arabic dialects share a lot of vocabulary. In recent years, the problem
has received a lot of attention.
Abdelali et al. [1] discovered that a significant source of errors in their method
was the naturally occurring overlap between dialects of nearby countries. They
created a dataset of 540k tweets containing a significant amount of dialectal Arabic
by applying filtering techniques to remove tweets that are primarily written in MSA.
Employing two kinds of models, an SVM classifier and Transformer models (mBERT and
AraBERT), their strategy achieved a macro-averaged F1-score of 60.6%.
A survey on deep learning techniques for processing Arabic dialectal data as
well as an overview of the identification of Arabic dialects in text and speech
was conducted by Shoufan et al. [23].
Salameh et al. [21] proposed a method for classifying Arabic dialects using a
dataset that included 25 Arabic dialects from certain Arab cities, in addition to
Modern Standard Arabic. They experimented with several Multinomial Naive Bayes
(MNB) models, and their strategy was able to achieve an accuracy of 67.9% for
sentences with an average length of 7 words.
Malmasi et al. [17] proposed a method to identify a set of four regional Arabic
dialects (Egyptian, Gulf, Levantine, and North African) and Modern Standard
Arabic (MSA) in a transcribed speech corpus that has a total of 7,619 sentences
in the training set. They achieved an F1-score of 51.0% by employing an ensemble
learning technique between SVM models with different feature types.
Using six different deep learning techniques, including Convolution LSTM
and attention-based bidirectional recurrent neural networks, Elaraby et al. [10]
benchmarked the AOC dataset [25]. They tested the models in a variety of
scenarios, including binary and multi-way classification. Using different embeddings,
they attained an accuracy of 87.65% on the binary task (MSA vs. dialects),
87.4% on the multi-class task (Egyptian vs. Gulf vs. Levantine), and
82.45% on the 4-way task.
As a solution to the VarDial Evaluation Campaign’s Arabic Dialect Identifica-
tion task, Mohamed Ali [4] proposed three different character-level convolutional
neural network models with the same architecture aside from the first layer to
solve the task. MSA, Egyptian, Gulf, Levantine, and North African dialects are
among the five included in the dataset. The first model used a one-hot character
representation, the second used an embedding layer before the convolution layer,
and the third used a recurrent layer before the convolution layer, which produced
the best result, a 57.6% F1-score.
Obeid et al. proposed ADIDA [20], an automated method for identifying
Arabic dialects. The algorithm outputs a vector of probabilities indicating the
likelihood that an input sentence belongs to one of 25 dialects or MSA. They
employed the Multinomial Naive Bayes (MNB) classifier that Salameh et al. [21]
proposed using the MADAR corpus. They achieved a 67.9% accuracy.
The Nuanced Arabic Dialect Identification (NADI) 2020 [3] shared task,
which was divided into two sub-tasks of country-level identification and province-
level identification, was the subject of a pipeline proposed by El Mekki et al. [9].
For sub-task one, their pipeline consisted of a voting ensemble learning approach
with two models: the first model is AraBERT [5] with a softmax classifier, and the
second model is TF-IDF with word and character n-grams to represent tweets.
They achieved a 25.99% F1-score for the first sub-task, placing them second.
They also applied a hierarchical classification strategy for the second sub-task,
fine-tuning AraBERT for each country to predict its provinces after applying the
country-level identification, and they were ranked first with an F1-score of 6.39%.
Because the self-attention mechanism in Transformer models [24] captures
long-range dependencies, Lin et al. [15] assumed that a Transformer-based Arabic
dialect identification system would improve results compared with a CNN-based
system. They used the self-attention mechanism with down-sampling to reduce the
computational complexity. They evaluated their technique on the ADI17 dataset [22],
where it performed better than CNN-based models with an accuracy of 86.29%.
Issa et al. [13] proposed a solution to the country-level identification sub-task
of the second Nuanced Arabic Dialect Identification (NADI) shared task [3]. They
applied two models and assessed the results. A pre-trained CBOW word embedding was
utilised with an LSTM as the first model, while the second model used linguistic
features as low-dimensional feature embeddings fed through a simple feed-forward
network. Their F1-scores were 22.10% for the first model and 18.60% for the second,
demonstrating that the use of linguistic features did not improve the performance.
This review shows that most studies relied on the standard weighted cross-entropy
loss rather than conducting further research on the issue of data imbalance that
most Arabic datasets suffer from. They also concentrated on using each dialect's
unique linguistic characteristics in their pipelines. Following the literature [19],
the goal of this research is to overcome these prior constraints by employing and
evaluating different loss functions on the Arabic dialect identification task, in
order to handle the issue of data imbalance more effectively.
3 Dataset
We used a subset of the Arabic Online Commentary dataset (AOC) [25]. As
stated earlier, it was generated by gathering comments from three Arabic newspaper
publications, each of which represents one of the dialects as follows:
– Al-Ghad → Levantine.
– Al-Riyadh → Gulf.
– Al-Youm Al-Sabe’ → Egyptian.
4 Methods
In this section, we’ll go over the steps and techniques for the proposed method,
covering everything from data prepossessing and cleaning to the pre-trained
models overviews and ensemble learning techniques to the prediction output.
The goal of this stage was to analyse the text to make sure it was properly
formatted and devoid of unnecessary characters. Sample texts from the dataset are
shown in Table 1 along with their dialect. The steps we took are summarized below
(a minimal cleaning sketch follows the list):
– Arabic stopwords were eliminated in order to draw attention to the text's
most crucial information.
– In order to normalize the data, we substituted:
– We eliminated the punctuation and numerals because they are not useful for
dialect classification.
MARBERT-V2. One of the three models proposed in [2], but with a longer
sequence length of 512 tokens. These Transformer language models were pre-trained
on 1 billion Arabic tweets, randomly selected from a large dataset of approximately
6 billion tweets comprising 15.6 billion tokens, in order to improve transfer
learning on most Arabic dialects alongside MSA. The model has the same network
architecture as BERT Base (masked language model) but was trained without the
next-sentence-prediction component.
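A fine-tuning setup for these models can be sketched as follows with the Hugging Face transformers library; the hub identifiers and the number of labels are assumptions for illustration and should be checked against the corresponding model cards.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "UBC-NLP/MARBERTv2"   # assumed id; "aubmindlab/bert-base-arabertv02-twitter" for AraBERTv02-twitter
NUM_LABELS = 4                      # assumed: the three AOC dialects plus MSA

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

batch = tokenizer(["نص تجريبي"], padding=True, truncation=True, max_length=512, return_tensors="pt")
logits = model(**batch).logits      # shape: (batch_size, NUM_LABELS)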
focuses the model on difficult ones. We used the focal loss with a label smoothing
parameter to address the issue of overconfidence.
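A common multi-class formulation of the focal loss with label smoothing is sketched below in PyTorch; the gamma and smoothing values are illustrative assumptions, not the paper's exact settings.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, smoothing=0.1):
    """Multi-class focal loss with label smoothing (sketch)."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    with torch.no_grad():
        # Smoothed one-hot targets: 1 - smoothing on the true class, the rest spread uniformly.
        true_dist = torch.full_like(log_probs, smoothing / (num_classes - 1))
        true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    # Down-weight easy examples by (1 - p)^gamma before the cross-entropy term.
    loss = -(true_dist * (1.0 - probs).pow(gamma) * log_probs).sum(dim=-1)
    return loss.mean()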
Dice Loss (DL): The Dice coefficient [7], also known as the harmonic mean of
sensitivity and precision, is used to balance the two, since they have different
denominators but share the true positives in the numerator. In addition, the
denominator of the dice loss equation is changed to its squared form to speed up
convergence, as proposed in [18].
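A sketch of a soft dice loss with the squared denominator of [18] is shown below; the smoothing constant is an assumption.

import torch
import torch.nn.functional as F

def dice_loss(logits, targets, eps=1.0):
    """Soft dice loss over a batch, with squared denominator terms (sketch)."""
    probs = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    intersection = (probs * one_hot).sum(dim=0)
    denom = probs.pow(2).sum(dim=0) + one_hot.pow(2).sum(dim=0)   # squared form
    dice = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice.mean()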
II- Different Models: The second ensemble learning strategy combined the two
proposed pre-trained BERT models, MARBERT-V2 and AraBERTv02-twitter; the focal
loss was used as the loss function, as it achieved the best results on both models.
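The combination of the two fine-tuned models can be sketched as a soft-voting ensemble that averages their softmax outputs; the exact combination rule used in the paper may differ.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(models, batch):
    """Average the class probabilities of several fine-tuned models (soft voting)."""
    probs = [F.softmax(m(**batch).logits, dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)

# e.g. ensemble_predict([marbert_focal, arabert_focal], batch)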
Table 3. Results of applying the two ensemble learning methodologies, showing
that ensemble learning using the MARBERT-V2 model with different loss functions
achieved the highest macro-F1 score.
5.2 Discussion
In this section, we discuss the results of implementing the different models, loss
functions, performance metrics and ensemble learning strategies, and compare the
proposed model to previous models.
Loss Function: In both models, Focal Loss outperformed the other loss functions
in the macro-F1 score, achieving 81.0% with MARBERT-V2 and 79.4% with
AraBERTv02-t. This is due to the fact that Focal Loss down-weights the easy
examples so that, despite their huge number, their contribution to the overall loss
is minimal, hence addressing the issue of class imbalance better. As shown in
Table 4, the proposed ensemble learning model combined with different loss
functions such as Focal Loss and Dice Loss outperformed the previous models by a
significant margin, even when including the four dialects.
References
1. Abdelali, A., Mubarak, H., Samih, Y., Hassan, S., Darwish, K.: QADI: Arabic
dialect identification in the wild. In: Proceedings of the Sixth Arabic Natural Lan-
guage Processing Workshop, pp. 1–10 (2021)
2. Abdul-Mageed, M., Elmadany, A., Nagoudi, E.M.B.: ARBERT & MARBERT: deep bidi-
rectional transformers for Arabic. arXiv preprint arXiv:2101.01785 (2020)
3. Abdul-Mageed, M., Zhang, C., Bouamor, H., Habash, N.: NADI 2020: the first
nuanced Arabic dialect identification shared task. In: Proceedings of the Fifth
Arabic Natural Language Processing Workshop, pp. 97–110 (2020)
4. Ali, M.: Character level convolutional neural network for Arabic dialect identi-
fication. In: Proceedings of the Fifth Workshop on NLP for Similar Languages,
Varieties and Dialects (VarDial 2018), pp. 122–127 (2018)
5. Antoun, W., Baly, F., Hajj, H.: AraBERT: transformer-based model for Arabic lan-
guage understanding. arXiv preprint arXiv:2003.00104 (2020)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805 (2018)
7. Dice, L.R.: Measures of the amount of ecologic association between species. Ecology
26(3), 297–302 (1945)
8. El-Khair, I.A.: 1.5 billion words Arabic corpus. arXiv preprint arXiv:1611.04033
(2016)
9. El Mekki, A., Alami, A., Alami, H., Khoumsi, A., Berrada, I.: Weighted combi-
nation of BERT and n-gram features for nuanced Arabic dialect identification.
In: Proceedings of the Fifth Arabic Natural Language Processing Workshop, pp.
268–274 (2020)
10. Elaraby, M., Abdul-Mageed, M.: Deep models for Arabic dialect identification on
benchmarked data. In: Proceedings of the Fifth Workshop on NLP for Similar
Languages, Varieties and Dialects (VarDial 2018), pp. 263–274 (2018)
11. Elfardy, H., Diab, M.: Sentence level dialect identification in Arabic. In: Proceed-
ings of the 51st Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), pp. 456–461 (2013)
12. Hajiabadi, H., Molla-Aliod, D., Monsefi, R., Yazdi, H.S.: Combination of loss func-
tions for deep text classification. Int. J. Mach. Learn. Cybern. 11(4), 751–761
(2020)
13. Issa, E., AlShakhori, M., Al-Bahrani, R., Hahn-Powell, G.: Country-level Arabic
dialect identification using RNNs with and without linguistic features. In: Pro-
ceedings of the Sixth Arabic Natural Language Processing Workshop, pp. 276–281
(2021)
14. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object
detection. In: Proceedings of the IEEE International Conference on Computer
Vision, pp. 2980–2988 (2017)
15. Lin, W., Madhavi, M., Das, R.K., Li, H.: Transformer-based Arabic dialect identi-
fication. In: 2020 International Conference on Asian Language Processing (IALP),
pp. 192–196. IEEE (2020)
16. Lulu, L., Elnagar, A.: Automatic Arabic dialect classification using deep learning
models. Proc. Comput. Sci. 142, 262–269 (2018)
17. Malmasi, S., Zampieri, M.: Arabic dialect identification in speech transcripts. In:
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and
Dialects (VarDial3), pp. 106–113 (2016)
18. Milletari, F., Navab, N., Ahmadi, S.A.: V-net: fully convolutional neural networks
for volumetric medical image segmentation. In: 2016 Fourth International Confer-
ence on 3D Vision (3DV), pp. 565–571. IEEE (2016)
19. Mostafa, A., Mohamed, O., Ashraf, A.: GOF at Arabic hate speech 2022: breaking
the loss function convention for data-imbalanced Arabic offensive text detection. In:
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing
Tools with Shared Tasks on Qur’an QA and Fine-Grained Hate Speech Detec-
tion, pp. 167–175. European Language Resources Association, Marseille, France,
June 2022. http://www.lrec-conf.org/proceedings/lrec2022/workshops/OSACT/
pdf/2022.osact-1.21.pdf
20. Obeid, O., Salameh, M., Bouamor, H., Habash, N.: ADIDA: automatic dialect iden-
tification for Arabic. In: Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics (Demonstrations), pp.
6–11 (2019)
21. Salameh, M., Bouamor, H., Habash, N.: Fine-grained Arabic dialect identification.
In: Proceedings of the 27th International Conference on Computational Linguistics,
pp. 1332–1344 (2018)
22. Shon, S., Ali, A., Samih, Y., Mubarak, H., Glass, J.: ADI17: a fine-grained Ara-
bic dialect identification dataset. In: IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 8244–8248 (2020)
23. Shoufan, A., Alameri, S.: Natural language processing for dialectical Arabic: a
survey. In: Proceedings of the Second Workshop on Arabic Natural Language Pro-
cessing, pp. 36–48 (2015)
24. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information
Processing Systems, vol. 30 (2017)
25. Zaidan, O.F., Callison-Burch, C.: Arabic dialect identification. Comput. Linguist.
40(1), 171–202 (2014)
26. Zeroual, I., Goldhahn, D., Eckart, T., Lakhouaja, A.: OSIAN: open source
international Arabic news corpus-preparation and integration into the Clarin-
infrastructure. In: Proceedings of the Fourth Arabic Natural Language Processing
Workshop, pp. 175–182 (2019)
Emotion Recognition System for Arabic Speech:
Case Study Egyptian Accent
1 Introduction
Emotion expression is an important component of human communication, as it helps
in transferring feelings and offering feedback. Recently, interest in speech emotion
recognition (SER) systems has grown considerably. Speech emotion recognition systems
attempt to detect desired emotions using voice signals regardless of semantic content [1]. Advances
2 Literature Review
SER systems have undergone an enormous evolution over the past decade. The system
methodologies depend on the dataset used. Dataset types can be acted, elicited or
non-acted [7]. Furthermore, the datasets differ based on the speakers' gender and age [8].
Features used to describe the utterances are the key players in SER systems. A
set of features well suited to describing emotions can greatly enhance the recognition
success rate. Features from different domains are used in SER systems. Prosodic features
describe the intonation and rhythm of the speech signal, such as pitch, intensity and
fundamental frequency F0 [7, 8, 13]. Spectral features, represented by Mel spectrograms
and Mel-frequency cepstral coefficients (MFCC), are widely used in SER [14]. Advanced
techniques such as deep neural networks rely on MFCCs and spectrogram images to
train the system using CNNs and LSTMs [15]. Furthermore, there are linear prediction
coefficient (LPC) based features and voice quality features such as jitter and shimmer.
The selection of features for precise emotion recognition depends on the corpus used,
the language and the classification algorithm [10].
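As a rough illustration of such feature extraction, the sketch below computes frame-level spectral and prosodic descriptors with librosa [22] and summarizes each by its mean; the parameter values and the single mean functional are assumptions, not the full set of statistics used in this work.

import numpy as np
import librosa

def extract_features(wav_path, sr=44100, n_mfcc=13):
    """Mean-pooled MFCC, mel-spectrogram, chroma, spectral contrast, RMS and F0 features."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    rms = librosa.feature.rms(y=y)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)[np.newaxis, :]   # rough F0 contour
    return np.concatenate([f.mean(axis=1) for f in (mfcc, mel, chroma, contrast, rms, f0)])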
The INTERSPEECH 2009 Emotion Challenge feature set (IS09) [16] and the INTERSPEECH
2010 Paralinguistic Challenge feature set (IS10) [17] are considered benchmarks for
many SER systems. A feature set, IS09-10, generated by combining IS09 and IS10
features, was introduced by Klaylat et al. [10] and resulted in improvements in
some cases.
A wide variety of machine learning algorithms are used for classification in the SER
domain, such as Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), Support
Vector Machines (SVM), tree-based models such as Random Forest (RF), K-Nearest
Neighbors (K-NN), Logistic Regression (LR) and, more recently, Artificial Neural
Networks (ANN). The advantages and limitations of these algorithms are surveyed by
El Ayadi et al. [8] and Koolagudi et al. [18]. Recent research focuses on ensemble
learning and majority voting, combining the advantages of different classifiers to
create a model capable of enhancing the prediction results [19].
Artificial neural networks are being explored in SER. The literature shows that deep
neural networks need more optimization to be included in the speech emotion recognition
field; no significant improvement was reported when using these feature sets as input
to deep neural networks [20].
3 Methodology
The SER system includes three basic building phases, as shown in Fig. 1. The first
phase focuses on the chosen Arabic dataset (EYASE). Phase two comprises the construction
of the two proposed feature sets, feature normalization and feature selection analysis.
The third phase focuses on classification and employs five machine learning models:
MLP, SVM, Random Forest, Logistic Regression and Ensemble Learning.
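A minimal scikit-learn sketch of this classification phase is given below; the hyperparameters are illustrative assumptions, and the ensemble is shown as a soft-voting combination of the four base classifiers.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Base models; hyperparameters are not those used in the paper.
estimators = [
    ("mlp", MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)),
    ("svm", SVC(kernel="rbf", probability=True)),
    ("rf", RandomForestClassifier(n_estimators=200)),
    ("lr", LogisticRegression(max_iter=1000)),
]
ensemble = VotingClassifier(estimators=estimators, voting="soft")
# ensemble.fit(X_train, y_train); y_pred = ensemble.predict(X_test)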
3.1 Datasets
EYASE: The Egyptian Arabic Semi-natural Emotion speech database was created from a
drama series and features 3 male and 3 female actors aged 22–45 years, with 12
to 22 years of professional experience. It includes four basic emotions: angry, happy,
neutral and sad, in a total of 579 wav files with a sampling rate of 44.1 kHz [12].
Table 1. The set of features used in our research, the statistical functions used,
and the tools used for feature extraction.
Choosing the most suitable features to detect emotions from the speech signal is
considered a key factor in enhancing SER performance. Two algorithms were applied to
rank features according to their impact on the classification model used.
Classifier  Features
MLP         Spectral contrast (1), Chroma (4), RMS (1), MFCC (2), F0 (1), Pitch contour (1)
SVM         Chroma (2), MFCC (3), Spectral contrast (3), RMS (2)
RF          Mel spectrogram (6), MFCC (2), RMS (1), Chroma (1)
LR          Chroma (4), Spectral contrast (1), MFCC (3), Pitch contour (1), RMS (1)
The results show that MFCC is the most dominant feature across all the classifiers.
Mel-spectrogram features rank highest with Random Forest, which supports the
Information Gain results.
Machine learning algorithms perform better when features are normalized. This reduces
the effect of speaker variability and of different languages and recording conditions
on the recognition process. Normalization techniques include the Standard Scaler, the
Minimum and Maximum Scaler (MMS) and the Maximum Absolute Scaler (MAS) [26, 27]. MMS
is used in this work to normalize features to a 0-1 range by applying Eq. 1:

X_scaled = (X - min) / (max - min)   (1)

where X is the input features, min is the features' minimum value and max is the
features' maximum value. Both MMS and the Standard Scaler were applied and, by
comparing the results, MMS performed better.
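Equation 1 corresponds to scikit-learn's MinMaxScaler, as the small sketch below illustrates with toy data.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[2.0, 10.0], [4.0, 20.0], [6.0, 40.0]])
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # Eq. (1)
X_scaled = MinMaxScaler().fit_transform(X)                          # same result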
A Shapiro-Wilk test was performed on featureset-2 to accept or reject the null
hypothesis (H0) that the data has a normal distribution, using the estimated p-value [10].
The majority of the 122 features rejected H0, as their p-values were below the accepted
confidence level of 0.05. Visualizing the histograms confirmed that the data is more
skewed than normally distributed, as shown in Fig. 2.
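The per-feature normality test can be sketched with SciPy as follows; the 0.05 threshold matches the confidence level used above.

import numpy as np
from scipy.stats import shapiro

def non_normal_features(features, alpha=0.05):
    """Indices of feature columns for which the Shapiro-Wilk test rejects H0 (p < alpha)."""
    rejected = []
    for i in range(features.shape[1]):
        _, p_value = shapiro(features[:, i])
        if p_value < alpha:
            rejected.append(i)
    return rejected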
Precision: the proportion of samples predicted as an emotional class that actually
belong to it.

Precision = Tp / (Tp + Fp)   (3)

Recall: the proportion of samples of an emotional class that were correctly predicted.

Recall = Tp / (Tp + Fn)   (4)

Confusion Matrix: another way to analyze how many samples were misclassified by the
model, by comparing actual and predicted labels.
Fig. 3. Survey analysis results: the percentage of participants' votes for each
emotion, for Arabic and non-Arabic speakers.
1. Valence-Arousal classification
Valence and arousal are the two main dimensions defining emotions, as shown in Fig. 4
[30, 31].
(Acc = accuracy, Pre = precision, Rec = recall; RF = Random Forest, LR = Logistic Regression, Ens. = Ensemble Learning)

Feature-set    Metric   Valence                                  Arousal
                        MLP    SVM    RF     LR     Ens.         MLP    SVM    RF     LR     Ens.
Featureset-1   Acc      81.3   84.1   80.5   78.4   87.6         93.3   94.3   92.9   92.9   93.9
               Pre      81.8   84.5   81.2   79.2   88.0         93.3   94.3   93.0   93.1   94.0
               Rec      82.2   84.8   81.0   79.3   87.5         93.6   94.4   93.4   93.2   94.2
Featureset-2   Acc      85.1   86.5   85.8   82.3   87.6         94.6   95.3   93.6   92.6   95.6
               Pre      85.0   86.2   85.9   82.1   88.0         94.7   95.3   93.7   92.7   95.8
               Rec      85.0   86.3   85.8   82.3   87.5         94.4   95.2   93.7   92.6   95.4
IS10           Acc      79.1   84.4   79.8   82.2   83.7         95.0   95.0   93.3   95.6   95.6
               Pre      79.2   84.2   80.0   81.8   83.8         95.4   95.4   93.4   96.0   96.0
               Rec      78.7   84.7   79.8   82.2   83.5         94.7   94.7   93.0   95.4   95.4
IS-09          Acc      82.3   83.7   82.3   82.0   83.4         91.6   92.3   89.6   92.0   92.2
               Pre      82.3   83.3   82.4   81.6   83.0         92.6   92.6   89.5   92.3   92.8
               Rec      82.8   84.2   91.8   82.5   83.4         92.3   92.3   89.5   91.8   92.0
IS-09-10       Acc      78.7   82.3   82.0   78.4   81.2         94.6   94.0   94.0   95.3   95.0
               Pre      79.0   82.3   81.6   78.6   81.5         94.6   93.6   93.8   95.1   94.8
               Rec      79.2   82.9   82.0   79.0   81.5         94.8   93.7   94.0   95.5   95.1
2. Anger Detection
Anger recognition is the most essential need in emotion detection applications.
Anger is one of the most significant emotions to detect since it is commonly used
in contact centers and retail businesses to measure client satisfaction, as well as
in the medical field, for example to recognize whether a patient is in an angry state
based on his voice signal. The anger classification results are shown in Table 5:
SVM with IS09-10 had the greatest accuracy of 91.33%, MLP with featureset-2 achieved
an accuracy of 91%, and featureset-2 surpassed the other feature sets when using the
ensemble learning approach, with an accuracy of 90.00%. Across all of the feature sets,
SVM has the best average accuracy. Anger classification rates range from 84% to 91%.
Table 7 introduces a comparison between the proposed models using featureset-2 and
the previous results of Abdel-Hamid, who introduced the Egyptian dataset EYASE [12].
Using the same dataset ensures a fair measurement of the models' performance. She used
a feature set composed of prosodic, spectral and wavelet features, 49 features in total,
together with a linear SVM classifier and KNN for classification.
The arousal classification of the proposed model achieved a 1% enhancement using the
SVM classifier, and an enhancement of 2.2% when comparing the KNN results.
For valence classification, we achieved nearly the same result as their SER system
when using SVM, and an enhancement of 1.39% when our ensemble learning technique is
compared with their SVM SER; this amounts to an improvement of 4% when their KNN
results are compared with our model using either SVM or ensemble learning. For
anger classification, Lamia [8] combined many features, obtaining a best result of 91%
using prosodic, LTAS and wavelet features and a lowest result of 81% using MFCC and
formant features. Compared with our model,
which achieved 91% using MLP with featureset-2 and 90% using ensemble learning and
SVM. For multi-class classification, Lamia achieved 66.8% using SVM and 61.7% using KNN,
compared with our results of 64.61% using SVM and 64.07% using ensemble learning.
So we achieved a 3% enhancement over their KNN model, but they surpassed us by 2%
for the SVM classifier. The justification is that the LTAS and wavelet features are
very effective for the Arabic language in the case of multi-emotion classification;
Lamia reached the same conclusion when exploring feature importance in
multi-classification, where LTAS ranked high among the features. We therefore conclude
that, in multi-classification, the absence of LTAS slightly affects the performance,
but our target in this research is to have a baseline feature set for cross-corpus use,
not just Arabic, that is not computationally expensive.
Table 7. Comparison between our work using featureset-2 and previous SER research work
5 Conclusion
Different speech feature sets were used to train five different machine learning models.
The correlation between each feature and the classifier was investigated, as well as
which feature has the greatest impact on each classifier. MFCC was found to be one
of the most dominant features across the four classifiers. In comparison to the previous
SER, our model improved the Arabic results by 1-2%. SVM showed the best classification
results in many cases. MLP is a highly promising classifier, which confirms the current
research trend of neural networks and their different forms. Furthermore, ensemble
learning was effective and highly sensitive to the prediction rates of the other four
models, reflecting multiple classifiers' points of view. The new state-of-the-art
featureset-2, created and implemented as a mix of spectral and prosodic features,
outperformed previously benchmarked feature sets such as the Interspeech feature sets
IS09, IS10 and IS09-10. Furthermore, featureset-2 has the lowest computational time for
training the models compared to the other feature sets.
In the future, there’s a lot of opportunity for supplementing the model with additional
multilingual data and expanding the input dataset as much as possible. Apply novel
methods such as Convolutional neural network (CNN), LSTM, and transfer learning, as
well as deep learning methods. Furthermore, consider feeding the voice stream straight
into the neural network model without first extracting speech characteristics, which
might speed up the process. Implementing the transfer learning approach by considering
spectrogram images as an input feature to the CNN model.
References
1. Likitha, M.S., Gupta, S.R.R., Hasitha, K., Raju, A.U.: Speech based human emotion recog-
nition using MFCC. In: 2017 International Conference on Wireless Communications, Signal
Processing and Networking (WiSPNET), pp. 2257–2260 (2017). https://doi.org/10.1109/WiS
PNET.2017.8300161
2. Blumentals, E., Salimbajevs, A.: Emotion recognition in real-world support call center data
for Latvian language. In: CEUR Workshop Proceedings, vol. 3124 (2022)
3. Stankova, M., Mihova, P., Kamenski, T., Mehandjiiska, K.: Emotional understanding skills
training using educational computer game in children with autism spectrum disorder (ASD)
- case study. In: 2021 44th International Convention on Information, Communication and
Electronic Technology, MIPRO 2021 – Proceedings, pp. 672–677 (2021). https://doi.org/10.
23919/MIPRO52101.2021.9596882
4. Du, Y., Crespo, R.G., Martínez, O.S.: Human emotion recognition for enhanced performance
evaluation in e-learning. Prog. Artif. Intell. 1–13 (2022). https://doi.org/10.1007/S13748-022-
00278-2
5. Roberts, L.: Understanding the Mel Spectrogram (2020). https://medium.com/analyticsvid
hya/understanding-the-mel-spectrogram-fca2afa2ce53
6. Rashidan, M.A., et al.: Technology-assisted emotion recognition for autism spectrum disorder
(ASD) children: a systematic literature review. IEEE Access 9, 33638–33653 (2021)
7. Akçay, M.B., Oğuz, K.: Speech emotion recognition: emotional models, databases, features,
preprocessing methods, supporting modalities, and classifiers. Speech Commun. 116, 56–76
(2020)
8. El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features,
classification schemes, and databases. Pattern Recognit. 44(3), 572–587 (2011)
9. Mori, S., et al.: Emotional speech synthesis using subspace constraints in prosody. In: 2006
IEEE International Conference on Multimedia and Expo, pp. 1093–1096 (2006)
10. Klaylat, S., Osman, Z., Hamandi, L., Zantout, R.: Emotion recognition in Arabic speech.
Analog Integr. Circ. Sig. Process. 96(2), 337–351 (2018)
11. Szmigiera, M.: The most spoken languages worldwide 2021. https://www.statista.com/statis
tics/266808/the-most-spoken-languages-worldwide/
12. Abdel-Hamid, L.: Egyptian Arabic speech emotion recognition using prosodic, spectral and
wavelet features. Speech Commun. 122, 19–30 (2020)
13. Mirsamadi, S., Barsoum, E., Zhang, C.: Automatic speech emotion recognition using recurrent
neural networks with local attention. In: 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 2227–2231 (2017)
14. Lalitha, S., Geyasruti, D., Narayanan, R., Shravani, M.: Emotion detection using MFCC and
cepstrum features. Procedia Comput. Sci. 70, 29–35 (2015)
15. Araño, K.A., Gloor, P., Orsenigo, C., Vercellis, C.: When old meets new: emotion recognition
from speech signals. Cogn. Comput. 13(3), 771–783 (2021). https://doi.org/10.1007/s12559-
021-09865-2
16. Schuller, B., Steidl, S., Batliner, A.: The interspeech 2009 emotion challenge. In: INTER-
SPEECH (2010)
17. Schuller, B., et al.: The INTERSPEECH 2010 paralinguistic challenge. In: INTERSPEECH
(2010)
18. Koolagudi, S.G., Murthy, Y.V.S., Bhaskar, S.P.: Choice of a classifier, based on properties
of a dataset: case study-speech emotion recognition. Int. J. Speech Technol. 21(1), 167–183
(2018). https://doi.org/10.1007/s10772-018-9495-8
19. Bhavan, A., Chauhan, P., Shah, R.R.: Bagged support vector machines for emotion recognition
from speech. Knowl.-Based Syst. 184, 104886 (2019)
20. Yadav, S.P., Zaidi, S., Mishra, A., et al.: Survey on machine learning in speech emotion
recognition and vision systems using a recurrent neural network (RNN). Arch. Comput.
Methods Eng. 29, 1753–1770 (2022)
21. Langari, S., Marvi, H., Zahedi, M.: Efficient speech emotion recognition using modified
feature extraction. Inform. Med. Unlocked 20, 100424 (2020)
22. https://librosa.org/doc/latest/index.html
23. About openSMILE—openSMILE Documentation. https://audeering.github.io/opensmile/
about.html#capabilities. Accessed 18 May 2021
24. https://machinelearningmastery.com/information-gain-and-mutual-information/. Accessed
10 Dec 2020
25. Permutation feature importance with scikit-learn. https://scikit-learn.org/stable/modules/per
mutation_importance.html. Accessed 18 May 2021
26. Sefara, T.J.: The effects of normalisation methods on speech emotion recognition. In: Pro-
ceedings - 2019 International Multidisciplinary Information Technology and Engineering
Conference, IMITEC 2019 (2019)
27. Zehra, W., Javed, A.R., Jalil, Z., Khan, H.U., Gadekallu, T.R.: Cross corpus multi-lingual
speech emotion recognition using ensemble learning. Complex Intell. Syst. 7(4), 1845–1854
(2021)
28. Koduru, A., Valiveti, H.B., Budati, A.K.: Feature extraction algorithms to improve the speech
emotion recognition rate. Int. J. Speech Technol. 23(1), 45–55 (2020). https://doi.org/10.1007/
s10772-020-09672-4
29. Matsane, L., Jadhav, A., Ajoodha, R.: The use of automatic speech recognition in education
for identifying attitudes of the speakers. In: IEEE Asia-Pacific (2020)
30. Bestelmeyer, P.E.G., Kotz, S.A., Belin, P.: Effects of emotional valence and arousal on the
voice perception network. Soc. Cogn. Affect. Neurosci. 12(8), 1351–1358 (2017). https://doi.
org/10.1093/scan/nsx059. PMID: 28449127; PMCID: PMC5597854
31. Russell, J.A.: A circumplex model of affect. J. Personal Soc. Psychol. 39(6), 1161–1178
(1980)
Modelling
Towards the Strengthening of Capella
Modeling Semantics by Integrating
Event-B: A Rigorous Model-Based
Approach for Safety-Critical Systems
1 Introduction
With the rapid pace of change in our world, safety-critical systems have become more
and more complex, and the traditional engineering practices, which are mostly
document-driven, are no longer adequate to address the increasing complexity in
systems architecture. Modeling is the first step in the software development process
for understanding the requirements of the system. As a result, partial models of the
structure and behavior of the system are created, allowing developers to work at
various abstraction levels before beginning the programming process. The second step
is consistency verification, which must also be performed at an early design phase,
meaning that each refined model must be consistent with itself, with the previous
models, and with global constraints.
The core of MBSE is to construct the appropriate models for given system
specifications. For this reason, Capella [1] was a turning point for engineering
domains such as the energy, aerospace and automotive industries, in which it has
become increasingly recommended for modeling in recent years. Capella provides better
architecture quality and expresses systems at five different abstraction levels, each
one a refinement of the previous. These levels are Operational Analysis (OA), System
Analysis (SA), Logical Architecture (LA), Physical Architecture (PA) and End Product
Breakdown Structure (EPBS). In addition, Capella offers automatic traceability between
its various levels and supports an easier integration of structure and behaviour. In
this paper, we focus on the first modeling level, which captures the relevant
stakeholders of the operational context in which the system will be integrated. In
this phase, we chose the Operational Architecture (OAB) diagram, which allocates
activities to entities and actors in order to present a conceptual overview. The
benefit of this diagram is that it groups both the functional and behavioral
decompositions of the system. Hence, a rigorous model-based approach for
safety-critical systems starting from OAB models is needed.
So, in our work, we address an innovative challenge, namely the automatic
transformation of Capella models into Event-B [2] models, so that Capella models can
be verified using model checking, which is an important step towards establishing a
reliable development process. Hence, the present solution starts by identifying the
system requirements and presenting them as a Capella diagram. Then, a model
transformation is applied to automatically transform the Capella models into Event-B
specifications. Also, to the best of our knowledge, the proposed formalization
presented in this paper, based on the model transformation of a Capella OAB diagram
to Event-B, has not been developed so far.
The remainder of the paper is organised as follows. The next section presents the
state of the art, followed by the running example adopted in this work. Section 3 is
devoted to explaining the process and methodology of our approach, as well as the
verification process of our case study. Section 4 is dedicated to the tooling of a
proof of concept. Finally, Sect. 5 concludes the paper and presents future work.
In this section, we first discuss some existing related works, then we briefly
present Capella and Event-B. We also present an example that will be treated in our
contribution to facilitate the explanation.
2.1 Background
Software development methodologies such as Model-Driven Engineering (MDE) [3] are
considered an effective means of simplifying the design process. Towards developing
more abstract and more automated systems, MDE makes extensive and consistent use of
models at different levels of abstraction while designing systems. By using models,
it becomes possible to eliminate some useless details as well as to break down
complicated systems into smaller, simpler and manageable units. Thanks to the
separation between the business and technical components of the application, MDE
automates the generation of applications following the modification of the target
platforms. In this context, the work in [4] focuses on transforming an adaptive
run-time system model annotated with MARTE [5] elements into Event-B concepts using
the Acceleo [6] transformation engine. The work in [7] models a sewerage system as a
UML activity diagram, which is further converted into a Nondeterministic Finite
Automaton (NFA) [8] to effectively describe the system's handling of water. In the
following step, a formal model is created from the automata model using TLA+ [9],
which is then checked and validated by TLC, the model-checking feature included in
TLA+. The work proposed in [10] presents the formalization of a railway signaling
system, the European Rail Traffic Management System/European Train Control System
(ERTMS/ETCS). The functionalities and relationships of its several sub-systems are
modelled in UML and then translated into the Event-B language. The proof of
correctness of the end code is provided by ProB [11].
In contrast to the previous work, the research proposed in [12] uses Capella as a
modeling tool for Distributed Integrated Modular Avionics (DIMA). The choice of
Capella was motivated by the increasing maturity of the system, which makes DIMA
system architects face several problems during the design process due to the high
number of functions, such as function allocation and physical device allocation. The
authors in [13] offer a set of constructions and principles to abstract heterogeneous
models with the intention of being able to synchronize and compare them; they relied
on the System Structure Modeling Language (S2ML) [14] to ensure consistency between
system architecture models designed with Capella and safety models written in
AltaRica 3.0 [15] for a power supply system. The work in [16] introduces a
transformation approach from a Capella physical architecture to a software
architecture in AADL [17] using the Acceleo plugin, which was applied and validated
on a robotic demonstrator called TwIRTee, developed within the INGEQUIP project. In
that approach, each element of the source architecture is mapped to a new concept of
the target architecture. The work in [18] shows a model-to-model transformation
application which aims to verify the dynamic behavior model of Capella systems using
Simulink [19]. The use case chosen for this research is the "clock radio" system,
which is interpreted in the form of a Capella data-flow diagram (physical architecture
data-flow) and subsequently transformed into an executable Simulink model. Unlike the
research mentioned above, we propose in this paper a new approach which consists of
formalizing a Capella model, Capella being one of the most useful modeling solutions,
into Event-B, one of the most widely used formal languages.
3 Proposed Approach
Our approach contains two steps: (i) a preparatory step and (ii) a transformation
step. First, we start by defining the needs of the stakeholders and the system
functionality requirements. Next, we translate these informal requirements into a
Capella model. During this stage, users can inject rules into the model through
constraints, which must be preserved by each state of the system.
Once the Capella model and the translation implementation are ready, we can proceed
to the transformation step. As input to our generator, we pass a specific file
containing the model without the graphical part. Some of the Capella components will
not be formalized, because so far they do not participate in the state/transition
behavior of the system. As soon as our formal Event-B model is generated, it must be
validated and verified. In summary, our approach is depicted in Fig. 1.
Generated Results. Figure 5 shows an extract of the different parts of the generated
Event-B model. Our case study is quite large, so we focus on a specific sub-system
for a clearer explanation.
In our case study, we have an entity called CAR LIGHTS, which is responsible for
turning the vehicle lights on and off. Since it includes a couple of activities, the
following elements are generated: a set denominated "CAR LIGHTS" composed of two
elements {activeLights, inactiveLights}, a variable called "VAR CAR LIGHTS", an
invariant indicating that the created variable belongs to the set (membership
predicate), named "VAR CAR LIGHTS : CAR LIGHTS", and an initialisation event.
Fig. 7. The mapping of the Car Lights sub-system elements into Event-B concepts.
Validation and Verification. Our intention in this section is to validate the Event-B
model and to ensure that the invariants (typing and safety-property invariants) are
preserved across all events. This consists of checking whether a finite-state model of
a system meets certain specifications. Also, no new instances of the sets defined
earlier in the model are added, so the set of objects remains constant. To test
whether our Event-B model is valid, we applied it to a simplified scenario derived
from the case study illustrated in Sect. 2.3. We launch our scenario by starting the
engine, which activates the driver's front camera. This camera is used to predict the
driver's state and fatigue level. If the state of the driver is "normal", then he can
change the settings of the car to turn on the lights, and as a consequence his state
changes to "in Control". On the other hand, if the detected state is "drowsy", which
means that he is no longer capable of driving, the vehicle, the lights and the front
camera are turned off (to prevent accidents, for example). For the purpose of
demonstrating that the formal specification of the adaptive exterior lighting model is
correct, we use ProB to validate the Event-B model. Model checking is used here
instead of theorem proving, since the latter requires more effort and training.
Nevertheless, model checking is sufficient to check the system properties for a given
initial state, since the system has a finite state space.
A - Verification Using Model Checking. Model checking [23] is a formal verification
method that automatically and systematically checks whether a system description
conforms to specified properties. The behavior of the system is modeled formally, and
the specifications expressing the expected properties (safety, security, etc.) of the
system are also expressed formally using first-order logic formulas. All experiments
were conducted on a 64-bit PC running Windows 10, with an Intel Core i7 2.9 GHz
processor with 2 cores and 8 GB
of RAM. Using the ProB model checker with a mixed breadth-first and depth-first
search strategy, we explored all states: 100% of the states were checked, with 7
distinct states and 20 transitions. No invariant violation was found, and all the
operations were covered. This verification ensures that the invariants are preserved
by each event; otherwise, a counter-example would be generated.
B - Validation by Animation. ProB can function not only as a model checker but also
as an animator. The use of animation during verification is very important and can
detect a range of problems that can then be avoided, including unexpected behavior of
a model. The behavior of an Event-B machine can be dynamically visualized with the
ProB animator using different operational scenarios; besides, the animator can analyse
all of the reachable states of the machine to check the demonstrated properties. Based
on the animation of these scenarios, we can conclude that our specification has been
tested and validated. If this were not the case, we would have to go back to the
initial specification to find the conflicts, correct the unacceptable behaviors and
re-apply the animation to ensure the specification is aligned with the requirements.
4 Tooling
Figure 8 represents each step of our approach with its equivalent tool. The
preparatory step is devoted to the construction of the operational analysis model
using the Capella Studio tool. The transformation step is dedicated to the
transformation of the Capella model into the Event-B model using Acceleo. The last
step is devoted to the Event-B textual specification and to its verification using
ProB.
for creating model-based workbenches. Users can also enhance and customize the Capella
development artefacts (meta-models, diagrams). Many add-ons and viewpoints are already
integrated into Capella Studio. It includes the EMF technology for the management of
models, which are defined in the Ecore format. Ecore is a framework composed of a set
of concepts that can be manipulated by EMF to build a meta-model. Ecore shares a lot
of similarities with the UML class diagram, which is why it can basically be seen as
UML packages. Every package contains ontology elements (or UML classes) and "local"
element relationships (associations). Moreover, Capella Studio also includes the
Acceleo add-on, a template-based language for creating code-generation templates. In
addition to supporting OCL, this language provides a number of operations useful for
working with text-based documents. A set of powerful tools is bundled with Acceleo,
including an editor, code completion and refactoring tools, a debugger, error
detection and a traceability API.
References
1. Roques P.: Modélisation architecturale des systèmes avec la méthode Arcadia:
guide pratique de Capella, vol. 2, ISTE Group, 2018
2. Abrial, J.R.: Modeling in Event-B: system and software engineering. Cambridge
University Press (2010)
3. Schmidt, D.C.: Model-driven engineering. IEEE Computer 39(2), 25 (2006)
4. Fredj, N., Hadj Kacem, Y., Abid, M.: An event-based approach for formally veri-
fying runtime adaptive real-time systems. The Journal of Supercomputing 77(3),
3110–3143 (2021)
5. The ProMARTE consortium, UML profile for MARTE, beta 2, June 2008, OMG
document number : ptc/08-06-08
6. Brambilla, M., Cabot, J., Wimmer, M.: Model driven software engineering in prac-
tice. SynthLect. Softw. Eng. 3(1), 1–207 (2012)
7. Latif, S., Rehman, A., Zafar, N.A.: Modeling of sewerage system linking UML,
automata and TLA+. In 2018 International Conference on Computing, Electronic
and Electrical Engineering (ICE Cube), pp 1–6. IEEE (2018)
8. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory,
Language and Computation, Addison-Wesley, Reading (2001)
9. Cristiá, M.: A TLA+ encoding of DEVS models. In: Proceedings of the Interna-
tional Modeling and Simulation Multiconference, pp. 17–22 (2007)
10. Ait Wakrime, A., Ben Ayed, R., Collart-Dutilleul, S., Ledru, Y., Idani, A.: For-
malizing railway signaling system ERTMS/ETCS using UML/Event-B. In: Abdel-
wahed, E.H., Bellatreche, L., Golfarelli, M., Méry, D., Ordonez, C. (eds.) MEDI
2018. LNCS, vol. 11163, pp. 321–330. Springer, Cham (2018). https://doi.org/10.
1007/978-3-030-00856-7 21
11. Leuschel, M., Butler, M.: Prob: an automated analysis toolset for the b method.
Int. J. Softw. Tools Technol. Transf. 10(2), 185–203 (2008)
12. Batista, L., Hammami, O.: Capella based system engineering modelling and multi-
objective optimization of avionics systems. In: IEEE International Symposium on
Systems Engineering (ISSE), pp. 1–8. IEEE (2016)
13. Batteux, M., Prosvirnova, T., Rauzy, A.: Model synchronization: a formal frame-
work for the management of heterogeneous models. In: Papadopoulos, Y., Aslanse-
fat, K., Katsaros, P., Bozzano, M. (eds.) IMBSA 2019. LNCS, vol. 11842, pp.
157–172. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32872-6 11
14. Batteux, M., Prosvirnova, T., Rauzy, A.: System Structure Modeling Language
(S2ML) (2015)
15. Batteux, M., Prosvirnova, T., Rauzy, A.: Altarica 3.0 in 10 modeling patterns. Int.
J. Critic. Comput. Based Syst. (IJCCBS). 9, 133 (2019). https://doi.org/10.1504/
IJCCBS.2019.10020023
16. Ouni, B, Gaufillet, P., Jenn, E., Hugues, J.: Model driven engineering with Capella
and aadl. In: ERTSS 2016 (2016)
17. Architecture Analysis and Design Language (AADL), SAE standards. http://
standards.sae.org/as5506/
18. Duhil, C., Babau, J.P., Lépicier, E., Voirin, J.L., Navas, J.: Chaining model trans-
formations for system model verification: application to verify Capella model with
Simulink. In: 8th International Conference on Model-Driven Engineering and Soft-
ware Development, pp. 279–286. SCITEPRESS-Science and Technology Publica-
tions (2020)
19. Klee, H., Allen, R.: Simulation of Dynamic Systems with MATLAB and Simulink.
CRC Press, Boca Raton, February 2011
20. Houdek, F., Raschke, A.: Adaptive exterior light and speed control system. In:
Raschke, A., Méry, D., Houdek, F. (eds.) ABZ 2020. LNCS, vol. 12071, pp. 281–
301. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-48077-6 24
21. AbuAli, N., Abou-zeid, H.: Driver behavior modeling: Developments and future
directions. Int. J. Veh. Technol. 2016, 1–12 (2016)
22. Weixuan, S., Hong, Z., Chao, F., Yangzhen, F.: A method based on meta-model for
the translation from UML into Event-B. In: 2016 IEEE International Conference
on Software Quality, Reliability and Security Companion, pp. 271–277 (2016)
23. Clarke, E.M., Jr., Grumberg, O., Kroening, D., Peled, D., Veith, H.: Model Checking.
Cyber Physical Systems Series (2018)
A Reverse Design Framework
for Modifiable-off-the-Shelf Embedded
Systems: Application to Open-Source
Autopilots
1 Introduction
Nowadays, re-usability is an aspect that is more and more requested when developing
new systems. Indeed, many systems are constructed by integrating existing independent
systems from different stakeholders, leading to new systems of systems (SoS), as in
the UAV (Unmanned Air Vehicles) domain.
There is a growing interest in open and flexible architectures for UAV systems. A lot
of small and medium stakeholders propose new drone-based innovative services by
customising hardware and/or software parts. Moreover, from
The rest of this paper is organised as follows. Section 2 presents the background
and motivations of this work. Section 3 presents our contribution. Sections 4 and 5
present, respectively, the developed framework and its validation via the Paparazzi
autopilot. Finally, a conclusion and some perspectives are given.
to a plan. Given the sensor readings, an estimation of the actual state is made. A
state can be, for example, the angular speeds on each of the three axes, as well as
the air and/or ground horizontal speeds and the climb rate and, in some specific
cases, the actual angle values on the three axes. The difference between the
estimated state and the setpoint is the error, which is usually corrected by several
PID (Proportional, Integral, Derivative) or PD controllers, and more recently by INDI
(Incremental Non-Linear Dynamic Inversion) controllers. These controllers compute the
torques to apply on each axis, which are converted into individual positions of
control surfaces or commands of rotors through a mixing process. This allows the
controllers to be relatively independent of the frame geometry.
The design methods of a custom autopilot are thus different from what we observe for
other embedded systems. Most other embedded systems use a top-down approach in their
design life-cycle: from the requirements, a functional decomposition can be derived
independently of the hardware. Then the functions are mapped onto executable entities,
corresponding at the low level to processes and threads, which are themselves mapped
to CPUs, either with or without an operating system. This top-down approach has been
used for decades in several fields of embedded systems. Several methodologies are
based on it, starting with Structured Analysis for Real-Time (SA/RT) [10] in the
1980s, through Model-Driven Architecture (MDA) [11], launched by the Object Management
Group (OMG) in the early 2000s, to the ARCADIA method tooled by Capella in the 2010s
[12]. We also find this top-down approach in the automotive standard AUTOSAR [13], as
well as in avionics with the DO-178C standard [14].
The development of a UAV does not fit the top-down approach. It usually requires an
instance of an autopilot, which implies a hardware platform supported by the chosen
autopilot, an OS (or no OS) supported by the platform and the autopilot, compatible
and supported sensors and actuators, and a frame which can be completely custom-made
or based on an existing commercial off-the-shelf (COTS) frame. The added value can
then range from the specificity of the frame, to additional functions which are not
implemented in the open-source autopilot, to specific hardware. The development effort
consists in extending the instance of the autopilot to support the custom parts. In
some cases, for some
autopilots, this extension can be easy to integrate. The problem is that some
custom parts are not only difficult to integrate, but may also compromise the
smooth operation of the original autopilot or cause it to stop working.
3.2 Extraction
After configuration of the target hardware and frame, COTS autopilot binaries are
generated from C/C++ source code. In order to model an autopilot, it is thus necessary
to extract a model from the source code. The extraction step consists in retrieving
the necessary information from the source code, in the form of a parse tree, for the
purpose of visualisation. To retrieve the parse tree, it is necessary to parse the
code with a parser such as Lex and Yacc, or with more recent technologies such as
Xtext, ANTLR, etc., which generate parsers directly from a grammar expressed in a
compatible DSL. Once the parser is set up, we feed it the code and generate the parse
tree. The code behind the autopilots contains several C/C++ (.c/.cpp) and header (.h)
files, which makes the task of extracting the information needed for visualisation
difficult. To overcome this problem, we use the GCC compiler to generate GIMPLE code,
the low-level three-address abstract code produced during the compilation process,
which contains all the necessary information, including metadata about the functions
at the time of their definition, inside files with the ".lower" extension.
3.3 Visualisation
This step consists in visualising the multithreaded and functional composition as
well as the execution dataflow of the functions by showing the communications
between them. It takes as input the parse tree generated at the end of the
extraction step and gives as output a model that shows the set of threads, the
set of functions and their sub-functions, the order of the function calls and the
communications that happen between them.
The visualisation requires a text-to-model (T2M) transformation of the code,
and this must be done all along the traversal of the parse tree. As the tree is
being traversed, the elements of the tree are evaluated, and they are transformed
into equivalent model elements compatible with the chosen visualisation tool.
The objective of the visualisation is not only graphical: it also enables analysing
the modified code and checking whether it meets the non-functional requirements
(e.g., deadlines, end-to-end delays).
[Figure: overview of the code processing chain. A custom grammar is given as input to
ANTLR, which generates the lexer, tokens, parser, and listeners; the Paparazzi GIMPLE
files, which respect this grammar, are parsed to build the parse tree, which is then
transformed into AADL-like XML and injected into Capella for code visualisation.]
The processing consists of three layers. From top to bottom, the first layer is the
program that performs the tree traversal and its text-to-text transformation. This
program is built on top of the two other layers, which are provided by ANTLR, namely
the built parse tree as well as the generated bricks (lexer, parser, tokens, and
the listeners).
Building the parse tree consists in parsing the GIMPLE code (e.g., Paparazzi
GIMPLE files), which conforms to the GIMPLE grammar (see Listing 1.1), and
requires the three given components of the first layer, namely the Parser, the
Lexer, and the Tokens. Once the parse tree is built, it is then transformed into
XML code. This process requires the generated listeners of the first layer.
...
functionDefinition
    : attributeSpecifierSeq? declSpecifierSeq? declarator virtualSpecifierSeq? functionBody
    | gimplePreamble? declarator virtualSpecifierSeq? functionBody
    ;

functionBody
    : constructorInitializer? compoundStatement
    | functionTryBlock
    | Assign (Default | Delete) Semi
    ;
...
Listing 1.1. Code snippet of the GIMPLE grammar.
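To make the listener-based transformation more concrete, the following minimal Java
sketch shows how the ANTLR-generated bricks could be combined to walk a GIMPLE parse
tree and emit AADL-like XML elements. The names GimpleLexer, GimpleParser,
GimpleBaseListener, the start rule translationUnit, and the XML vocabulary are
assumptions: they depend on the actual grammar and on the Capella injection format,
and only illustrate the mechanism, not the authors' implementation.

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class GimpleToXml extends GimpleBaseListener {   // generated base listener (assumed name)
    private final StringBuilder xml = new StringBuilder();

    @Override
    public void enterFunctionDefinition(GimpleParser.FunctionDefinitionContext ctx) {
        // one XML element per GIMPLE function definition; for simplicity the
        // function name is taken from the first token of the definition
        xml.append("<function name=\"").append(ctx.getStart().getText()).append("\">\n");
    }

    @Override
    public void exitFunctionDefinition(GimpleParser.FunctionDefinitionContext ctx) {
        xml.append("</function>\n");
    }

    public static void main(String[] args) throws Exception {
        GimpleLexer lexer = new GimpleLexer(CharStreams.fromFileName(args[0]));
        GimpleParser parser = new GimpleParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.translationUnit();        // assumed start rule
        GimpleToXml listener = new GimpleToXml();
        ParseTreeWalker.DEFAULT.walk(listener, tree);     // triggers the enter/exit callbacks
        System.out.println(listener.xml);
    }
}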
[Figure 6: the AADL-like tool palette metamodel under the Capella Physical Architecture
level: PhysicalComponent, SoftwareBusInputPort/SoftwareBusOutputPort, AADLProcess,
AADLThread (activationMode, timeBudget, activationPeriod), AADLThreadInputPort and
AADLThreadOutputPort (portType), AADLFunction with AADLFunctionInputPort and
AADLFunctionOutputPort, ActivationModeEnum (periodic, sporadic, aperiodic, background,
hybrid, timed), and PortTypeEnum (data, event, eventData).]
The tool palette shown in Fig. 6 is inspired by the graphical aspect of
AADL [20]. It consists of some existing Capella artefacts such as the physical
component that we consider as a processing unit, the physical actor or device
which represents the sensors and actuators, and the physical link. It provides
the AADL-like artefacts such as threads with different variants (periodic, spo-
radic, aperiodic, etc.), the different inter-thread communications (synchronous,
asynchronous, and reset), the function component or sub-program as well as the
functional exchange that connects two different functions.
The functions are subprograms executed by the threads. The functions have ports that
constitute the functional exchanges, i.e. the means of communication between
the functions. Threads also have ports for inter-thread communication.
In the next section, we present how our GIMPLE interpreter creates an
AADL model from a Makefile, and illustrate it with examples obtained from
reverse engineering the Paparazzi autopilot.
5 Validation
This work has been tested on the Paparazzi autopilot for validation purposes with
the research group of the National School of Civil Aviation1 (ENAC) and the
Paparazzi founders, as part of the European Comp4Drones project2 . The developers
believe that customising autopilot instances using the framework presented in this
work will be much easier than directly accessing the code. We present hereafter
some excerpts of our framework's utilisation to modify the Paparazzi code.
The diagram elements are extracted along the traversal of the parse tree.
This operation requires the recognition of the code statements corresponding to
the diagram elements. For instance, the statement pthread_create() (see Line
2, Listing 1.2), responsible for the creation of a POSIX thread, is translated
to a thread in the diagram. The created thread takes the name of the function
executed by the thread, designated by the third argument of the pthread_create()
statement. Once the thread is created, the function is created inside it. To
create the sub-functions, the definition of the latter must be found first. Then,
every function call is transformed into a sub-function (see Listing 1.2, Lines 5–9).
1 // i2c_thread thread creation
2 _1 = pthread_create (&tid, 0B, i2c_thread, p);
3 ...
4 // i2c_thread function executed by the thread
5 i2c_thread (void * data) {
6   // sub-function
7   get_rt_prio ();
8   ...
9 }
Listing 1.2. GIMPLE code to be transformed
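As an illustration of the recognition rule described above, the following simplified
Java fragment extracts the thread name from a GIMPLE pthread_create statement. The
real tool works on the parse tree rather than on raw strings, so this regular-
expression sketch only mirrors the idea, not the authors' implementation.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadRule {
    // matches e.g.  "_1 = pthread_create (&tid, 0B, i2c_thread, p);"
    private static final Pattern PTHREAD_CREATE =
            Pattern.compile("pthread_create\\s*\\(([^,]+),([^,]+),\\s*([A-Za-z0-9_]+)");

    /** Returns the name of the thread element to create in the model, or null. */
    static String threadNameOf(String gimpleStatement) {
        Matcher m = PTHREAD_CREATE.matcher(gimpleStatement);
        return m.find() ? m.group(3) : null;   // third argument = executed function
    }
}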
Figure 7 shows the result of the reverse engineering process applied to the
Paparazzi code. It consists of multiple threads, where each thread is composed
of interconnected functions. Due to space limitations, the figure cannot be pre-
sented in its entirety in this article. However, we have zoomed in to show some
details.
1 http://optim.recherche.enac.fr/.
2 https://www.comp4drones.eu/.
6 Conclusion
This paper presented a model-based framework for reverse engineering allowing
the visualisation of the functional structure of a given input source code and
more precisely the source code of autopilots. The objective behind this work is
to make autopilots MOTS, i.e., software that can be customised according to
the user’s needs. Indeed, the source code of an autopilot can be visualised to be
well understood and to analyse the performance and non-functional properties
of the modification at an early design step. We believe that this framework
can sharply shorten the design process of MOTS software. The framework has
been demonstrated on a concrete example, namely the open source autopilot
Paparazzi, in the context of the European project Comp4Drones.
Acknowledgement. This work has received funding from the European Union’s Hori-
zon 2020 research and innovation program under grant agreement N. 826610.
References
1. Bin, H., Justice, A.: The design of an unmanned aerial vehicle based on the ardupi-
lot. Indian J. Sci. Technol. 2(4), 12–15 (2009)
2. Meier, L., Honegger, D., Pollefeys, M.: PX4: a node-based multithreaded open
source robotics framework for deeply embedded platforms. In: 2015 IEEE Inter-
national Conference on Robotics and Automation (ICRA), pp. 6235–6240. IEEE
(2015)
3. Brisset, P., Drouin, A., Gorraz, M., Huard, P.-S., Tyler, J.: The paparazzi solu-
tion. In: MAV 2006, 2nd US-European Competition and Workshop on Micro Air
Vehicles. Citeseer (2006)
4. Nouacer, R., Hussein, M., Espinoza, H., Ouhammou, Y., Ladeira, M., Castiñeira,
R.: Towards a framework of key technologies for drones. Microprocess. Microsyst.
77, 103142 (2020)
5. Butenhof, D.R.: Programming with POSIX Threads. Addison-Wesley Professional,
Boston (1997)
6. Nutt, G.: Nuttx operating system user’s manual (2014)
7. Capella: open source solution for model-based systems engineering. https://www.
polarsys.org/capella/. Accessed 01 Aug 2022
8. Feng, Q., Mookerjee, V.S., Sethi, S.P.: Application development using modifiable
off-the-shelf software: a model and extensions (2005)
9. Mousavidin, E., Silva, L.: Theorizing the configuration of modifiable off-the-shelf
software. Inf. Technol. People (2017)
10. Ross, D.T.: Structured analysis (SA): a language for communicating ideas. IEEE
Trans. Softw. Eng. SE-3(1), 16–34 (1977)
11. Brown, A.W.: Model driven architecture: principles and practice. Softw. Syst.
Model. 3(4), 314–327 (2004)
12. ARCADIA: a model-based engineering method. https://www.eclipse.org/capella/
arcadia.html. Accessed 01 Aug 2022
13. AUTOSAR. The standardized software framework for intelligent mobility
14. Brosgol, B.: DO-178C: the next avionics safety standard. ACM SIGAda Ada Lett.
31(3), 5–6 (2011)
15. Chikofsky, E.J., Cross, J.H.: Reverse engineering and design recovery: a taxonomy.
IEEE Softw. 7(1), 13–17 (1990)
16. Booch, G., Rumbaugh, J., Jacobson, I.: UML: Unified Modeling Language (1997)
17. Wood, J., Silver, D.: Joint Application Development. Wiley, Hoboken (1995)
18. Ferenc, R., Sim, S.E., Holt, R.C., Koschke, R., Gyimóthy, T.: Towards a stan-
dard schema for C/C++. In: Proceedings Eighth Working Conference on Reverse
Engineering, pp. 49–58. IEEE (2001)
19. Gimple (GNU compiler collection (GCC) internals). Accessed 01 Aug 2022
20. SAE: Architecture Analysis and Design Language V2.0 (AS5506), September 2008.
https://www.sei.cmu.edu/our-work/projects/display.cfm?customel_datapageid_4050=191439,
www.aadl.info
21. ANTLR (another tool for language recognition). https://www.antlr.org/. Accessed
01 Aug 2022
Efficient Checking of Timed Ordered
Anti-patterns over Graph-Encoded Event
Logs
Abstract. Event logs are used for a plethora of process analytics and
mining techniques. A class of these mining activities is conformance
(compliance) checking, where compliance requirements are expressed as
patterns and the goal is to identify violations of such patterns, i.e.,
anti-patterns. Several approaches have been proposed to
tackle this analysis task. These approaches have been based on differ-
ent data models and storage technologies of the event log including rela-
tional databases, graph databases, and proprietary formats. Graph-based
encoding of event logs is a promising direction that turns several process
analytic tasks into queries on the underlying graph. Compliance checking
is one class of such analysis tasks.
In this paper, we argue that encoding log data as graphs alone is
not enough to guarantee efficient processing of queries on this data. Effi-
ciency is important due to the interactive nature of compliance checking.
Thus, anti-pattern detection would benefit from sub-linear scanning of
the data. Moreover, as more data are added, e.g., new batches of logs
arrive, the data size should grow sub-linearly to optimize both the space
of storage and time for querying. We propose two encoding methods using
graph representations, realized in Neo4J & SQL Graph Database, and
show the benefits of these encodings on a special class of queries, namely
timed ordered anti-patterns. Compared to several baseline encodings, our
experiments show up to a 5x speed-up in querying time as well as a
3x reduction in the graph size.
1 Introduction
Organizations strive to enhance their business processes to achieve several goals:
increase customer satisfaction, gain more market share, reduce costs, and show
adherence to regulations among other goals. Process mining techniques [1] col-
lectively help organizations achieve these goals by analyzing execution logs
of organizations’ information systems. Execution logs, a.k.a. event logs, group
events representing the execution of process steps into process instances (cases).
2 Background
We formalize the concepts of events, traces, logs and graphs to help in under-
standing the formalization introduced later in the paper.
We reserve the first three properties in the event tuple to reflect the case, the
activity label, and the timestamp properties.
In general, graph data models can be classified into two major groups [13]:
directed edge-labeled graphs, e.g., RDF, and labeled property graphs. In the
context of this paper, we are interested in labeled property graphs as they provide
a richer model that represents the same data in a smaller graph size.
compliance checking, and compliance patterns are used for categorizing the types
of compliance requirements [22].
Occurrence patterns are concerned with activities having been executed
(Existence) or not (Absence) within a process instance. Order patterns are con-
cerned with the execution order between pairs of activities. The Response pattern
(e.g., Response(A, B)) states that if the execution of activity A is observed at
some point in a process instance, the execution of activity B must be observed
at some future point of the same case before the process instance is terminated.
A temporal window can further restrict these patterns. For instance, we need
to observe B after A in no more than a certain amount of time. Alternatively,
we need to observe B after observing A, where at least a certain amount of
time has elapsed. Both patterns can be further restricted by so-called exclude
constraint [4]. That is, between the observations of A and B, it is prohibited
to observe any of the activities listed in the exclude constraint. Definition 5
formalizes the Response pattern.
Conversely, the Precedes pattern (e.g., Precedes(A,B)) states that if the exe-
cution of activity B is observed at some point in the trace, A must have been
observed before (Definition 6).
However, in this paper, we focus on the core response and precedence patterns,
due to space limitations.
When checking for compliance, analysts are interested in identifying process
instances, i.e., cases that contain a violation, rather than those that are compli-
ant. Therefore, it is common in the literature about compliance checking to use
the term “anti-pattern” [16]. In the rest of this paper, we refer to anti-patterns
rather than patterns when presenting our approach to detecting violations over
graph-encoded event logs.
3 Related Work
There is vast literature about the business process compliance checking domain.
For our purposes, we focus on compliance checking over event logs; we refer to
this as auditing. For more details, the reader can check the survey in [12].
Auditing can be categorized in basic terms based on the perspective of the
process, including control flow, data, resources, or time. We can also split these
categories based on the formalism and technology that underpins them. Agrawal
et al. [3] presented one of the first works on compliance auditing, in which process
execution data is imported into relational databases and compliance is verified by
recognizing anomalous behavior. The technique covers control-flow-related aspects.
Validating process logs against control-flow and resource-aware compliance
requirements has been proposed while applying model checking techniques [2].
For control-flow and temporal rules, Ramezani et al. [19,20] suggest alignment-
based detection of compliance violations.
De Murillas et al. [17] present a metamodel and toolset for extracting process-
related data from operational systems logs, such as relational databases, and popu-
lating their metamodel. The authors show how different queries can be translated
into SQL. However, such queries are complex (using nesting, joins, and unions).
Relational databases have also been used for declarative process mining [23], which
can be seen as an option for checking logs against compliance rules.
Compliance violations, i.e., anti-patterns, can be checked with Match Recognize
(MR), the ANSI SQL operator. MR verifies patterns as regular expressions, where
the tuples of a table are the symbols of the string within which matches are searched. MR
runs linearly through the number of tuples in the table. In our case, the tuples are
the events in the log. In practice, the operational time can be enhanced by paral-
lelizing the processing, e.g., partitioning the tuples by the case identifier. Still, this
does not change the linearity of the match concerning the number of tuples in the
table. A recent work speeds up MR by using indexes in relational databases [15] for
strict contiguity patterns, i.e., patterns where events are in strict sequence. Order
compliance patterns frequently refer to eventuality rather than strict order, limit-
ing the use of indexes to accelerate the matching process.
Storing and querying event data into an integrated graph-based data struc-
ture has also been investigated. Esser et al. [9] provide a rich data model for
multi-dimensional event data using labeled property graphs realized on Neo4j as
a graph database engine. To check for compliance, the authors use path queries.
Such queries suffer from performance degradation when the distance between
activities in the trace gets longer and when the whole graph size gets larger.
Graph representation of event logs is a promising approach for event logs analy-
sis [5], especially for compliance checking [9]. This is due to the richness of this
graph representation model, mature database engines supporting it, e.g., Neo4J1 ,
and the declarative style of the query languages embraced by such engines, e.g.,
Cypher2 . In this sense, compliance checking can be mapped to queries against
the encoded log to identify violations.
We show how encoding of the event log has a significant effect on the efficiency
of answering compliance queries. We start from a baseline approach (Sect. 4.1)
and propose a graph encoding method, Sect. 4.2, that leverages the finite nature
of event logs to store the same event log in a smaller graph and answer compliance
queries faster.
Table 1 shows an excerpt of a log that serves as the input to the differ-
ent encoding methods. In the “Optional details” columns, the “StartTime” and
“CompleteTime” columns are converted to Unix timestamp.
1 https://neo4j.com/.
2 Cypher for Neo4J is like SQL for relational databases.
[Figure 1: (a) Nodes, edges, labels, and properties of the baseline encoding: :Case and
:Event nodes connected by :Event-to-case and :Directly-Follows relationships, with event
properties Activity (STRING), Resource (STRING), Start time (DATETIME), Complete time
(DATETIME), and Position (INTEGER); (b) representation of the log excerpt in Table 1.]
Events and cases constitute the nodes of the graph. Node types, i.e., events,
cases, etc., are distinguished through labels. Edges represent either structural
or behavioral relations. Structural relations represent event-to-case relations.
Behavioral relations represent the execution order among events in the same
case, referred to as directly-follows relationships. Activity labels, resource names,
activity lifecycle status, and timestamps are modeled as properties of the event
nodes. Similarly, case-level attributes are modeled as case node properties.
Figure 1a shows the representation of the Baseline graph.
Formally, for each log L, cf. Definition 3, a labeled property graph G, cf.
Definition 4, is constructed by the approach of Esser et al. [9] as follows:
1. Labels for the graph elements are constituted of four literals. Formally, L =
{event, case, event_to_case, directly_follows},
2. Keys for properties are the domain names from which values of the different
event attributes are drawn. Formally, K = ⋃_{i=1}^{m} {name(D_i)} ∪ {ID},
3. For each unique case in the log, there is a node in the graph that is labeled
as "case" and has a property ID that takes the value of the case identifier.
Formally, ∀c ∈ D_c ∃ n ∈ G.N : {case} ∈ label(n) ∧ prop(n, ID) = c,
4. For each event in the log, there is a node in the graph that is labeled as
"event" and inherits this event's properties. Formally, ∀e = (p_1, p_2, ..., p_m) ∈
L ∃ n ∈ G.N : {event} ∈ label(n) ∧ ∀_{2≤i≤m} D_i : prop(n, name(D_i)) = p_i,
Although the query is expressive and captures the semantics of the violation,
it is expensive to evaluate. To resolve those nodes, the processing engine must
scan the :Directly follows relation linearly. Another problem is the linear growth
of the graph size w.r.t. the log size. In Fig. 1b, we observe that each time an
activity occurs in the log, a distinct node is created in the graph.
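For illustration, a path query of this kind could be phrased in Cypher and executed
through the Neo4j Java driver as sketched below; it reports cases violating
Response(A, B) by checking that no B-event is transitively reachable from an A-event
over the directly-follows relation. Label, relationship, and property names (written
here with underscores) are assumptions based on Fig. 1a, and the connection settings
are placeholders.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class BaselineResponseCheck {
    public static void main(String[] args) {
        String cypher =
            "MATCH (a:Event {Activity: 'A'})-[:EVENT_TO_CASE]->(c:Case) " +
            "WHERE NOT (a)-[:DIRECTLY_FOLLOWS*]->(:Event {Activity: 'B'}) " +
            "RETURN DISTINCT c.ID";   // cases where A is never followed by B
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            session.run(cypher).list()
                   .forEach(r -> System.out.println(r.get("c.ID")));
        }
    }
}

The variable-length :DIRECTLY_FOLLOWS traversal in this sketch is exactly the linear
scan whose cost is discussed above.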
In the following subsection, we propose a concise representation of the event
log that improves both the space and time required to store and query the log.
Many compliance patterns are concerned with activities executed within the same case
and their ordering. When checking such rules against event traces, we can exploit
the finiteness of these traces and the positions of events within traces
to simplify the queries and speed up their evaluation by utilizing indexes and
skipping the linear scan of the :Directly follows relation among events. So, we
extend the baseline mapping by explicitly assigning a position property to each
event node. Table 1 has a highlighted column, tagged as added detail column,
where we assign each event to a position in the case (trace). For instance, the
fourth row in Table 1 records that activity ‘E’ has been the third activity to
be executed in case 1. Thus, the position property value is 3. We can observe
that the check for ordering explicitly refers to the position property of the event
nodes without the expensive transitive closure traversal. We follow the same
formalism shown in Sect. 4.1, except for encoding the directly_follows relation,
as we add the explicit position property to event nodes. The dropping of such a
relation positively affects the graph size.
Although the position property simplifies the processing of compliance
queries, it inherits the linear growth of the graph size w.r.t the log size. To fur-
ther limit the growth of the graph size, we modify the construction of the labeled
property graph. This section’s proposed encoding ensures a linear growth with
the size of the set of activity labels, i.e., Da . We generate a separate edge con-
necting a case node to the corresponding node representing the activity a ∈ Da
and add properties to the edges that reflect the position, timestamp, resource,
etc. These events’ properties represent the activity’s execution in the respective
case. Formally, for each log L, a labeled property graph G is constructed as
follows:
1. Labels for the graph elements are constituted of case and activity labels.
Formally, L = {case, event_to_case} ∪ D_a,
2. Keys for properties are the domain names from which values of the different
event attributes are drawn. Formally, K = ⋃_{i=1}^{m} {name(D_i)} ∪ {ID},
3. For each unique case in the log, there is a node in the graph labeled as
"case" with a property ID that takes the value of the case identifier. Formally,
∀c ∈ D_c ∃ n ∈ G.N : {case} ∈ label(n) ∧ prop(n, ID) = c,
4. For each unique activity in the log, there is a node in the graph labeled as
"activity". Formally, ∀a ∈ D_a ∃ n ∈ G.N : {activity} ∈ label(n),
5. The structural relation between an event and its case is represented by
a labeled relation between the activity node of the event's activity and
the case node. Additionally, all event-level properties are mapped to properties
on the edge. Formally, ∀e = (p_1, p_2, ..., p_m) ∈ L ∃ r ∈ G.E :
r = (node(e.a), node(e.c)) ∧ {event_to_case} ∈ label(r) ∧ ∀_{3≤i≤m} D_i :
prop(r, name(D_i)) = p_i.
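As an illustration of this construction, the following sketch (using the Neo4j Java
driver, with assumed label and property names) merges the case and activity nodes
once and adds one event_to_case edge per event, carrying the event-level properties.

import org.neo4j.driver.Session;
import static org.neo4j.driver.Values.parameters;

class UaEncoder {
    static void addEvent(Session session, String caseId, String activity,
                         long position, long startTime) {
        session.run(
            "MERGE (c:Case {ID: $caseId}) " +
            "MERGE (a:Activity {name: $activity}) " +
            "CREATE (a)-[:EVENT_TO_CASE {Position: $position, StartTime: $startTime}]->(c)",
            parameters("caseId", caseId, "activity", activity,
                       "position", position, "startTime", startTime));
    }
}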
Figure 2 visualizes the graph resulting from encoding the log excerpt in
Table 1 using the unique activities method. For example, for activity E, there is
only one node and four different edges connecting to cases 1, 2, and 3. Two of
these edges connect case 3, as activity E was executed twice in this case.
[Figure 2: event-level details (e.g., Resource: John, StartTime: 1612256400,
CompleteTime: 1612346400, Activity: C, Position: 4) are carried as properties of the
edges of the unique-activities graph.]
Listing 2 shows the modification on the Precedence anti-pattern query. The query
checks the ordering of the events using the position property, which is accessed
in Line 2. With this encoding, the ordering check becomes an index-friendly property
comparison instead of a path traversal.
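A sketch of such a position-based check is given below; it is not the Listing 2
referred to above, and the label and property names are again assumptions. It marks
as violations of Precedes(A, B) the cases in which B occurs without an earlier A,
by comparing the Position values stored on the event_to_case edges.

public class PositionPrecedesCheck {
    // cases where B occurs without an earlier A, using Position on the edges
    // of the unique-activities encoding; can be run with the driver as above
    static final String CYPHER =
        "MATCH (b:Activity {name: 'B'})-[rb:EVENT_TO_CASE]->(c:Case) " +
        "WHERE NOT EXISTS { " +
        "  MATCH (a:Activity {name: 'A'})-[ra:EVENT_TO_CASE]->(c) " +
        "  WHERE ra.Position < rb.Position " +
        "} " +
        "RETURN DISTINCT c.ID";
}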
5 Evaluation
This section reports the evaluation of the method we proposed to encode event
logs as graphs. We compare our method, UA, against the baseline method BM.
In addition, we compare the storage of event logs in relational databases. The
relational table consists of four columns to store the case ID, the activity, the
timestamp of the event and the position of each event within a case. To detect
compliance violations, we evaluate two approaches. The first uses common SQL
operators such as joins and nested queries (NQ). The second uses the analytical
Match Recognize (MR) operator.
We selected four real-life logs to evaluate our experiments: three logs from the BPI
challenges, namely BPIC'12 [8], BPIC'14 [7], and BPIC'19 [10], and the RTFMP log [21].
We considered these logs as they expose different characteristics, as summarized in
Table 2.

Table 2. Logs characteristics

Logs      #Traces   #Events   #UA
BPIC'12   13087     262200    24
BPIC'14   41353     369485    9
BPIC'19   220810    979942    8
RTFMP     150370    561470    11
In the first experiment, we report on the loading time of the logs following the
respective encoding, i.e., loading into Neo4J, SQL graph database (SGD), and
the relational database (RDB). For each log, we report the loading time and the
number of nodes and edges created in the graph database (Table 3).
When loading large logs following the BM method, Neo4J crashed with an out-of-
memory error due to the large amount of data. This is the case for the BPIC'14,
BPIC’19, and the RTFMP logs. We have examined several subsets of these logs.
The number of cases reported in Table 3 corresponds to the maximum size that
could be loaded using the Neo4J configuration we mentioned earlier. For the UA,
SGD, and RDB encoding, all the data are loaded into the database for the full
log sizes. For the common log sizes, graph-based encoding using SGD is superior
3 SQL Server does not support Match Recognize. Thus, we used Oracle.
4 We will report later on experiments with the exclude property, i.e., S ≠ φ.
5 https://github.com/nesmayoussef/Graph-Encoded-Methods.
Table 3. Loading time (seconds) for each encoding method. [LT: Loading Time, #N:
number of nodes, #E: number of edges]

                    BM                                        UA                                      RDB
Logs      #Cases    #N        #E        Neo4j LT   SGD LT     #N       #E       Neo4j LT   SGD LT     LT
BPIC'12   13087     177597    315933    16         56.6       13111    164510   6.7        17.4       1641
BPIC'14   15000     148883    252766    13         3.7        15009    133883   9.9        1.2        890
          41353     410833    697607    —          11.7       41362    369480   9.4        2.9        2447
BPIC'19   25000     135933    196866    13         2.3        25008    110933   4.9        1.3        960
          220810    1197804   1733178   —          49         220818   976994   19.7       11.8       8153
RTFMP     50000     236633    323266    21         3.8        50011    186633   10         1.7        1250
          150370    711810    972510    —          55.9       150011   560046   19         6.7        3731
Table 4. Execution time (msec) for the variants of the Precedes queries [B: Before
time window, W: Within time window]

                    SGD                        Neo4J                     NQ             MR
                    BM           UA            BM         UA
Logs      #Cases    B      W     B      W      B     W    B     W        B      W       B      W
BPIC'12   13087     668    665   309    296    74    138  47    30       124    251     571    723
BPIC'14   15000     526    775   281    351    253   85   58    28       137    178     432    633
          41373     1354   1626  596    762    —     —    17    29       352    925     1202   1759
BPIC'19   25000     694    668   575    174    76    68   36    54       206    96      1006   634
          220810    6112   5885  3638   2519   —     —    12    78       2355   813     9154   57704
RTFMP     50000     1015   778   437    378    138   118  61    92       203    187     1106   477
          150370    3584   2551  1403   1087   —     —    71    219      649    484     3352   1447
For the precedence anti-patterns, in the case of the UA method, the reduction
of execution time goes up to 14x, as in the case of the BPIC'12 log for the Before
time window.
Table 5. Execution time (msec) for the variants of the Response queries [A: After time
window, W: Within time window]

                    SGD                        Neo4J                     NQ             MR
                    BM           UA            BM         UA
Logs      #Cases    A      W     A      W      A     W    A     W        A      W       A       W
BPIC'12   13087     374    381   274    484    112   39   59    12       1374   293     747     953
BPIC'14   15000     391    399   291    278    71    46   17    26       307    151     425     712
          41373     962    1122  651    824    —     —    28    32       838    366     1181    1979
BPIC'19   25000     554    412   315    275    29    20   12    16       6077   93      1267    931
          220810    4874   3629  3544   2814   —     —    54    123      824    1009    11523   8461
RTFMP     50000     1362   542   931    368    62    91   14    31       916    147     1433    942
          150370    4357   1669  2082   1062   —     —    51    121      1355   434     4345    2824
[Figure 3: Processing Time (sec.) versus the number of matching cases (# of cases) for
the Response-with-exclude queries: (a) with a time window, (b) without a time window.]
Comparing Neo4J to SGD, Neo4J is faster. The best gain of UA in Neo4J
compared to SGD is 303x in the case of the BPIC’19 log for the Before time
window and 65x in the case of the BPIC’19 log for the After time window,
Table 4 and Table 5, respectively. This shows the superiority of native graph
databases compared to the graph extension of relational databases.
In the third experiment, we run Response with exclude, i.e., Response(A, B, {C}, Δt, <),
anti-pattern queries against the BPIC'15 log [11]. This log contains 1199 cases with
52217 events and 398 unique activities. We chose this log
due to its large number of unique activities. We empirically validate that the
proposed method still gives the best execution time. This experiment was run
five times with different activities and time windows for the different encoding
methods/storage engines.
Figures 3a and b report the execution time of the queries, with and without
time window, respectively. We show on the x-axis the query results sorted by the
matching number of cases. Obviously, the UA method shows the best scalability,
as the number of matching cases (process instances) is a function of both the
input log size and the anti-pattern query.
Overall, for the different types of anti-pattern queries, the graph-based encod-
ing of event logs outperforms the relational database encoding. This aligns with
recent directions to employ graph databases for process analytics [9]. Addition-
ally, the UA encoding method we propose improves query time and storage space
against the baseline BM graph encoding method.
References
1. van der Aalst, W.: Process Mining. Springer, Heidelberg (2016). https://doi.org/
10.1007/978-3-662-49851-4
2. van der Aalst, W.M.P., de Beer, H.T., van Dongen, B.F.: Process mining and
verification of properties: an approach based on temporal logic. In: Meersman, R.,
Tari, Z. (eds.) OTM 2005. LNCS, vol. 3760, pp. 130–147. Springer, Heidelberg
(2005). https://doi.org/10.1007/11575771 11
3. Agrawal, R., et al.: Taming compliance with Sarbanes-Oxley internal controls using
database technology. In: ICDE, pp. 92–92. IEEE (2006)
4. Awad, A., Weidlich, M., Weske, M.: Visually specifying compliance rules and
explaining their violations for business processes. J. Vis. Lang. Comput. 22(1),
30–55 (2011)
1 Introduction
Mobile application development is one of the most highly required and demanded areas
in information technology. This is a result of the huge number of smartphones and tablets
that have created a huge and competitive market for mobile applications. Consequently,
the demand for mobile application developers is increasing all over the world [1]. As a
result of the variety of mobile operating systems and the need for the mobile application
to operate similarly on all these operating systems, there exist different cross-platform
solutions that enable developers to write their application once and run it on different
platforms.
Although they are time and cost-effective, cross-platform solutions are known for
their relatively low performance when compared to native applications. Several commer-
cial/research tools have been introduced as a solution for cross-platform development
[2–6]. In [5] Salama et al. proposed a trans-compiler-based approach for converting
native Android applications' source code to native iOS applications' source code. How-
ever, they only succeeded in converting the backend Java source code to Swift source
code without including Android features, which resulted in a low conversion rate. In [6]
the solution in [5] was enhanced to support some of the Android features like connecting
UI views to the backend, responding to user events, and Android intents.
However, the enhanced solution did not support converting database connections,
which is an essential module in most of our daily used mobile applications. According
to [7], the number of available mobile applications until the first quarter of 2021 on
the two main app stores (Google Play and the Apple App Store) are 3.48 million and
2.22 million, respectively. Out of these millions of apps in the app stores, it would be
difficult to find one that does not require a database or some sort of storing or handling
data in a particular way. Therefore, almost all our daily used mobile applications require
storage and management of data including querying the stored data to retrieve certain
information.
Hence, mobile databases are considered an essential part of most mobile application
development. Therefore, a native mobile application converter tool that does not sup-
port database connections’ conversion is missing an essential functionality for mobile
applications.
Enhancing the existing solution to support database connections’ conversion will
noticeably enhance the performance of the converter and will help make the needed
human modifications to the converted code minimal.
The contribution of this paper is to propose a new trans-compiler-based database
code conversion model that aims to extend the solutions in [5] and [6] to support the
conversion of Firebase Firestore [8] database connections, evaluating the conversion
accuracy results and the improvement achieved from the proposed solution extension.
The outline structure for the rest of this paper is as follows: Sect. 2 represents the
related work. Section 3 gives a background on mobile databases. Section 4 presents the
methodology for developing the proposed model and applying it to enhance TCAIOSC.
Section 5 presents the results for the database code conversion model and the effect of
these results on the solution performance. Finally, Sect. 6 presents the conclusion and
future work.
2 Literature Reviews
There are many papers that present and categorize either the mobile application devel-
opment approaches [9–15] or the different mobile application types [15–18] or evalu-
ate different approaches, tools, or solutions [12, 19–24]. This section presents common
approaches and application types for mobile application development. Then it mentions
the former attempts at supporting database-related code conversion for tools converting
native-to-native applications.
3 Background
There are two main types of databases that are used in mobile application development
which are SQL and NoSQL databases. Each type has its own advantages and disadvan-
tages. Therefore, selecting the right database for a mobile application depends on the
type of developed application. Table 1 shows the advantages and disadvantages of each
type.
Fig. 1. Percentages of used Android database libraries in total apps and total number
of installs on the Play Store (leading values: 87.9%, 83.2%, 77.05%, 68.44%; remaining
libraries below 2.2% each).
In the world of Android and iOS development, there are many database frameworks
under different database types like SQLite, Realm, Firebase, CoreData (iOS), and others.
According to AppBrain [9], among database libraries used in Android applications on
the Play Store, Firebase came in second place after the Android Architecture Components,
with 68.44% of apps and 83.20% of installs on the Play Store. Therefore, Firebase
Firestore was chosen over the other libraries to test the applicability of the solution.
Figure 1 shows the AppBrain statistics for the most used
database libraries in Android.
4 Methodology
TCAIOSC has successfully provided two code conversion units, one for backend code
conversion and the other for UI code conversion. These code conversion units have
proven to successfully convert backend and frontend code when tested on simple
applications, converted from Android to iOS, that did not use database connections. Although the
architecture for TCAIOSC implies that these converters can be easily extended to support
any library/API, this supposition is built on the assumption that the Android and iOS code
will always have a one-to-one lexical mapping. In this section, the enhanced methodol-
ogy to support database libraries and the proposed enhanced solution’s architecture are
presented.
In practice, mapping between the two platforms or languages is not always a direct one-
to-one relationship. For example, an Android Java code to achieve a simple functionality
such as initializing an instance of Cloud Firestore, from a compiler’s perspective, can
be implemented as a variable declaration including a method call in Java while it is
implemented in Swift as a method call followed by a variable declaration including a
method call as shown in Table 2.
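As an illustration of this mismatch, based on the Firebase documentation rather than
on the paper's table, the Cloud Firestore initialization can be written as follows;
the Swift counterpart is shown as comments for comparison.

import com.google.firebase.firestore.FirebaseFirestore;

class FirestoreInit {
    void init() {
        // Android (Java): a single variable declaration that includes a method call
        FirebaseFirestore db = FirebaseFirestore.getInstance();

        // iOS (Swift) equivalent, for comparison:
        //   FirebaseApp.configure()          // method call
        //   let db = Firestore.firestore()   // variable declaration with a method call
    }
}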
The proposed methodology for database code conversion merges between direct
(one-to-one) mapping and indirect (one-to-many or many-to-many) pattern matching.
In this approach, a set of patterns for the database library are collected from the Firebase
Firestore official documentation [28] and predefined to match the input source code
against.
Table 3. Examples for direct and indirect mapping between Android and iOS
In the second example (one-to-many), the trans-compiler first tries to convert the
statement as a normal variable declaration, including a method call. It will check the
variable type and used method against the compiler’s defined data types and methods
to get the corresponding data type and method in Swift. Then, the pattern matcher will
attempt to match the statement against the defined patterns and add the app configuration
statement in Swift.
In the third example (many-to-many), the trans-compiler will try to convert the
statement using normal trans-compilation. It will match the object type and methods
against the compiler’s defined data types and methods. Then, it will reach the listener
statements with no equivalent direct mapping. Then, the pattern matcher module will
read the pattern and match it to the defined corresponding pattern in Swift.
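A concrete instance of this many-to-many case, taken from the public Firestore API, is
the snapshot listener: on Android it is an anonymous EventListener, whereas in Swift it
is a trailing closure, so there is no statement-by-statement mapping and a pattern-level
rule is needed. The collection name "cities" is purely illustrative.

import com.google.firebase.firestore.EventListener;
import com.google.firebase.firestore.FirebaseFirestore;
import com.google.firebase.firestore.FirebaseFirestoreException;
import com.google.firebase.firestore.QuerySnapshot;

class ListenerExample {
    void listen(FirebaseFirestore db) {
        db.collection("cities").addSnapshotListener(new EventListener<QuerySnapshot>() {
            @Override
            public void onEvent(QuerySnapshot snapshots, FirebaseFirestoreException e) {
                if (e != null) {
                    return; // listen failed
                }
                // handle the updated snapshots here
            }
        });
        // Swift counterpart, for comparison:
        //   db.collection("cities").addSnapshotListener { querySnapshot, error in
        //       ...
        //   }
    }
}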
Java Lexer and Parser. The Java lexer and parser components are generated using
ANTLR (ANother Tool for Language Recognition) [29], which is a parser generating
tool that is passed the grammar file for the Java language to produce the source language
tokens. The generated tokens are passed to the parser, which constructs the parse tree
according to the grammar file and generates interfaces that are used in the backend
converter to traverse the parse tree during the backend code conversion process.
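A minimal sketch of this generated pipeline is shown below; the class names JavaLexer
and JavaParser and the start rule compilationUnit depend on the grammar file given to
ANTLR and are therefore assumptions.

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

class BackendParsing {
    static ParseTree parse(String javaFile) throws Exception {
        JavaLexer lexer = new JavaLexer(CharStreams.fromFileName(javaFile)); // tokens
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        JavaParser parser = new JavaParser(tokens);                          // parse tree
        return parser.compilationUnit(); // traversed later by the backend converter
    }
}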
UI-related information found in activity files that are used by the UI converter during
UI code conversion.
Firebase Pattern Matching Module. This module is introduced as a solution for indi-
rect Firestore statements’ mapping. It uses a set of predefined Firestore patterns in Java
and their equivalent Firestore code patterns in Swift. If a Firestore statement or statements
were not converted using direct mapping, they are passed from the backend converter to
the pattern matching module to be checked for one-to-many or many-to-many mapping.
Firebase Detector. The Firebase detector has two main roles, one concerning the con-
version of the code itself and the other concerning the testing and evaluation of the con-
verted Firestore code and calculating the improvement rate for the tool after supporting
the Firestore conversion.
This module is used by the backend converter to determine whether a certain state-
ment is Firestore-related or not. Whenever a statement is passed to the detector, it checks
keywords in the statement against a set of Firestore keywords that have been previously
defined by a developer and stored in the database. This set of keywords includes one-
level keywords like library names and data types and two-level keywords like methods
that belong under a certain library, where the method name is considered the first level
and the parent library is considered the second level.
The advantage of this module's design is that it is generic and extensible: it can
easily be turned into a general database statement detector, detecting not only
Firestore but any database statement, by simply extending the set of keywords with
keywords of other database frameworks, up to any number of levels.
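The following Java sketch illustrates the keyword-based detection described above,
with one-level and two-level keywords; the class shape and the way keywords are
represented are assumptions, since the tool loads them from its database.

import java.util.Map;
import java.util.Set;

class FirestoreDetector {
    private final Set<String> oneLevelKeywords;           // e.g. library names, data types
    private final Map<String, String> twoLevelKeywords;   // method name -> parent library

    FirestoreDetector(Set<String> oneLevel, Map<String, String> twoLevel) {
        this.oneLevelKeywords = oneLevel;
        this.twoLevelKeywords = twoLevel;
    }

    /** Returns true if the statement contains a Firestore-related keyword. */
    boolean isFirestoreStatement(String statement) {
        for (String kw : oneLevelKeywords) {
            if (statement.contains(kw)) {
                return true;
            }
        }
        for (Map.Entry<String, String> e : twoLevelKeywords.entrySet()) {
            // a two-level keyword matches only when the method name appears
            // together with its parent library in the statement
            if (statement.contains(e.getKey()) && statement.contains(e.getValue())) {
                return true;
            }
        }
        return false;
    }
}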
XML Lexer and Parser. The XML lexer and parser components are also generated
using ANTLR using the grammar file for XML language. Then it generates tokens that
are passed to the parser, which constructs the XML parse tree and generates interfaces
that are used in the UI converter to traverse the parse tree during the UI code conversion
process.
UI Converter. The UI converter maps each XML file that represents an activity in the
Android project into a scene in the iOS project. The different scenes resulting from
different XML activity files are then grouped together into one Storyboard file by the
UI converter and passed to the control unit.
For the UI converter to convert the UI code, it needs both the UI parse tree and the
backend parse tree in order to handle the UI-related code that existed in the backend
(.JAVA) code files and was previously identified by the backend converter.
The UI converter, much like the backend converter, is responsible for implementing
the interfaces that are produced by the XML parser. This implementation is then used
to build the output UI code for iOS, which is then passed to the controller unit.
Databases. The database contains all the necessary mapping data to complete the back-
end, UI, and Firestore conversions. It includes mappings for data types, methods and
their signatures (parameters and their types), methods return types, libraries, operators,
static built-in functions, UI data types, observers, and a defined set of Firestore key-
words that the Firestore detector module uses to determine whether a given statement is
a Firestore statement.
In this section, three evaluation requirements are presented. The first is to evaluate the
success in converting the database code; the second is to evaluate different applications’
code conversion rates before and after applying the proposed enhancement to measure
the size of the enhancement; and the third is to compare the runtime of the solution on
different applications before and after integrating the proposed enhancement.
A set of open-source native Android applications were selected from GitHub to test
the solution. Table 4 lists the sample applications that are used to test the solution. The
criteria for selecting the test applications set were:
Selecting Most Recent Open-Source Applications. To guarantee that the test samples
include the most recent Firestore features and avoid deprecated code in old applications,
all the selected applications were published on GitHub after 2019.
Selecting Applications from Different Categories. To test the tool’s performance for
different and broad types of applications and to test the generalization of the results.
Selecting Android Java Applications Only. Since TCAIOSC only supports the con-
version of Android Java applications, not Kotlin.
The same metric used by TCAIOSC to calculate the percentage of converted code was
adopted, that is, the percentage of successfully converted statements. This was done
to establish consistency between TCAIOSC and the enhanced solution. Also, it was
adopted to keep the integrity of the results when calculating the improvement in the
second part of the evaluation. Table 5 presents the percentage of successfully converted
statements for the set of test applications. The equation for calculating this percentage
is as follows:
Number of converted firestore statments
Firestore Conversion % = × 100 (1)
Total number of firestorestataments
After analyzing the results for converting Firestore statements, the statements that
were not converted were due to:
• Using Consecutive Multiple Listeners in Android: This pattern has no direct equivalent
in iOS.
• Partially Converted Statements: Some statements were partially converted, where the
Firestore part of the statement was converted. However, the statement uses another
unsupported feature/data type in TCAIOSC, which results in the whole statement
being counted as not converted.
• Unmatched Statements: Some Firestore statements did not match any of the defined
patterns (this can be improved by adding more patterns to the set of predefined
patterns).
5.2 The Improvement of the Overall Conversion Rate for an Entire Application
The overall improvement in the conversion rate was calculated by converting the same
set of test applications by TCAIOSC before and after the support of Firestore. Table 6
compares the backend conversion results for TCAIOSC before and after supporting the
Firestore library.
Table 6. Comparison between TCAIOSC’s results before and after supporting firestore
Considering that the application conversion is a one-time process, the observed
increase in runtime can be accepted, since the total conversion time is still
relatively small, producing the mobile application source code in seconds.
Table 7. Comparison between TCAIOSC’s runtime before and after supporting Firestore
• Using Consecutive Multiple Listeners in Android: the impact of this limitation can be
minimized by defining a standardized way for the tool to handle certain code in the
source language or platform that has no equivalent in the target language or platform.
• Partially Converted Statements: this can be solved/minimized by supporting more
backend features in the original tool of TCAIOSC.
• Unmatched Statements: can be improved by defining and adding more patterns to the
set of predefined patterns.
References
1. Montandon, J.E., Politowski, C., Silva, L.L., Valente, M.T., Petrillo, F., Guéhéneuc, Y.G.:
What skills do IT companies look for in new developers? A study with stack overflow jobs.
Inf. Softw. Technol. 129(August), 2021 (2020). https://doi.org/10.1016/j.infsof.2020.106429
2. Cordova. https://cordova.apache.org/
3. xamarin. https://dotnet.microsoft.com/en-us/apps/xamarin
4. J2OBJC. https://developers.google.com/j2objc
5. Salama, D.I., Hamza, R.B., Kamel, M.I., Muhammad, A.A., Yousef, A.H.: TCAIOSC: Trans-
compiler based android to iOS converter. In: Hassanien, A.E., Shaalan, K., Tolba, M.F. (eds.)
AISI 2019. AISC, vol. 1058, pp. 842–851. Springer, Cham (2020). https://doi.org/10.1007/
978-3-030-31129-2_77
6. Hamza, R.B., Salama, D.I., Kamel, M.I., Yousef, A.H.: CAIOSC: application code conversion.
In: 2019 Novel Intelligent and Leading Emerging Sciences Conference (NILES), vol. 1,
pp. 230–234 (2019). https://doi.org/10.1109/NILES.2019.8909207
7. Google Play Store: number of apps 2021 | Statista. https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/ (accessed Sep. 04, 2021)
8. Firebase. https://firebase.google.com/
9. Perchat, J., Desertot, M., Lecomte, S.: Component based framework to create mobile cross-
platform applications. Procedia Comput. Sci. 19, 1004–1011 (2013). https://doi.org/10.1016/
j.procs.2013.06.140
10. Rahul Raj, C.P., Tolety, S.B.: A study on approaches to build cross-platform mobile appli-
cations and criteria to select appropriate approach. In: 2012 Annual IEEE India Conference
INDICON 2012, pp. 625–629 (2012). https://doi.org/10.1109/INDCON.2012.6420693
11. Ribeiro, A., Da Silva, A.R.: Survey on cross-platforms and languages for mobile apps. In:
Proceedings of 2012 8th International Conference on Quality Information Communication
Technology QUATIC 2012, pp. 255–260 (2012). https://doi.org/10.1109/QUATIC.2012.56
12. Heitkötter, H., Hanschke, S., Majchrzak, T.A.: Evaluating cross-platform development
approaches for mobile applications. In: Web Information Systems and Technologies,
pp. 120–138 (2013)
13. El-Kassas, W.S., Abdullah, B.A., Yousef, A.H., Wahba, A.: ICPMD: integrated cross-platform
mobile development solution. In: Proceedings of 2014 9th IEEE International Conference on
Computer Engineering and Systems, ICCES 2014, pp. 307–317 (2014). https://doi.org/10.
1109/ICCES.2014.7030977
14. El-Kassas, W.S., Abdullah, B.A., Yousef, A.H., Wahba, A.M.: Enhanced code conversion
approach for the Integrated Cross-Platform Mobile Development (ICPMD). IEEE Trans.
Softw. Eng. 42(11), 1036–1053 (2016). https://doi.org/10.1109/TSE.2016.2543223
15. El-Kassas, W.S., Abdullah, B.A., Yousef, A.H., Wahba, A.M.: Taxonomy of cross-platform
mobile applications development approaches. Ain Shams Eng. J. 8(2), 163–190 (2017).
https://doi.org/10.1016/j.asej.2015.08.004
16. Smutný, P.: Mobile development tools and cross-platform solutions. In: 2012 13th Interna-
tional Carpathian Control Conference, ICCC 2012, pp. 653–656 (2012). https://doi.org/10.
1109/CarpathianCC.2012.6228727
17. Ohrt, J., Turau, V.: Cross-platform development tools for smartphone applications. Comput.
(Long. Beach. Calif) 45(9), 72–79 (2012). https://doi.org/10.1109/MC.2012.121
18. Xanthopoulos, S., Xinogalos, S.: A comparative analysis of cross-platform development
approaches for mobile applications. In: BCI 2013: Proceedings of the 6th Balkan Conference
in Informatics, September 2013. https://doi.org/10.1145/2490257.2490292
19. Rieger, C., Majchrzak, T.A.: Weighted evaluation framework for cross-platform app develop-
ment approaches. In: Wrycza, S. (ed.) SIGSAND/PLAIS 2016. LNBIP, vol. 264, pp. 18–39.
Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46642-2_2
20. Rieger, C., Majchrzak, T.A.: Towards the definitive evaluation framework for cross-platform
app development approaches. J. Syst. Softw. 153, 175–199 (2019). https://doi.org/10.1016/j.
jss.2019.04.001
21. Jobe, W.: Native apps Vs. mobile web apps. Int. J. Interact. Mob. Technol. 7(4), 27 (2013).
https://doi.org/10.3991/ijim.v7i4.3226
22. Mohammadi, F., Jahid, J.: Comparing native and hybrid applications with focus on features.
p. 49 (2016)
23. Pulasthi, L.K., Gunawardhana, D.: Native or web or hybrid, which is better for mobile
application. Turkish J. Comput. Math. Educ. Res. Artic. 12(6), 4643–4649 (2021)
24. Nawrocki, P., Wrona, K., Marczak, M., Sniezynski, B.: A comparison of native and cross-
platform frameworks for mobile applications. Comput. (Long. Beach. Calif). 54(3), 18–27
(2021). https://doi.org/10.1109/MC.2020.2983893
25. Umuhoza, E., Brambilla, M.: Model driven development approaches for mobile applications:
a survey. In: Younas, M., Awan, I., Kryvinska, N., Strauss, C., Thanh, D.V. (eds.) MobiWIS
2016. LNCS, vol. 9847, pp. 93–107. Springer, Cham (2016). https://doi.org/10.1007/978-3-
319-44215-0_8
26. Heitkötter, H., Majchrzak, T.A., Kuchen, H.: Cross-platform model-driven development of
mobile applications with MD 2. In: Proceedings on ACM Symposium on Applied Computing.
SAC, pp. 526–533 (2013). https://doi.org/10.1145/2480362.2480464
27. Mechdome. http://www.mechdome.com/
28. Firestore. https://firebase.google.com/docs/firestore
29. Parr, T.: The Definitive ANTLR 4 Reference. Pragmatic Bookshelf (2013)
MDMSD4IoT a Model Driven
Microservice Development for IoT
Systems
Abstract. Nowadays, IoT systems are widely used, they are embedded
with sensors, software, and technologies enabling communication and
automated control. The development of such applications is a complex
task. Therefore, we have to use a simplified methodology and a flex-
ible and scalable architecture to build and run such applications. On
the one hand, Model-driven development (MDD) provides significant
advantages over traditional development methods in terms of abstrac-
tion, automation, and ease of conception. On the other hand, microser-
vice architecture (MSA) is one of the booming concepts for large-scale
and complex IoT systems, it promises quick and flawless software man-
agement compared to monolithic architectures. In this paper, we present
MDMSD4IoT, a model-driven microservice architecture development for
IoT, based on the SysML4IoT profile and combining Model Driven Devel-
opment and microservice architecture. We illustrate our contribution
through a smart classroom case study.
1 Introduction
The field of IoT is growing exponentially and has sparked a revolution in the
industrial world [16]. It refers to an emerging paradigm that allows the intercon-
nection of physical devices equipped with sensing, networking, and processing
capabilities to collect and exchange data. These things connected to each other
form a much larger system and enable new ubiquitous and pervasive comput-
ing services [23]. Therefore, the development of IoT systems is very challenging
due to their complexity [6] and the lack of IoT development methodologies and
appropriate application architecture style.
Several modeling languages and tools based on the Model Development Engi-
neering approach [26] have been proposed to design and develop complex soft-
ware systems through meta-modeling, model transformation, code generation,
and automatic execution. Thus, Model-Driven Development (MDD) [24] and
Model Driven Architecture (MDA) [5] are classified as Model Driven Engineer-
ing (MDE) [25]. MDD is a development paradigm that uses models as the primary
artifact of the development process and is a subset of MDE; MDA is a subset of MDD
and relies on the use of OMG standards [4]. The latter proposes several specification
languages; the most relevant one for IoT-based applications is the OMG Systems
Modeling Language SysML [12], a general-purpose graphical modelling language for
specifying, analyzing, designing, and verifying complex systems that may include
hardware, software, information, etc. It has been conceived as a profile of UML [13].
Furthermore, IoT requires heavy integration between devices, data, and
applications. This integration is becoming increasingly costly, complex, and time-
consuming, especially with a monolithic architecture. These problems could be
reduced considerably by using a microservices architecture, since it structures an
application as a collection of small, modular, independently deployable, and
loosely coupled services [14,22].
In order to provide some solutions to these IoT systems engineering issues,
we believe that we have to take advantage of the most convenient and relevant
approaches, methodologies, languages, and tools for IoT application development
and combine them in an original way. Therefore, our proposed approach is based
on: 1) SysML profile for IoT [18] based on the architecture reference model IoT-
ARM [3,8], 2) microservices architecture (MSA) [20], 3) Model Driven Architec-
ture [5] and 4) a methodology for IoT development process adopted in recent IoT
projects [7]. Our approach is called MDMSD4IoT for Model Driven Microservices
Architecture Development for IoT systems. It aims to design and develop IoT sys-
tems in an efficient and flexible way. Indeed, firstly, Systems Modeling Language
(SysML) [18] is the most popular tool for model-based development, it allows
modeling physical aspects like hardware, sensors, etc. Secondly, the microser-
vices architecture [20] defines an application as a set of small autonomous ser-
vices, which is very suitable for IoT systems thanks to its characteristics such as
weak coupling, modularity, flexibility, independent deployment, and resilience.
Thirdly, model-driven architecture (MDA) [5] is a paradigm that promotes the
use of models to solve software engineering problems through techniques such as
abstraction and automation. Finally, the IoT design methodology proposed in [7,18] consists of a set of development steps and is based on the Architecture Reference Model (ARM) [8]. In this paper, we extend this design methodology with two steps related to the definition and the specification of microservices.
We illustrate our approach through the development of “UC2SmartClassroom”, a smart classroom for our university that uses a fingerprint module for secure access and automatic resource management (e.g., smart lighting).
This paper is outlined as follows: Sect. 2 presents a summary of some related work. Section 3 describes the MDMSD4IoT approach. Section 5 illustrates the proposed approach through the UC2SmartClassroom case study, which is a smart classroom prototype for the University of Constantine 2. Finally, we conclude the paper and present some future work.
2 Related Work
In order to address the above-mentioned challenges and be able to develop IoT applications efficiently, we focus on the most relevant related work, namely work based on the IoT Architecture Reference Model, the SysML4IoT profile, the Microservice Architecture (MSA), and Model-Driven Development (MDD).
The goal of the IoT Architectural Reference Model project [3] is to provide
developers with common technical basics and a set of guidelines for building
interoperable IoT systems. The architectural reference model (ARM) [8] pro-
vides the highest level of abstraction for the definition of IoT systems. Among these models, the IoT domain model provides the concepts and definitions on which IoT architectures can be built.
In [11] IDeA (IoT DevProcess & IoT AppFramework) is proposed as a model-
based systems engineering methodology for IoT application development. It
focuses on modelling and considers the model as the primary artifact for systems
development. The main objective of this methodology is to provide a high-level
abstraction to deal with the heterogeneity of software and hardware components.
IDeA is composed of a method, called IoT DevProcess, and a support tool,
called IoT AppFramework. The IoT DevProcess is used for the design of IoT applications; to support its activities, the IoT AppFramework provides a SysML profile for IoT applications called SysML4IoT [11], which is strongly based on the IoT-ARM domain model presented previously.
In [19], a model-driven approach is proposed to ease the modeling and realization of adaptive IoT systems. It is based on SysML4IoT (an extension of SysML) to specify the system functions and adaptations; the generated code is later deployed on the hardware platform of the system. A smart lighting system is developed as a case study.
A model-driven environment called CHESSIoT is presented in [21] to design and analyze industrial IoT systems. It follows a multi-view, component-based modeling approach with a comprehensive way to perform event-based modeling on system components for code generation purposes, employing an intermediate ThingML model [17]. An industrial real-time safety use case is designed and analyzed.
The authors of [7] have proposed a generic design methodology for IoT systems that is independent of any specific product, service, or programming language and allows designers to compare various alternatives for IoT system components. The presented methodology is largely based on the IoT-ARM reference model [8]; it focuses on the domain model to describe the main concepts of the system to be designed and to help designers understand the IoT domain for which the system must be designed. The methodology consists of the specification of objectives and requirements, the process, the domain model, the information model, services, the IoT level, the functional view, the operational view, the integration of devices and components, and finally, application development.
The authors in [9] propose the FloWare approach and its toolchain, which combine Software Product Line and Flow-Based Programming paradigms to manage the complexity of the various stages of the IoT application development process.
The final IoT application and the executable Node-RED flow are generated
using an automatic transformation procedure starting from a configuration of
the designed Feature Models.
In [10] a model-driven integrated approach is provided to exploit traceability
relationships between the monitored data of a microservice-based running sys-
tem and its architectural model to derive recommended refactoring actions that
lead to performance improvement. The approach has been applied and validated on e-commerce and ticket reservation case studies, and the architectural models have been designed in UML profiled with MARTE.
In [15], the authors explain how typical Microservice Architecture (MSA) problems can be addressed using Model-Driven Development techniques such as abstraction, model transformation, and modelling viewpoints. Indeed, MDD offers several
advantages in terms of service development, integration and migration. However,
MDD is rarely applied in SOA engineering. Nonetheless, the authors claim that
the use of MSA greatly facilitates and fosters the usage of MDD in microservice
development processes. They list these characteristics according to Newman [22] and correlate them with the means of MDD; for example, Service Identification
is supported by model transformation, Technology Heterogeneity is supported
by abstraction and code generation techniques, while Organizational Alignment
is supported by modelling viewpoints.
Table 1 presents a comparison with some related work. We have classified
them according to the following four criteria: IoT context, Model driven devel-
opment and MDA approach, microservice architecture style, and modelling with
SysML language. We notice that none of the existing approaches takes all four criteria into account at the same time, namely a microservices architecture combined with SysML modelling (for the IoT hardware parts, etc.) and the use of the MDA approach.
MDMSD4IoT facilitates the specification and the design of complex IoT systems
through the SysML4IoTMSA profile. This profile provides stereotypes to repre-
sent the concepts of IoT and microservices and their associations. According to
SysML4IoTMSA, a microservices-based IoT system is made up of four parts:
user, microservices, environment, and hardware. Figure 2 details the stereotypes
defined in the profile. SysML Block is the principal extended element, since SysML conceptualizes it as a modular unit of the system that is used to define a type of software, hardware, or data element of the system, or a composition of them [11]. The user part (in yellow) represents the users of the system (client-side application). The environment and materials part (in green) represents the physical and hardware aspects. It is related to the application domain concepts; for example, the building automation domain is expressed in terms of floors and rooms. The main concepts are as follows:
In our approach, we have extended the methodology for the development of IoT systems proposed in [7] by adding new steps related to microservices (step D: definition of the microservice architecture, and step E: microservice specifications) using SysML diagrams. In the following, we briefly explain the development steps:
a) Objective and requirements analysis: the first step is to define the objective
and the main requirements of the system. In this step, we describe our system,
why it is designed, and its expected functionalities.
b) Requirement specification: in this step, the use cases of the IoT system
are described and derived from the first step. It consists of extracting the
functional requirements of the system and modelling them with the SysML
requirement diagram. In this step, the system functionalities are grouped into
domains.
c) Process specification: describes the behavioral aspect of the system and how it works, through a SysML activity diagram.
d) Definition of the microservice architecture: the fourth step is to define the
logical architecture of the system. Designers must break down the system
into fundamental architecture building blocks that represent microservices,
each of which is highly cohesive and encapsulates a unique business capability. A
microservice can be a functional type or an infrastructure type. This step
represents the interactions between the different microservices according to
the topology of the architecture.
e) Microservice Specification: consists of defining the microservice specifications. It models each functional microservice within the architecture (defined in the previous step) using a block definition diagram and describes the main concepts (interface, entities, service, operations, and the relationships between them), in order to understand each microservice domain.
f) Functional View Specification: defines the Functional View (FV), i.e., the different functionalities of the IoT system grouped into functional groups (FG). In this step the system is represented in layers; each layer is mapped to one or more functional groups according to its functionalities.
g) Operational view specification: consists of defining the operational view. The various options related to the deployment and the operation of the system are defined, such as the options for hosting services, storage, and devices.
h) Integration of devices and components: this step installs and integrates the various devices and components.
i) Application Development: this step involves developing the entire IoT appli-
cation (backend and frontend), testing it and deploying it.
The system interacts with the classroom environment and with the system
administrator or the teacher.
Microservices make it possible to associate each service with the resources and devices (sensors and actuators) required to carry it out. For each functional microservice, we develop a block definition diagram to extract the offered services and to define the responsibilities, i.e., to associate services with their required resources. The microservices specification also shows the devices that host the resources.
Spring Cloud Netflix offers the Zuul proxy to forward requests received by the API Gateway to the functional microservices. In a microservices architecture hosted in the cloud, it is difficult to anticipate the number of instances of the same microservice (which depends on the load) or even where they will be deployed (and therefore on which IP address and port they will be accessible). The role of the Eureka server is to connect microservices: each microservice registers itself and retrieves the addresses of the services it depends on. The microservices (as well as the other servers) load their application configuration from the Config Server, whose role is to centralize the configuration files of the microservices in a single place. The configuration files are versioned in a Git repository containing a common configuration for all microservices and a specific configuration for each microservice (in this case the Git Bash tool is used as a version manager). Changing the configuration no longer requires rebuilding or redeploying the applications; a simple restart is sufficient.
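For illustration only (this is not the implementation developed for UC2SmartClassroom), the following minimal sketch shows how the three infrastructure services described above are typically declared with Spring Cloud Netflix. The class names and the single-file layout are ours; each application would normally live in its own project with the corresponding starter dependencies.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.config.server.EnableConfigServer;
import org.springframework.cloud.netflix.eureka.server.EnableEurekaServer;
import org.springframework.cloud.netflix.zuul.EnableZuulProxy;

@SpringBootApplication
@EnableEurekaServer   // service registry: microservices register here and look up other services
class RegistryApplication {
    public static void main(String[] args) { SpringApplication.run(RegistryApplication.class, args); }
}

@SpringBootApplication
@EnableConfigServer   // serves the configuration files versioned in the Git repository
class ConfigServerApplication {
    public static void main(String[] args) { SpringApplication.run(ConfigServerApplication.class, args); }
}

@SpringBootApplication
@EnableZuulProxy      // API Gateway: forwards incoming requests to the functional microservices
class GatewayApplication {
    public static void main(String[] args) { SpringApplication.run(GatewayApplication.class, args); }
}
```

A functional microservice would then typically be annotated with @EnableDiscoveryClient and point its bootstrap configuration to the Config Server so that its settings are fetched at startup.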
References
1. Acceleo. https://www.eclipse.org/acceleo/
2. Eclipse papyrus. https://www.eclipse.org/papyrus/
3. Iot-a: internet of things architecture. https://portal.effra.eu/project/1470
4. Object management group (omg). https://www.omg.org/
5. OMG: object management group MDA (Model Driven Architecture) Guide Version
1.0.1. http://www.omg.org/mda/ (2001)
6. Aguilar-Calderón, J.A., Tripp-Barba, C., Zaldı́var-Colado, A., Aguilar-Calderón, P.A.: Requirements engineering for internet of things (IoT) software systems development: a systematic mapping study. Appl. Sci. 12(15), 7582 (2022)
7. Bahga, A., Madisetti, V.: Internet of things: a hands-on approach, chap. 5, pp.
99–115. Bahga and Madisetti (2014)
8. Bassi, A., et al.: Enabling Things to Talk: Designing IoT Solutions with the IoT
Architectural Reference Model. 1st edn. Springer, Berlin (2013). https://doi.org/
10.1007/978-3-642-40403-0
9. Corradini, F., Fedeli, A., Fornari, F., Polini, A., Re, B.: FloWare: an approach for
IoT support and application development. In: Augusto, A., Gill, A., Nurcan, S.,
Reinhartz-Berger, I., Schmidt, R., Zdravkovic, J. (eds.) BPMDS/EMMSAD -2021.
LNBIP, vol. 421, pp. 350–365. Springer, Cham (2021). https://doi.org/10.1007/
978-3-030-79186-5 23
10. Cortellessa, V., Pompeo, D.D., Eramo, R., Tucci, M.: A model-driven approach for
continuous performance engineering in microservice-based systems. J. Syst. Softw.
183, 111084 (2022). https://doi.org/10.1016/j.jss.2021.111084
11. Costa, B., Pires, P., Delicato, F.: Modeling IoT applications with SysML4IoT. In: 42nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 157–164 (2016)
12. Debbabi, M., Hassaı̈ne, F., Jarraya, Y., Soeanu, A., Alawneh, L.: Verification and
Validation in Systems Engineering. Springer, Berlin (2010). https://doi.org/10.
1007/978-3-642-15228-3
13. Delsing, J., Kulcsár, G., Haugen, Ø.: SysML modeling of service-oriented system-
of-systems. Innov. Syst. Softw. Eng. (2022). https://doi.org/10.1007/s11334-022-
00455-5
14. Dragoni, N., et al.: Microservices: yesterday, today, and tomorrow. In: Present and
Ulterior Software Engineering, pp. 195–216. Springer, Cham (2017). https://doi.
org/10.1007/978-3-319-67425-4 12
15. Rademacher, F., Sorgalla, J., Wizenty, P., Sachweh, S., Zündorf, A.: Microservice architecture and model-driven development: yet singles, soon married (?). In: Proceedings of the 19th International Conference on Agile Software Development: Companion, No. 23, p. 5. ACM, New York, USA (2018)
16. Giannelli, C., Picone, M.: Editorial industrial IoT as it and OT convergence: chal-
lenges and opportunities. IoT 3(1), 259–261 (2022)
17. Harrand, N., Fleurey, F., Morin, B., Husa, K.E.: ThingML: a language and code
generation framework for heterogeneous targets. In: Proceedings of the 19th Inter-
national Conference on Model Driven Engineering Languages and Systems, pp.
125–135. ACM (2016)
18. Holt, J., Perry, S.: SysML for Systems Engineering. 2nd edn. The Institution of
Engineering and Technology, London (2013)
19. Hussein, M., Li, S., Radermacher, A.: Model-driven development of adaptive IoT
systems. In: MoDELS (2017)
20. Nadareishvili, I., Mitra, R., McLarty, M., Amundsen, M.: Microservice Architecture, 1st edn. O'Reilly Media, Sebastopol (2016)
21. Ihirwe, F., Ruscio, D.D., Mazzini, S., Pierantonio, A.: Towards a modeling and
analysis environment for industrial IoT systems. In: Iovino, L., Kristensen, L.M.
(eds.) STAF 2021 Software Technologies: Applications and Foundations. CEUR
Workshop Proceedings, vol. 2999, pp. 90–104. CEUR-WS.org (2021). http://ceur-
ws.org/Vol-2999/messpaper1.pdf
22. Newman, S.: Building Microservices. O’Reilly Media, Sebastopol (2015)
23. Sethi, P., Sarangi, S.: Internet of things: architectures, protocols, and applications.
J. Electr. Comput. Eng. 1, 1–25 (2017)
24. Picek, R., Strahonja, V.: Model driven development - future or failure of software
development? (2007)
25. da Silva, A.R.: Model-driven engineering: a survey supported by the unified con-
ceptual model. Comput. Lang. Syst. Struct. 43, 139–155 (2015)
26. Stahl, T., Voelter, M., Czarnecki, K.: Model-Driven Software Development: Technology, Engineering, Management. Wiley, Hoboken (2006)
Database Systems
Parallel Skyline Query Processing
of Massive Incomplete
Activity-Trajectories Data
Abstract. The big spatio-temporal data captured by modern technology produce massive amounts of trajectory data collected from GPS devices. Many researchers have proposed top-k queries that use distance and text parameters for processing. However, the information related to the text parameter, such as the activity, is often missing for reasons such as the lack of an internet connection. Furthermore, with a massive amount of keyword-annotated activity-trajectories, users may enter the wrong activity when searching for an activity-trajectory. Therefore, it is hard to return the desired results based on exact keyword matching. Our previous work proposed an efficient algorithm to handle the fuzzy trajectory problem based on edit distance and activity weights. However, that algorithm does not work with incomplete Trajectory DataBases (TDBs). Therefore, the present investigation focuses on handling the trajectory skyline problem based on distance and frequent activities in incomplete TDBs. To accelerate query processing, the massive set of trajectory objects is managed through a Distributed Mining Trajectory R-Tree index (DMTR-Tree) based on R-tree indexes and inverted lists. Afterward, an efficient algorithm is developed to handle the query. For rapid computation, the Apache Spark cluster-computing framework with the MapReduce model is used. Theoretical analysis and experimental results agree well and both attest to the high efficiency of the proposed algorithm.
1 Introduction
Nowadays, more and more spatio-temporal data are produced by new sensors, smartphones, and other devices which, embedded with GPS (Global Positioning System) receivers, produce huge volumes of trajectory data. To discover knowledge and support decision-making, these moving objects are usually stored and archived in TraJectory DataBases (TDBs) for in-depth analysis and processing [1]. Several application domains, such as animal breeding, traffic monitoring, etc.,
exploit these massive TDBs. In particular, TDBs have been successfully applied to the tourism and marketing application domains, since end-users are usually equipped with smartphones or GPS-equipped vehicles, which are able to track the movements of these people at detailed spatial and temporal scales. The analysis of these human activities, which provide semantic trajectory data, can then be used to improve the quality of the offered services. Indeed, end-users can manually add to the trajectory data some description of their activities, like shopping, working, etc., when they arrive at Point of Interest (POI) locations. This crowdsourcing approach, allowing people to comment on and edit POIs, is very common and is adopted by commercial enterprises (such as Google) and free platforms like Wikipedia. Although these GPS datasets become more and more massive, data concerning the end-users' activities is not always present and is not associated with the GPS points, for different reasons: lack of internet connection, unwillingness to publish the data for privacy reasons, lack of time, etc. Such kinds of applications using POIs can also be translated to other contexts where the POI is a location with a particular relevance, such as drinking points in breeding applications, energy recharge points for agricultural vehicles and cars, etc. This makes the proposal of this paper open to application domains different from the tourism one described in the rest of the paper.
A common and widely used query type on these trajectory datasets annotated with activities consists in finding the frequent shortest activity-trajectory near the user location that includes a set of preferred activities. For example, “What is the shortest activity-trajectory to reach the Eiffel tower from La Bastille (in Paris) and do some shopping at the same time?” For such a challenging query, frequent activity-trajectory mining algorithms used for top-k queries are not trivial. Indeed, efficient similarity measures among activity-trajectories are necessary. Moreover, the huge volume of this data requires efficient computation and storage methods such as distributed indexes. Finally, the lack of some activity data must also be taken into account. Furthermore, due to reasons such as the poor adaptation of keyboards to tactile tablets, users may make mistakes while inputting their activity queries. Thus, it is hard to return the desired activity-trajectory based on exact keyword matching. Indeed, approximate keyword search has to be considered.
Some efforts have addressed top-k trajectory similarity with activities, like [2], and others have used indexing methods like a hybrid grid index [3]. However, to the best of our knowledge, processing the top-k frequent activity-trajectory query with missing data has not been researched so far. Therefore, our work aims to process the top-k query based on the similarity of the spatial distance and the full activity information, taking into account the problems of missing data and approximate activity keyword search. In particular, to solve the incomplete information problem in top-k queries, we use an optimized method based on skyline queries. A skyline query is a database query in which the skyline operator answers an optimization problem: it filters the results and keeps only those objects that are not worse than any other. In our context, the skyline query is based on both spatial and textual (i.e., activities) dimensions.
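To make the dominance relation concrete, the following sketch (an illustration, not the paper's code; field names are hypothetical) checks whether one candidate dominates another when a smaller distance to the query and a larger number of matched activities are preferred, and filters a candidate list accordingly.

```java
// Illustrative dominance test for a two-dimensional skyline (distance to the query,
// number of matched activities). A candidate is kept only if no other candidate dominates it.
final class Candidate {
    final double distance;        // smaller is better
    final int matchedActivities;  // larger is better
    Candidate(double distance, int matchedActivities) {
        this.distance = distance;
        this.matchedActivities = matchedActivities;
    }
}

final class Skyline {
    /** True if a dominates b: a is at least as good in both dimensions and strictly better in one. */
    static boolean dominates(Candidate a, Candidate b) {
        boolean noWorse = a.distance <= b.distance && a.matchedActivities >= b.matchedActivities;
        boolean strictlyBetter = a.distance < b.distance || a.matchedActivities > b.matchedActivities;
        return noWorse && strictlyBetter;
    }

    static java.util.List<Candidate> skyline(java.util.List<Candidate> all) {
        java.util.List<Candidate> result = new java.util.ArrayList<>();
        for (Candidate c : all) {
            boolean dominated = false;
            for (Candidate other : all) {
                if (other != c && dominates(other, c)) { dominated = true; break; }
            }
            if (!dominated) result.add(c);
        }
        return result;
    }
}
```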
Furthermore, our solution also handles the fuzzy matching problem for approximate activities in POIs using edit distance and activity weights [4]. To
answer the skyline query mentioned above, which returns the activity-trajectory that is not dominated by any other based on the massive historical trajectory data, it is hard to process the query without an efficient storage method as well as a distributed parallel computing approach. Therefore, we use our previous method to organize the massive trajectory data into the Distributed Mining Trajectory R-Tree index (DMTR-Tree) [5,6], which is based on distributed R-tree indexes and inverted lists. Using data aggregation through distributed parallel operations, we have developed a new algorithm for skyline query processing. This algorithm runs on a distributed cluster based on Spark and the MapReduce model to accelerate the processing of large trajectory data. The contributions of our paper are:
The remainder of our paper is organized as follows: Sect. 2 presents the related work. Section 3 presents an overview of our index. Section 4 introduces a set of functions used for query processing. Section 5 explains the proposed query algorithm. Section 6 describes our experimental studies.
2 Related Work
Recently, many researchers have widely studied top-k query processing on trajectory data [3,7–9]. In [10] the authors integrated social activity data into the spatial keyword query of semantic trajectories. Furthermore, to process top-k trajectory queries, an efficient indexing method is necessary. Many studies have proposed to index massive data through spatial access methods like the R-Tree index [11], which is well suited to indexing spatially moving objects. However, as data rapidly increases, indexing methods based on a single node become insufficient. Therefore, some researchers have proposed distributed indexes to handle the limitations of centralized methods based on clusters [12,13], where the authors have used HBase. To apply MapReduce to index massive spatial data, an efficient framework called SpatialHadoop was developed in [14]. In order to overcome the limitations of MapReduce, GeoSpark [12] has been proposed to support spatial access methods.
Furthermore, the distance between semantic trajectories has to be considered. Thus, it is hard to process a top-k query based on both approximate semantic keywords and distance. Such a problem can be solved using skyline queries. The authors in [16] first proposed the skyline query by introducing the skyline operator for a relational database system based on the B-Tree index. In trajectory data processing, the authors in [17] addressed the problem of trajectory planning by exploring the skyline concept. [18] proposed an efficient algorithm to retrieve stochastic skyline routes for a given source-destination pair and a start time. Recently,
[19] applied the skyline query method to a personalized travel route recommendation scheme based on mining collected check-in data. [20] combined the skyline query and the top-k method to recommend travel routes that cover different landscape categories and meet user preferences. Despite the important contributions made by the aforementioned works, none of them takes into consideration missing semantic trajectory data for skyline query processing.
F(q, T) = 1 - \frac{\sum_{i=1}^{n} d(o_i.T,\ o_{i+1}.T) + M(T_i, T_j) + d(o_n.T,\ D.q)}{N(L.q.o_i.T) + \hat{d}}    (1)
search space by using a list to store all keys of the partitions with their MBRs (line 1). This list resides in the master node of the cluster. Then, to select the covered MBRs (lines 2–5), the master node computes the distance between the query points and the MBRs before starting any traversal of the distributed indexes. The search space is pruned based on the distance formula dis(MBR, q) = dis(S, MBR) + dis(MBR, E) (line 6). Moreover, based on trajectory lengths and the partition method, we have classified the activity-trajectories into two classes. The first class comprises the short trajectories (lines 8–9) while the second class comprises the long trajectories (lines 10–11). In the following, we explain the processing of both classes.
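The pruning distance used in line 6 can be sketched as follows. This is an illustrative formulation (not the paper's code) in which S and E denote the start and end points of the query, and dis(p, MBR) is the minimum Euclidean distance from a point to a rectangle.

```java
// Illustrative sketch of the partition-pruning score dis(MBR, q) = dis(S, MBR) + dis(MBR, E).
final class MBR {
    final double minX, minY, maxX, maxY;
    MBR(double minX, double minY, double maxX, double maxY) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }

    /** Minimum Euclidean distance from point (x, y) to this rectangle (0 if the point is inside). */
    double minDistance(double x, double y) {
        double dx = Math.max(0, Math.max(minX - x, x - maxX));
        double dy = Math.max(0, Math.max(minY - y, y - maxY));
        return Math.sqrt(dx * dx + dy * dy);
    }
}

final class PartitionPruner {
    /** Pruning score of a partition MBR for a query with start point S = (sx, sy) and end point E = (ex, ey). */
    static double score(MBR mbr, double sx, double sy, double ex, double ey) {
        return mbr.minDistance(sx, sy) + mbr.minDistance(ex, ey);
    }
}
```

Partitions whose score exceeds the current threshold can be skipped before any index traversal starts.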
In this class, the matching activity-trajectory is short, i.e., the trajectory belongs to one partition and is organized through one R-tree index. We have developed Algorithm 2 to process the short activity-trajectory query. In Algorithm 2, we initially start traversing the index partitions simultaneously. Using a Spark RDD, we read the activity-trajectory data from this tree (line 2). Then, based on this RDD, we apply another RDD filter using a function FILTER to prune the search space and return the activity-trajectory candidates (line 3).

In the pseudo-code of the function FILTER presented in Algorithm 3, the distance σ is computed between the R-tree and the query q (line 3). It aims to select the nodes that should be visited by pruning the search space while traversing the tree (lines 5–9). In the end, we return a list of the nodes which include the activity-trajectory objects (line 9). In the case where the activity-set information is missing in the selected node (line 11), another node with a minimal distance has to be found using the distance function (lines 12–15). This new node must contain activity-set keywords similar to those of q (lines 16–17).
Using the activity function, the SIMILAR function [6] is invoked in line 16. It supports multi-activity similarity based on the edit distance [4]. As a leaf node in the tree may store multiple activities of the same POI, this function aims to return the activity that has the minimum edit distance to the keywords of q. Afterward, the list K is updated (line 17). In the opposite case, i.e., when we have full activities, we simply compare the similarity between the activity-set held in the visited node and q (line 19). Then, we return the new list K (line 20).
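As an illustration of this matching step (not the paper's implementation; method names are ours), the following sketch returns, among the activity keywords stored in a leaf node, the one with the minimum edit (Levenshtein) distance to the query keyword.

```java
// Illustrative sketch: pick the stored activity closest to the query keyword by edit distance.
final class ActivityMatcher {
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    /** Returns the stored activity most similar to the query keyword. */
    static String mostSimilar(java.util.List<String> nodeActivities, String queryKeyword) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String activity : nodeActivities) {
            int dist = editDistance(activity, queryKeyword);
            if (dist < bestDist) { bestDist = dist; best = activity; }
        }
        return best;
    }
}
```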
To return the final result to the user, we apply a data mining algorithm to choose the best top-k activity-trajectories. We use the Apriori algorithm to
calculate the support Sup and the confidence Conf of activities and store them in the inverted lists. This method helps us traverse the inverted list of each activity-trajectory candidate T and extract the Sup and Conf of its activity-set. To collect the T objects (the POIs with activities), we use the Primary-Trajectory function [6] (line 4 in Algorithm 2), which is presented in Algorithm 4. It returns a list D (lines 2–8) containing the collected activity-trajectories. The frequent trajectory is the trajectory whose activity set has the highest Sup and Conf. To extract the frequent activity-trajectory from the activity-trajectory candidates obtained in the previous step, which are stored in the list D, we simply return the trajectory with the best Sup and Conf.
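The selection of the frequent activity-trajectory can be sketched as follows. This is a simplified illustration (not the paper's code): it computes the support of each candidate's activity set over the trajectory database and keeps the candidate with the highest value; the confidence is handled analogously from the Apriori statistics stored in the inverted lists.

```java
// Illustrative sketch of the final selection step based on support of activity sets.
import java.util.List;
import java.util.Set;

final class FrequentSelector {
    /** Fraction of trajectories whose activity set contains all activities in 'itemset'. */
    static double support(Set<String> itemset, List<Set<String>> trajectoryActivitySets) {
        if (trajectoryActivitySets.isEmpty()) return 0.0;
        long matches = trajectoryActivitySets.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) matches / trajectoryActivitySets.size();
    }

    /** Among candidate activity sets, return the one with the highest support. */
    static Set<String> mostFrequent(List<Set<String>> candidates, List<Set<String>> db) {
        Set<String> best = null;
        double bestSup = -1.0;
        for (Set<String> c : candidates) {
            double sup = support(c, db);
            if (sup > bestSup) { bestSup = sup; best = c; }
        }
        return best;
    }
}
```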
6 Experimental Evaluation
We organized a series of experiments to evaluate the performance of the algo-
rithm presented in the previous section. The experiments aim at:
7 Conclusion
In this paper, we investigated a novel problem of skyline queries over massive Trajectory DataBases (TDBs) with incomplete semantic trajectories. We studied the skyline query based on both spatial and textual (i.e., activities) dimensions. In other terms, our skyline approach aims to find the best results based on the spatial distance and the number of activities that compose the activity-trajectories. Further, with a massive amount of activities, such data is often missing for reasons such as the lack of an internet connection. In addition, users may make mistakes while typing the activity text on the system keyboard. Thus, such problems make it hard to return the desired results based on exact keywords.
To handle the problem efficiently, we first re-used distributed indexes to organize the massive activity-trajectory data based on the R-tree index. Then, we developed a parallel activity-trajectory query algorithm based on approximate activity keywords and distance functions. These functions evaluate three points. The first point is the multi-frequent activity, where we used a data-mining algorithm to find the frequent POIs based on their activities. Furthermore, we also process the fuzzy query for approximate activities in POIs using edit distance and activity weights. The second point is the distance measured between the activity-trajectories in the TDB and the query. The third point combines points 1 and 2 to handle the missing-activities problem by finding another similar activity-trajectory close to the query. To achieve scalability and fault tolerance, we used an Apache Spark cluster to implement both the distributed indexes and the query algorithm. The proposed algorithm solves the problem efficiently, and extensive experimental results confirm its efficiency. As future studies, we plan to use a larger, terabyte-scale dataset, add more machines to our cluster, compare our work with existing methods, and handle the temporal dimension in semantic trajectory skyline queries over incomplete TDBs.
References
1. Htet, A.H., Long, G., Kian-Lee, T.: Mining sub-trajectory cliques to find frequent
routes. In: Proceedings of the 13th of ISASTD, Munich, vol. 8098 (2013)
2. Kong, K., et al.: Trajectory query based on trajectory segments with activities. In:
Proceedings of the 3rd ACM SIGSPATIAL ACM, pp. 1–8 (2017)
3. Zheng, K., Shang, S., Yuan, N.J., Yang, Y.: Towards efficient search for activity
trajectories, pp. 230–241 (2013)
4. Li, J., Wang, H., Li, J., Gao, H.: Skyline for geo-textual data. GeoInformatica
20(3), 453–469 (2016). https://doi.org/10.1007/s10707-015-0243-9
5. Belhassena, A., Wang, H.: Trajectory big data processing based on frequent
activity. Tsinghua Sci. Technol. 24, 317–332 (2019)
6. Belhassena, A., Wang, H.: Distributed skyline trajectory query processing. In: Pro-
ceedings of the ACM Turing 50th Celebration Conference, Shanghai (2017)
7. Chen, M., Wang, N., Lin, G., Shang, J.S.: Network-based trajectory search over
time intervals. Big Data Res. 100221 (2021)
8. Rocha Junior, J.B., Nørvåg, K.: Top-k spatial keyword queries on road networks,
USA, pp. 168–179 (2012)
9. Han, Y., Wang, L., Zhang, Y., Zhang, W., Lin, X.: Spatial keyword range search on
trajectories. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M.A. (eds.) DASFAA
2015. LNCS, vol. 9050, pp. 223–240. Springer, Cham (2015). https://doi.org/10.
1007/978-3-319-18123-3 14
10. Cao, K., Sun, Q., Liu, H., Liu, Y., Meng, G., Guo, J.: Social space keyword query
based on semantic trajectory. Neurocomputing 428, 340–351 (2021)
11. Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Pro-
ceedings of ACM SIGMOD, vol. 14, pp. 47–57 (1984)
12. Yu, J., Wu, J., Sarwat, M.: GeoSpark: a cluster computing framework for process-
ing large-scale spatial data. In: Proceedings of the ACM SIGSPATIAL GIS, USA
(2015)
13. Wang, L., Chen, B., Liu, Y.: Distributed storage and index of vector spatial data
based on h-base. In: Proceedings of Geoinformatics, pp. 1–5 (2013)
14. Eldawy, A., Mokbel, M.F.: A demonstration of spatialhadoop: an efficient mapre-
duce framework for spatial data. In: Proceedings of the VLDB, vol. 6, pp. 1230–
1233 (2013)
15. Li, G., Deng, D., Feng, J.: A partition-based method for string similarity joins with
edit-distance constraints. ACM Trans. Database Syst. 38, 1–33 (2013)
16. Borzsony, S., Kossmann, D., Stocker, K.: The skyline operator. In: Proceedings of
the 17th ICDE, pp. 421–430. IEEE (2001)
17. Hsu, W.T., Wen, Y.T., Wei, L.Y., Peng, W.C.: Skyline travel routes: exploring
skyline for trip planning. In: Proceedings of the 15th ICMDM, vol. 2, pp. 31–36.
IEEE (2014)
18. Yang, B., Guo, C., Jensen, C.S., Kaul, M., Shang, S.: Stochastic skyline route
planning under time-varying uncertainty. In: Proceedings of the 30th ICDE, pp.
136–147 (2014)
19. Jiang, B., Du, X.: Personalized travel route recommendation with skyline query.
In: Proceedings of the 9th DESSERT, pp. 549–554. IEEE (2018)
20. Ke, C.-K., Lai, S.-C., Chen, C.-Y., Huang, L.-T.: Travel route recommendation via
location-based social network and skyline query. In: Hung, J.C., Yen, N.Y., Chang,
J.-W. (eds.) FC 2019. LNEE, vol. 551, pp. 113–121. Springer, Singapore (2020).
https://doi.org/10.1007/978-981-15-3250-4 14
21. Wang, H., Belhassena, A.: Parallel trajectory search based on distributed index.
Inf. Sci. 388, 62–83 (2017)
22. Ju, H., Ju, F., Guoliang, L., Shanshan, C.: Top-k fuzzy spatial keyword search.
Chin. J. Comput. 35(11), 2237–2246 (2012). (in Chinese)
Compact Data Structures for Efficient
Processing of Distance-Based Join
Queries
1 Introduction
The efficient storage and management of large datasets has been a research topic
for decades. Spatial databases are an example of such datasets. Some of the meth-
ods used to efficiently manage and query them include distributed algorithms,
streaming algorithms, or efficient secondary storage management [9], frequently
accompanied by the use of indexes such as R*-trees to speed up queries.
– The execution of a set of experiments using large real-world datasets for exam-
ining the efficiency and the scalability of the proposed strategy, considering
performance parameters and measures.
In this section, we review some basic concepts about DJQs and the k 2 -tree
compact data structure, as well as a brief survey of the most representative
contributions in both fields in the context of spatial query processing.
The εDJQ can be considered as an extension of the KCPQ, where the dis-
tance threshold of the pairs (ε) is known beforehand and the processing strategy
(e.g., plane-sweep technique) can be the same as in the KCPQ for generating
the candidate pairs of the final result.
If both P and Q are non-indexed, the KCPQ between two point sets that
reside in main-memory can be solved using plane-sweep-based algorithms [10].
The Classic plane-sweep algorithm for KCPQ consists of two steps: (1) sorting the entries of the two point sets based on the coordinates of one of the axes (e.g., X), and (2) combining the reference point of one set with all the comparison points of the other set whose distance on the X-axis is less than δ (the distance of the K-th closest pair found so far), and choosing those pairs whose
point distance is smaller than δ. A faster variant called Reverse-Run plane-sweep
algorithm is based on the concept of run (a continuous sequence of points of the
same set that does not contain any point from the other set) and the reverse
order of processing of the comparison points with respect to the reference point.
To reduce the search space around the reference point, three methods are applied in these two plane-sweep algorithms: Sliding Strip (δ on the X-axis), Sliding Window, and Sliding Semi-Circle. These DJQs have recently been designed and implemented in SpatialHadoop and LocationSpark, which are Hadoop-based and
Spark-based distributed spatial data management systems (Big Spatial Data
context) [5], respectively.
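For illustration (this is not the code evaluated later in the paper), the following sketch shows the Classic plane-sweep idea with a sliding strip: both point sets are sorted by X and, for each reference point, only comparison points within δ on the X-axis are examined, where δ is the distance of the K-th closest pair found so far.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class PlaneSweepKCPQ {
    record Point(double x, double y) {}
    record Pair(Point p, Point q, double dist) {}

    static List<Pair> kClosestPairs(List<Point> pSet, List<Point> qSet, int k) {
        List<Point> ps = new ArrayList<>(pSet);
        List<Point> qs = new ArrayList<>(qSet);
        ps.sort(Comparator.comparingDouble(Point::x));
        qs.sort(Comparator.comparingDouble(Point::x));
        // Max-heap on distance keeps the K best pairs found so far; its top gives delta.
        PriorityQueue<Pair> best = new PriorityQueue<>((a, b) -> Double.compare(b.dist(), a.dist()));
        for (Point p : ps) {
            double delta = best.size() < k ? Double.POSITIVE_INFINITY : best.peek().dist();
            for (Point q : qs) {
                if (q.x() < p.x() - delta) continue;  // left of the strip around the reference point
                if (q.x() > p.x() + delta) break;     // beyond the strip: later points are even farther in X
                double d = Math.hypot(p.x() - q.x(), p.y() - q.y());
                if (best.size() < k) best.add(new Pair(p, q, d));
                else if (d < best.peek().dist()) { best.poll(); best.add(new Pair(p, q, d)); }
                delta = best.size() < k ? Double.POSITIVE_INFINITY : best.peek().dist();
            }
        }
        List<Pair> result = new ArrayList<>(best);
        result.sort(Comparator.comparingDouble(Pair::dist));
        return result;
    }
}
```

The Reverse-Run and semi-circle variants refine which comparison points are examined, but the role of δ as a shrinking pruning threshold is the same.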
The problem of DJQs has also received research attention by the spatial
database community in scenarios where at least one of the datasets is indexed.
If both P and Q are indexed using R-Trees, the concept of synchronous tree
traversal and Depth-First (DF) or Best-First (BF) traversal order can be com-
bined for the query processing [3]. In [7], an extensive experimental study com-
paring the R*-tree and Quadtree-like index structures for DJQs together with
index construction methods was presented. In the case that only one dataset
is indexed, in [6] an algorithm is proposed for KCPQ, whose main idea is to
partition the space occupied by the dataset without an index into several cells
or subspaces (according to a grid-based data structure) and to make use of the
properties of a set of distance functions defined between two MBRs [3].
2.2 k²-tree
A k²-tree [1] is a compact data structure used to store and query a binary matrix that can represent a graph or a set of points in discretized space. Figure 1 shows in (a) a set of points in a 2D discrete space, and its direct translation into a binary matrix in (b). For the k²-tree representation, choosing k = 2, (c) is the
conceptual tree and (d) the actual bitmaps that are stored. T represents the “tree” part (non-leaf nodes), and L the leaves of the conceptual tree.
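As a didactic sketch of how these bitmaps are used (not the SDSL-based implementation evaluated later), the following code tests whether a cell of the represented matrix is set. The children of an internal node at position p in T start at position rank1(T, p)·k², where rank1 counts the 1s in T up to and including p; a linear-time rank is used here for clarity instead of a succinct rank structure.

```java
final class K2Tree {
    private final boolean[] T;   // concatenated bits of all internal levels
    private final boolean[] L;   // bits of the leaf level
    private final int k;         // subdivision factor (k = 2 in Fig. 1)
    private final int n;         // matrix width, a power of k

    K2Tree(boolean[] T, boolean[] L, int k, int n) { this.T = T; this.L = L; this.k = k; this.n = n; }

    /** Returns true if cell (row, col) of the represented binary matrix is 1. */
    boolean contains(int row, int col) {
        return check(n, row, col, -1);                 // -1 denotes the virtual root
    }

    private boolean check(int size, int row, int col, int pos) {
        if (pos >= T.length) return L[pos - T.length]; // bit stored in L: an individual cell
        if (pos != -1 && !T[pos]) return false;        // empty submatrix: every cell inside is 0
        int childSize = size / k;
        int childIndex = (row / childSize) * k + (col / childSize);
        int childPos = (pos == -1 ? 0 : rank1(pos) * k * k) + childIndex;
        return check(childSize, row % childSize, col % childSize, childPos);
    }

    // Number of 1s in T[0..pos]; linear here for clarity, constant time in succinct implementations.
    private int rank1(int pos) {
        int count = 0;
        for (int i = 0; i <= pos; i++) if (T[i]) count++;
        return count;
    }
}
```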
(Figure 1 shows the example point set, its 8×8 binary matrix, the conceptual k²-tree for k = 2, and the stored bitmaps T = 1101 1111 1001 1010 and L = 1010 0010 0100 0111 1010 1100 1110 0001.)
Fig. 1. A 2D-space model with its binary matrix and k²-tree representations
synthetic dataset and for real data the combination was 76451 × 20480 and
4499454 × 196902. Moreover, the total response time of their KCPQ experiments is questionable because their implementation of the (Classic) plane-sweep-based KCPQ algorithm needed hours to solve the query, when it should be answered in a matter of milliseconds. These surprising performance results also make their KCPQ implementations on k²-trees and the corresponding results questionable.
that their implementation, basically of the same algorithm, is in Java and it is
not publicly available, and the high-level pseudo-codes provided in the journal
paper omit low-level details that are key to performance. This makes it difficult
to accurately reproduce their results from the available information. Finally, to
the best of our knowledge, εDJQ has never been tackled using k 2 -trees. Here we
present efficient implementations of KCPQ and εDJQ to show the interest of the
strategy of using k 2 -trees to represent spatial data. Our algorithms were coded
in C++ and are available to the community, and they were tested with large
real-world datasets, comparing them with Classic and Reverse-Run plane-sweep
DJQs on main memory.
We describe in this section the new algorithms for KCPQ (Algorithm 1) and
εDJQ (Algorithm 2) using k 2 -trees. For all the distance calculations, we have
used the Euclidean distance.
The input for Algorithm 1 consists of two matrices A and B, stored as k²-trees (they correspond to the P and Q datasets in the definitions of Sect. 2.1), and the number of pairs of closest points (although we use the name NumPairs instead of K in the pseudocode to avoid confusion with the k parameter of k²-trees). We denote by A[i] the i-th bit value of the bitmap for A. A.lastLevel is the last level of the tree, corresponding to its leaves. Levels range from 0 to lg(n) − 1, where n is the width of the original matrix.
The following data structures are used by Algorithm 1:
– PQueue is a min priority queue ordered by distance (that is, pairs with lower minimum distance come first). It uses the standard methods isEmpty(), enqueue() and dequeue().
– An ordered list OutList with capacity for NumPairs elements (we actually use a max binary heap to manage these elements), each one storing a pair of
points (one coming from each input matrix), and the distance between them.
The elements (pairs of points) are sorted according to their distance. It uses
the methods: length(), maxDist() and insert().
– MinDist(pA,pB,size), shown in Algorithm 3, obtains the minimum possible
(Euclidean) distance between points of the matrices A and B that have their
origins in pA and pB and are squares of size × size.
Note that the pairs of (A,B) matrices that are generated in Algorithm 1
have a special property: their origin coordinates are always multiples of size.
This allows us to compute the minimum distance more efficiently, but it would
not work to get the minimum distance between two matrices in general.
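One possible formulation of these operations (our reading of MinDist and MaxDist, not necessarily the authors' exact code) treats each submatrix as the set of integer cells [origin, origin + size − 1] on each axis:

```java
// Illustrative MinDist/MaxDist between two size x size submatrices whose origins are multiples of size.
final class SubmatrixDistances {
    /** Minimum Euclidean distance between any cell of block A and any cell of block B. */
    static double minDist(int ax, int ay, int bx, int by, int size) {
        double dx = axisGap(ax, bx, size);
        double dy = axisGap(ay, by, size);
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** Maximum Euclidean distance between any cell of block A and any cell of block B. */
    static double maxDist(int ax, int ay, int bx, int by, int size) {
        double dx = Math.abs(ax - bx) + (size - 1);
        double dy = Math.abs(ay - by) + (size - 1);
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Gap between the cell ranges [a, a+size-1] and [b, b+size-1]; 0 if they coincide.
    private static double axisGap(int a, int b, int size) {
        int diff = Math.abs(a - b);
        return diff <= size - 1 ? 0 : diff - (size - 1);
    }
}
```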
The idea behind Algorithm 1 is to recursively partition the matrices A and B
into k 2 submatrices each and compare each possible pair of submatrices (down
to when they are not actually submatrices but really individual cells or points).
One of the strong points of this algorithm is that, at some point, we can stop
without processing all the remaining pairs of submatrices. This happens when
the required number of closest pairs has already been obtained, and the largest
distance between them is not larger than the minimum possible distance between
the pairs of submatrices not yet processed.
The algorithm follows a Best-First (BF) traversal. It starts by enqueuing the
whole matrices (which correspond to the level 0 of the k 2 -tree and have 0 as
the minimum possible distance between them). The output list OutList is also
initialized, with room for at most N umP airs elements.
Then, the priority queue is processed until it is empty, or the stop criteria are
reached. Lines 6 − 8 check if the output list already has the target N umP airs
elements. If so, and the minimum distance of the current pair of matrices is
at least the maximum distance in OutList, we can be sure that the current
and remaining matrices can be safely discarded, and the algorithm returns the
current output list.
Otherwise, the current matrices are partitioned (lines 11−20), but only if they have children (which is tested by directly using the k²-tree bitmaps, in line 12 for matrix A and line 15 for matrix B). For each pair of child submatrices,
if they are in the last level of the k 2 -tree then they are actually points. So, if
there is room in OutList or its maximum distance is greater than the distance
between the current pair of points, then the pair and its distance are inserted in
order in OutList (lines 21 − 25). Recall that the insert operation may need to
remove the element with the largest distance if the output list already contains
NumPairs elements.
If the submatrices are not in the last level of the k 2 -tree, and if they meet
the conditions to contain candidate pairs of points (OutList is not full or the
minimum distance between the matrices is less than the maximum distance in
OutList) they are enqueued in the priority queue (lines 28 − 30).
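The overall Best-First traversal can be sketched as follows. For readability, a plain boolean matrix stands in for each k²-tree: the test "is this size × size block non-empty?" is exactly what the compressed bitmaps answer through rank operations in the real structure, so this is an illustration of the traversal and the early-exit condition rather than the authors' implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class BestFirstKCPQ {
    record Entry(int ax, int ay, int bx, int by, int size, double minDist) {}
    record ResultPair(int ax, int ay, int bx, int by, double dist) {}

    static List<ResultPair> kClosestPairs(boolean[][] A, boolean[][] B, int n, int k, int numPairs) {
        PriorityQueue<Entry> queue = new PriorityQueue<>(Comparator.comparingDouble(Entry::minDist));
        PriorityQueue<ResultPair> out = new PriorityQueue<>((a, b) -> Double.compare(b.dist(), a.dist()));
        queue.add(new Entry(0, 0, 0, 0, n, 0.0));
        while (!queue.isEmpty()) {
            Entry e = queue.poll();
            if (out.size() == numPairs && e.minDist() >= out.peek().dist()) break; // early exit
            int child = e.size() / k;
            for (int ai = 0; ai < k * k; ai++) {
                int ax = e.ax() + (ai / k) * child, ay = e.ay() + (ai % k) * child;
                if (isEmpty(A, ax, ay, child)) continue;       // no points of A in this block
                for (int bi = 0; bi < k * k; bi++) {
                    int bx = e.bx() + (bi / k) * child, by = e.by() + (bi % k) * child;
                    if (isEmpty(B, bx, by, child)) continue;   // no points of B in this block
                    double d = minDist(ax, ay, bx, by, child);
                    if (child == 1) {                          // leaf: an actual pair of points
                        if (out.size() < numPairs) out.add(new ResultPair(ax, ay, bx, by, d));
                        else if (d < out.peek().dist()) { out.poll(); out.add(new ResultPair(ax, ay, bx, by, d)); }
                    } else if (out.size() < numPairs || d < out.peek().dist()) {
                        queue.add(new Entry(ax, ay, bx, by, child, d));  // candidate pair of submatrices
                    }
                }
            }
        }
        List<ResultPair> result = new ArrayList<>(out);
        result.sort(Comparator.comparingDouble(ResultPair::dist));
        return result;
    }

    // Minimum distance between two aligned size x size blocks of cells (0 if they coincide).
    private static double minDist(int ax, int ay, int bx, int by, int size) {
        double dx = gap(ax, bx, size), dy = gap(ay, by, size);
        return Math.sqrt(dx * dx + dy * dy);
    }
    private static double gap(int a, int b, int size) {
        int diff = Math.abs(a - b);
        return diff <= size - 1 ? 0 : diff - (size - 1);
    }

    // Stand-in for the bitmap test: true if the block contains no point.
    private static boolean isEmpty(boolean[][] M, int x, int y, int size) {
        for (int i = x; i < x + size; i++)
            for (int j = y; j < y + size; j++)
                if (M[i][j]) return false;
        return true;
    }
}
```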
Algorithm 2 (εDJQ) uses the same scheme as the previous one, but with some
key differences. The input consists now of the two k 2 -trees A and B, plus the
distance threshold ε. Since the algorithm does not limit the number of output
pairs, OutList is now an unlimited-size, unordered list. For the same reason,
the algorithm does not have an “early exit”, and it exits only after the priority
queue is empty. Additionally, each element in the priority queue stores not only
the minimum possible distance between the matrices, but also the maximum
possible distance, computed by the function M axDist (shown in comments in
the
√ pseudocode of Algorithm 3). The initial M axDist for the whole matrices is
2n, where n is the width of each matrix.
The partitioning is done in the same way, but for every pair of child submatrices the process is different (a compact sketch of this decision follows the list below):
– At the leaf level of the k²-trees (lines 18−21), the pair of nodes is inserted in OutList if the distance between them is at most ε.
– If the maximum distance (MaxDist) between the two matrices is at most ε, then all the combinations of points between the two matrices meet the criteria. We use the rangeQuery operation of the k²-trees to get the points and insert all possible pairs into OutList (lines 23−31).
– Otherwise, if the minimum distance is at most ε, we enqueue the submatrices with the minimum and maximum distances between them (lines 32−33).
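The decision taken for each pair of child submatrices can be summarized by the following compact sketch (an illustration, not the paper's code):

```java
// Per-pair decision in the epsilon-DJQ traversal, mirroring the three cases above.
final class EpsilonDJQDecision {
    enum Action { REPORT_ALL_PAIRS, ENQUEUE_FOR_REFINEMENT, DISCARD }

    static Action decide(double minDist, double maxDist, double epsilon, boolean leafLevel) {
        if (leafLevel) {
            // At the leaf level, minDist == maxDist == the distance between the two points.
            return minDist <= epsilon ? Action.REPORT_ALL_PAIRS : Action.DISCARD;
        }
        if (maxDist <= epsilon) return Action.REPORT_ALL_PAIRS;       // every combination qualifies: rangeQuery both sides
        if (minDist <= epsilon) return Action.ENQUEUE_FOR_REFINEMENT; // some pairs may qualify: partition further
        return Action.DISCARD;                                        // no pair can qualify
    }
}
```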
4 Experimental Results
We have tested our DJQ algorithms using the following real-world 2D point
datasets, obtained from OpenStreetMap1 : LAKES (L), that contains bound-
aries of water areas (polygons); PARKS (P), that contains boundaries of parks
or green areas (polygons); ROADS (R), which contains roads and streets around
the world (line-strings); and BUILDINGS (B), which contains boundaries of all
buildings (polygons). For each source dataset, we take all the points extracted
from the geometries of each line-string to build a large point dataset. Addition-
ally, we round coordinates to 6 decimal positions, in order to be able to transform
these values to k 2 -tree coordinates in a consistent manner. Table 1 summarizes
the characteristics of the original datasets and the generated point sets obtained
from them. Note that all the datasets represent worldwide data, and points are
stored as (longitude, latitude) pairs.
The main performance measures that we have used in our experiments are
the space required by the data structure vs. the plain representation, and the
total execution time to run a given DJQ. We measure elapsed time, and only
consider the time necessary to run the query algorithm. This means that we
ignore time necessary to load the files, as well as time required to sort the points
for the plane-sweep algorithms.
All experiments were executed on an HP ProLiant DL380p Gen8 server with
two 6-core Intel Xeon CPU E5-2643 v2 @ 3.50 GHz processors with 256 GiB
RAM (Registered @1600 MHz), running Oracle Linux Server 7.9 with kernel
Linux 4.14.35 (64bits). Our algorithms were coded in C++ and are publicly
available2 . For the k 2 -tree algorithms, the SDSL-Lite3 library was used.
First, we build the k²-tree for each dataset. We use the simplest variant of the k²-tree with no optimizations. In order to insert the points in the k²-tree, their coordinates are first transformed into non-negative integer k²-tree coordinates; in this way, the points fit into a binary matrix with 360 million rows and 160 million
columns, that is finally stored as a k 2 -tree.
Table 2 shows the space required by the k 2 -tree representation of each dataset.
We display as a reference the plain size of the dataset, as well as a “binary size”
estimated considering that each coordinate can be represented using two 32-bit
words. Note that each coordinate component can be stored using 28–29 bits for
our datasets, but this would make data access slower, so we consider 32 bits to
be the minimum cost for a reasonable plane-sweep algorithm that works with
uncompressed data. We also display the compression ratio of the k 2 -tree relative
to the binary input size. Results show that the k 2 -tree representation is able
to efficiently represent the collection, and the compression obtained improves
with the size of the dataset. Notice also that the k 2 -tree version we use in our
experiments does not include any of the existing optimizations for the k 2 -tree to
improve compression.
We compared the performance of our KCPQ algorithm with 4 different imple-
mentations based on plane-sweep: two implementations of Classic plane-sweep,
with Sliding Strip (PS-CS) and with Sliding Semi-Circle (PS-CC) respectively,
and the equivalent implementations of Reverse-Run, with Sliding Strip (PS-
RRS) and Sliding Semi-Circle (PS-RRC). We performed our experiments check-
ing all the pairwise combinations of our datasets. Due to space constraints,
we display only the results for some combinations, denoted LxP, LxB, PxR,
RxB, and PxB. The remaining combinations yielded similar comparison results.
For each combination of datasets, we run the KCPQ algorithm for varying K ∈ {1, 10, 10², 10³, 10⁴, 10⁵}.
Figure 2 displays the query times obtained by our algorithm and the four
variants of plane-sweep studied. The first five plots display the results for all
variants for 5 different dataset combinations. Results clearly show that the Clas-
sic variants (PS-CS and PS-CC) are much slower than the other alternatives in
all cases (as in [10]). Therefore, we will focus on the comparison between our
proposal and the Reverse-Run variants that are competitive with it.
The point datasets used have a significantly different amount of points, and
correspond to different features, which leads to very different query times among
the plots in Fig. 2. However, results show that our algorithm always achieves the
best query times for large values of K. Particularly, for K = 10⁵, our algorithm
(Figure 2 consists of six plots of query time versus K: five dataset combinations, namely LxP, LxB, PxR, RxB and PxB, comparing the k²-tree algorithm with PS-CS, PS-CC, PS-RRS and PS-RRC, plus a zoomed PxB plot restricted to the k²-tree, PS-RRS and PS-RRC variants.)
Fig. 2. Query times for KCPQ in k²-trees and plane-sweep variants, changing K.
is between 1.15 and 33 times faster than the best alternative, PS-RRC, depend-
ing on the joined datasets. Additionally, we are always the fastest option for
K ≥ 10⁴, and in some datasets from K = 10³. For smaller K, our proposal is
competitive but slightly slower than the Reverse-Run plane-sweep algorithms.
The lower right chart of Fig. 2 shows a subset of the results for the PxB join,
to better display the differences in performance for these smaller values of K.
Results are similar in the remaining experiments: for smaller K, the k 2 -tree algo-
rithm is 3–15% slower than PS-RRC, depending on the dataset. This evolution
with K is due to the characteristics of our algorithm: independently of K, we
need to traverse a relatively large number of regions in both k 2 -trees, even if
many of these regions are eventually discarded, so the base complexity of our
algorithm is comparable to that of Classic plane-sweep. On the other hand, this
means that many candidate pairs have already been expanded and enqueued,
so they can be immediately processed if more results are needed, making our
algorithm more efficient for larger values of K.
Next, we compare our algorithm for εDJQ with two plane-sweep variants,
Classic plane-sweep with Sliding Strip (εDJQ-CS) and Reverse-Run with Sliding
Semi-Circle (εDJQ-RRC). We select a representative subset of joined datasets,
namely LxP, PxR, RxB and PxB. In order to measure the scalability of the algo-
rithms, we perform tests for varying ε ∈ (7.5, 10, 25, 50, 75, 100) × 10⁻⁵ (these values of ε are associated with the original coordinates in degrees, but recall that in the k²-tree, coordinates are scaled to integer values, so values of ε are also scaled accordingly).
Figure 3 displays the results obtained for each join query. Our algorithm based
on k 2 -trees is slower for LxP, but much faster in most cases for PxR, RxB and
PxB (notice the logarithmic scale for query times). We attribute this difference
(Figure 3 consists of four plots of query time versus ε (×10⁻⁵), on a logarithmic time scale, for the joins LxP, PxR, RxB and PxB, comparing the k²-tree algorithm with εDJQ-CS and εDJQ-RRC.)
Fig. 3. Query times for εDJQ in k²-trees and plane-sweep variants, changing ε.
mainly to the size of the datasets: LxP joins the two smallest datasets, whereas
the remaining configurations involve one or two of the larger datasets. For these
3 larger joins, our algorithm is always much faster for the smaller values of ε. In
this case, our algorithm does not improve for larger ε, as for KCPQ, because no
early stop condition exists: we must traverse all candidate pairs as long as their
minimum distance is below ε, and for very large ε the added cost to traverse
the k 2 -trees to expand many individual pairs makes our proposal slower, even if
we are able to efficiently filter out many candidate regions. These queries with
smaller values of ε, in which we are much faster than plane-sweep algorithms,
are precisely the ones that would most benefit from our approach based on
compact data structures, since the number of query results increases with ε: for
ε = 100 × 10⁻⁵ we obtain over 10⁹ results, and these results would become the
main component of memory usage. Notice that, in practice, in our experiments
we measure the time to retrieve and count the query results, but do not store
them in RAM to avoid memory issues in some query configurations.
alternatives for small K, but become much faster than plane-sweep algorithms
for larger values of K. Our algorithm for εDJQ also achieves competitive query
times and is especially faster when the join query involves the largest datasets.
As future work, we plan to test the performance of our algorithms with other
variants of the k 2 -tree, that are able to obtain similar query times but require
much less space [1]. Particularly, our algorithms can be adjusted to work with
hybrid implementations of the k 2 -tree, that use different values of k, as well as
variants that use statistical compression in the lower levels of the conceptual
tree. Another interesting research line would be the application of these DJQ
algorithms based on k 2 -tree in Spark-based distributed spatial data management
systems, since they are more sensitive to memory constraints. Finally, we plan
to explore other DJQ and similar algorithms that may also take advantage of
the compression and query capabilities of k 2 -trees.
References
1. Brisaboa, N.R., Ladra, S., Navarro, G.: Compact representation of web graphs with
extended functionality. Inf. Syst. 39(1), 152–174 (2014)
2. Brisaboa, N.R., Cerdeira-Pena, A., de Bernardo, G., Navarro, G., Pedreira, Ó.: Extending general compact querieable representations to GIS applications. Inf. Sci. 506, 196–216 (2020)
3. Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M.: Algorithms
for processing k-closest-pair queries in spatial databases. Data Knowl. Eng. 49(1),
67–104 (2004)
4. Álvarez Garcı́a, S., Brisaboa, N., Fernández, J.D., Martı́nez-Prieto, M.A., Navarro,
G.: Compressed vertical partitioning for efficient RDF management. Knowl. Inf.
Syst. 44(2), 439–474 (2015)
5. Garcı́a-Garcı́a, F., Corral, A., Iribarne, L., Vassilakopoulos, M., Manolopoulos, Y.:
Efficient distance join query processing in distributed spatial data management
systems. Inf. Sci. 512, 985–1008 (2020)
6. Gutiérrez, G., Sáez, P.: The k closest pairs in spatial databases - when only one
set is indexed. GeoInformatica 17(4), 543–565 (2013)
7. Kim, Y.J., Patel, J.M.: Performance comparison of the R*-tree and the quadtree for
kNN and distance join queries. IEEE Trans. Knowl. Data Eng. 22(7), 1014–1027
(2010)
8. Mamoulis, N.: Spatial Data Management. Synthesis Lectures on Data Manage-
ment. Morgan & Claypool Publishers (2012)
1 Introduction
Relational databases (RDs) have been widely used and studied by researchers and practitioners for decades due to their simplicity, low data redundancy, high data consistency, and uniform query language (SQL). However, the size of web data has grown exponentially during the last two decades. The interconnections between web data entities (e.g., interconnections between YouTube videos or people on Facebook) are measured in billions or even trillions [6], which pushes the relational model to quickly reach its limits, as querying highly interconnected web data requires complex, time-consuming SQL queries. To overcome this limit, the graph database model is increasingly used on the Web due to its flexibility to present data in a normal form, its efficiency in querying huge amounts of data, and its analytical power. This suggests studying a mapping from RDs to graph databases (GDs) to benefit from the aforementioned advantages. This kind of mapping has not received much attention from researchers, since only a few works [4,5,13,14] have considered it. A real-life example of this mapping has been discussed in [13]: “investigative journalists have recently found, through graph analytics, surprising social
Stoica et al. [13,14] studied the mapping of RDs to GDs and of any relational query (formalized as an extension of relational algebra) into a G-CORE query. Firstly, the choice of source and destination languages hinders the practicability of the approach. Moreover, it is hard to see whether the mapping is semantics preserving, since no definition of graph data consistency is given. Attributes, primary keys and foreign keys are verbosely represented in the data graph, which makes the latter hard to understand and to query.
Orel et al. [11] discussed the mapping of relational data only into property graphs, paying attention neither to the schema nor to the mapping properties.
The Neo4j system provides a tool called Neo4j-ETL [1], which allows users to import their relational data into Neo4j to be represented as property graphs. Notice that the relational structure (both instance and schema) is not preserved during this mapping, since some tuples of the relational data (resp. relations of the relational schema) are represented as edges for storage concerns (as done in [5]). However, as remarked in [13], this may skew the results of some analytical tasks (e.g., the density of the generated graphs). Moreover, Neo4j-ETL does not allow the mapping of queries. Li et al. [9] studied an extension of Neo4j-ETL by proposing a mapping of SQL queries to Cypher queries. However, their mapping inherits the limits of Neo4j-ETL. In addition, no detailed algorithm is given for the query mapper, which makes it impossible to compare their proposal with other ones. This is also the limit of [10].
Finally, Angles et al. [2] studied the mapping of RDF databases to property graphs by considering both data and schema. They proved that their mapping ensures both the information and semantics preservation properties.
Table 1 summarizes the most important features of the related works.
2 Preliminaries
This section defines the main notions that will be used throughout this paper.
Let R be an infinite set of relation names, A an infinite set of attribute names with a special attribute tid, T a finite set of attribute types (String, Date, Integer, Float, Boolean, Object), and D a countably infinite domain of data values with a special value null.
We study in this paper the mapping of relational data into PG data. In addition, we show that any relational query over the source data can be translated into an equivalent graph query over the generated data graph. To this end, we model relational queries with the SQL language [8] and graph queries with the Cypher language [7], since each of these languages is the most used in its category. To establish a compromise between the expressive power of our mapping and its processing time, we consider a simple but very practical class of SQL queries and we define its corresponding class of Cypher queries. It is necessary to understand the relations between basic SQL queries and basic Cypher queries before studying the full expressive power of these languages.
The well-known syntax of SQL queries is “Select I from R where C” where: a) I is a set of items; b) R is a set of relation names; and c) C is a set of conditions. Intuitively, an SQL query first selects some tuples of the relations in R that satisfy the conditions in C. Then, the values of the items specified by I in these tuples are returned to the user.
On the other side, the Cypher queries considered in this paper have the syntax “Match patterns Where conditions Return items”. Notice that a Cypher query aims to find, based on edge-isomorphism, all subgraphs of some data graph that match the pattern specified by the query and also satisfy the conditions defined by the latter. Once found, only some parts (i.e. vertices, edges, and/or attributes) of these subgraphs are returned, based on the items specified by the Return clause. Therefore, the Match clause specifies the structure of the subgraphs we are looking for in the data graph; the Where clause specifies the conditions (based on vertices, edges and/or attributes) that these subgraphs must satisfy; and finally, the Return clause returns some parts of these subgraphs to the user as a table.
Inspired by [12–14], we define in this section the direct mapping from a relational database into a graph database and we discuss its properties. Given a relational database DR composed of SR and IR, a direct mapping consists of translating each entity in DR into a graph database without any user interaction. That is, any DR = (SR, IR) (with possibly empty SR) is translated automatically into a pair of property graphs (SG, IG) (with possibly empty SG), which we call a graph database. Let DR be an infinite set of relational databases, and DG an infinite set of graph databases. Based on these notions, we give the next definition of a direct mapping and its properties.
(schema and instance). We call our mapping Complete since some proposed
mappings (e.g. [11]) deal only with data and not schema.
Definition 5 (Complete Mapping). A complete mapping is a function CM :
DR → DG from the set of all RDs to the set of all GDs such that: for each
complete relational database DR = (SR , IR ), CM (DR) generates a complete
graph database DG = (SG , IG ).
In order to produce a complete graph database, our CM process is based on
two steps, schema mapping and instance mapping, which we detail hereafter.
1. VI and EI are the set of vertices and the set of edges as defined for schema
graph;
2. for each vertex vi ∈ VI , there exists a vertex vs ∈ VS such that: a) LI (vi ) =
LS (vs ); and b) for each pair (a : c) ∈ AI (vi ) there exists a pair (a : t) ∈ AS (vs )
with type(c)=t. We say that vi corresponds to vs , denoted by vi ∼ vs .
3. for each edge ei = (vi , wi ) in EI , there exists an edge es = (vs , ws ) in ES
such that: a) LI (ei ) = LS (es ); b) for each pair (a : c) ∈ AI (ei ) there exists a
pair (a : t) ∈ AS (es ) with type(c)=t; and c) vi ∼ vs and wi ∼ ws . We say
that ei corresponds to es , denoted by ei ∼ es .
Example 3. Figure 3 depicts (A) a relational schema and (B) its corresponding
schema graph. One can see that our schema mapping rules are respected: 1)
each relation is mapped to a vertex that contains the label of this relation, its
primary key and a list of its typed attributes; 2) each foreign key between two
relations (e.g. relations Admissions and Patients in part A) is represented by an
edge between the vertices of these two relations (e.g. edge Admissions-Patients
in part B).
It is clear that the special attribute vid is used to preserve the identification of tuples (i.e. the value of the attribute tid) during the mapping process.
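As a rough illustration of the instance-mapping step (the relations and values below are invented, and this Python fragment is not the IM algorithm itself), two small relations can be turned into instance-graph vertices and one edge, preserving each tuple identifier tid as the vertex identifier vid:

patients   = [{"tid": 1, "pid": "P1", "name": "Alice"}]
admissions = [{"tid": 7, "aid": "A9", "pid": "P1"}]   # pid: foreign key to Patients

vertices, edges = [], []
for t in patients:
    vertices.append({"vid": t["tid"], "label": "Patients",
                     "props": {k: v for k, v in t.items() if k != "tid"}})
for t in admissions:
    vertices.append({"vid": t["tid"], "label": "Admissions",
                     "props": {k: v for k, v in t.items() if k != "tid"}})
    # one edge per foreign-key reference, linking the two corresponding vertices
    target = next(v for v in vertices
                  if v["label"] == "Patients" and v["props"]["pid"] == t["pid"])
    edges.append({"label": "Admissions-Patients", "src": t["tid"], "dst": target["vid"]})

print(len(vertices), "vertices,", len(edges), "edge")   # 2 vertices, 1 edge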
4 Properties of CM
We show that our CM satisfies the three fundamental mapping properties [14]:
information preservation, query preservation and semantics preservation.
Proof. Theorem 1 can be proved easily by showing that there exists a computable mapping CM−1 : DG → DR that reconstructs the initial relational database from the generated graph database. Since our mapping CM is based on two steps (schema and instance mappings), CM−1 requires the definition of the SM−1 and IM−1 processes. Due to the space limit, the definition of CM−1 is given in the extended version of this paper [3].
Second, we show that the way CM maps relational instances into instance graphs
allows one to answer the SQL query over a relational instance by translating it
into an equivalent Cypher query over the generated graph instance.
– For each vertex vs ∈ VS with Pk(vs) = “a1, ..., an” and each vertex vi ∈ VI that corresponds to vs: there exists no pair (ai : NULL) ∈ AI(vi) with i ∈ [1, n]. Moreover, for each v′i ∈ VI \ {vi} that corresponds to vs, the following condition must not hold: for each i ∈ [1, n], (ai : c) ∈ AI(vi) ∩ AI(v′i).
– For each edge es ∈ ES with Fk(es, s) = “a1, ..., an” and Fk(es, d) = “b1, ..., bn”, if ei = (v1, v2) ∈ EI is an edge that corresponds to es, then we have: (ai : c) ∈ AI(v1) and (bi : c) ∈ AI(v2) for each i ∈ [1, n].
Intuitively, the consistency of graph databases is inspired by that of relational databases.
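Operationally, these two conditions can be checked directly on an instance graph; the small Python sketch below is a simplified illustration of the consistency conditions above (not a complete implementation), using vertex and edge dictionaries shaped like those of the earlier sketch:

def pk_consistent(vertices, label, pk_attrs):
    # Primary-key condition: no NULL component and no duplicate key among
    # the vertices that carry the given label.
    seen = set()
    for v in (v for v in vertices if v["label"] == label):
        key = tuple(v["props"].get(a) for a in pk_attrs)
        if any(x is None for x in key) or key in seen:
            return False
        seen.add(key)
    return True

def fk_consistent(edges, vertices, src_attrs, dst_attrs):
    # Foreign-key condition: along every edge, the source attribute values
    # must equal the corresponding destination attribute values.
    vmap = {v["vid"]: v for v in vertices}
    for e in edges:
        src, dst = vmap[e["src"]], vmap[e["dst"]]
        if any(src["props"].get(a) != dst["props"].get(b)
               for a, b in zip(src_attrs, dst_attrs)):
            return False
    return True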
Theorem 3. The direct mapping CM is semantics preserving.
Proof. The proof of Theorem 3 is straightforward and can be done by contradiction, based on the mapping rules of IM (Sect. 3.3). Let DR = (SR, IR) be a relational database and DG = (SG, IG) its equivalent graph database generated by CM. We suppose that CM is not semantics preserving. This means that either (A) IR is consistent and IG is inconsistent, or (B) IR is inconsistent while IG is consistent. We give only the proof of case (A), since that of case (B) can be done in a similar way.
We suppose that IR is consistent w.r.t. SR while IG is inconsistent w.r.t. SG. Based on Definition 8, IG is inconsistent if one of the following conditions holds:
1) There exists a vertex vi ∈ VI that corresponds to a vertex vs ∈ VS where: a) Pk(vs) = “a1, ..., an”; and b) (ai : NULL) ∈ AI(vi) for some attribute ai∈[1,n]. Based on the mapping rules of the IM process, vi corresponds to some tuple t in IR and the attributes a1, ..., an correspond to a primary key defined over IR by SG. Then (b) implies that the tuple t assigns a NULL value to the attribute ai, which makes IR inconsistent. However, we supposed that IR is consistent.
2) There are two vertices v1, v2 ∈ VI that correspond to a vertex vs ∈ VS where: a) Pk(vs) = “a1, ..., an”; and b) (ai : c) ∈ AI(v1) ∩ AI(v2) for each i ∈ [1, n]. Based on the mapping rules of the IM process, v1 (resp. v2) corresponds to some tuple t1 (resp. t2) in IR and the attributes a1, ..., an correspond to a primary key defined over IR by SG. Then (b) implies that the tuples t1 and t2 assign the same value to each attribute ai, which makes IR inconsistent. However, we supposed that IR is consistent.
3) There exists an edge ei = (v1, v2) in EI that corresponds to an edge e = (vs, vd) in ES where: a) Fk(e, s) = “a1, ..., an”; b) Fk(e, d) = “b1, ..., bn”; and c) there exists some attribute ai∈[1,n] with (ai : c1) ∈ AI(v1), (bi : c2) ∈ AI(v2), and c1 ≠ c2. Based on the mapping rules of the IM process, v1 (resp. v2) corresponds to some tuple t1 (resp. t2) in IR, vs (resp. vd) corresponds to some relation s (resp. d) in SR, and the function Fk over the edge e refers to a foreign key s[a1, ..., an] → d[b1, ..., bn] defined over IR by SG. Then (c) implies that the tuple t1 assigns a value to some attribute ai∈[1,n] that is different from that assigned by the tuple t2 to the attribute bi∈[1,n]. This means that there is a violation of the foreign key by the tuple t1, which makes IR inconsistent. However, we supposed that IR is consistent.
Therefore, each case of IG inconsistency leads to a contradiction, which means
that if IR is consistent then its corresponding IG cannot be inconsistent.
By carrying out the proof of part (B) in a similar way, we conclude that if IR is consistent (resp. inconsistent) then its corresponding instance graph IG must be consistent (resp. inconsistent). This completes the proof of Theorem 3.
References
1. Neo4j ETL. https://neo4j.com/developer/neo4j-etl/
2. Angles, R., Thakkar, H., Tomaszuk, D.: Mapping RDF databases to property graph
databases. IEEE Access 8, 86091–86110 (2020)
3. Boudaoud, A., Mahfoud, H., Chikh, A.: Towards a complete direct mapping from
relational databases to property graphs. CoRR abs/2210.00457 (2022). https://
doi.org/10.48550/arXiv.2210.00457
4. De Virgilio, R., Maccioni, A., Torlone, R.: Converting relational to graph databases,
p. 1 (2013)
5. De Virgilio, R., Maccioni, A., Torlone, R.: R2G: a tool for migrating relations to
graphs, pp. 640–643 (2014)
6. Fan, W., Wang, X., Wu, Y.: Answering pattern queries using views. IEEE Trans. Knowl. Data Eng. 28, 326–341 (2016)
7. Francis, N., et al.: Cypher: an evolving query language for property graphs, pp.
1433–1445 (2018)
8. Guagliardo, P., Libkin, L.: A formal semantics of SQL queries, its validation, and
applications. Proc. VLDB Endow. 11, 27–39 (2017)
9. Li, S., Yang, Z., Zhang, X., Zhang, W., Lin, X.: SQL2Cypher: automated data
and query migration from RDBMS to GDBMS. In: Zhang, W., Zou, L., Maamar,
Z., Chen, L. (eds.) WISE 2021. LNCS, vol. 13081, pp. 510–517. Springer, Cham
(2021). https://doi.org/10.1007/978-3-030-91560-5_39
10. Matsumoto, S., Yamanaka, R., Chiba, H.: Mapping RDF graphs to property
graphs, pp. 106–109 (2018)
11. Orel, O., Zakošek, S., Baranović, M.: Property oriented relational-to-graph
database conversion. Automatika 57, 836–845 (2017)
12. Sequeda, J.F., Arenas, M., Miranker, D.P.: On directly mapping relational
databases to RDF and OWL, pp. 649–658 (2012)
13. Stoica, R., Fletcher, G., Sequeda, J.F.: On directly mapping relational databases
to property graphs, pp. 1–4 (2019)
14. Stoica, R.-A.: R2PG-DM: a direct mapping from relational databases to property
graphs. Master’s thesis, Eindhoven University of Technology (2019)
A Matching Approach to Confer
Semantics over Tabular Data Based
on Knowledge Graphs
1 Introduction
Consolidating and implementing the FAIR principles (FAIR stands for Findability, Accessibility, Interoperability, and Reuse; see https://www.go-fair.org/fair-principles/) for data conveyed on the Web is a real need to facilitate their management and use. Indeed, the added value of such a process is the generation of new knowledge through the tasks of data integration, data cleaning, data mining and machine learning. Thus, the successful implementation of the FAIR principles drastically improves the value of data by making it findable and accessible while overcoming semantic ambiguities.
2 Key Notions
In what follows, we present some key notions related to our studied context, supported by some examples and illustrations.
– Knowledge Graph: Knowledge Graphs have been the focus of research since 2012, resulting in a wide variety of published descriptions and definitions. The lack of a common core is a fact that was also reported by Paulheim [7] in 2015. In his survey of Knowledge Graph refinement, Paulheim listed the minimum set of characteristics that must be present to distinguish Knowledge Graphs from other knowledge collections, which restricts the term to any graph-based knowledge representation. In the online review [7], the authors agreed that a more precise definition was hard to find at that point. This statement points out the need for a closer investigation and reflection in this area. Färber et al. [8] defined a Knowledge Graph as a Resource Description Framework (RDF) graph. Also, the authors stated that the KG term was coined by Google to describe any graph-based Knowledge Base (KB).
3 Literature Review
Various research works have tackled the issue of semantic table annotation. They vary according to the deployed techniques as well as the adopted approach. The CSV2KG system [13] consists of six phases. First, raw annotations
are assigned to each cell of the table considered as input. Candidates undergo
disambiguation by similarity measures applied to each candidate’s label. Then
the column types and the properties between the columns are inferred using the
seed annotations. In the next step, the inferred column types and properties
are used to create more refined header cell annotations (the cells in the first
column of a table). Further processing uses the newly generated header cells to
correct other table cells, using property annotations. Finally, new column types
are inferred using all available corrected cells. The source code for this system
is in Python. DAGOBAH [2] is a system implemented as a set of sequential
complementary tools. The three main functionalities are: (i) the identification of the semantic relationships between tabular data and Knowledge Graphs, (ii) Knowledge Graph enrichment by transforming the informational volume contained in the table into triples, and (iii) the production of metadata that can be used for reference, research and recommendation processes. DAGOBAH determines the list of candidate annotations using a SPARQL query. As for the LinkingPark [4] system, it takes a table as input, which is passed to an editor to extract entity links as well as property links. From the entity links, the method generates candidate entities via a cascading pipeline, which becomes the input for both the disambiguation module and the property-link detection module. The authors also integrated the property characteristics to determine the relationships between the different rows of the starting table. Furthermore, the JenTab [1] system operates
according to 9 modules, each of which has a well-defined objective. The first
module constitutes the system core and is responsible for most matching opera-
tions. Based on the search for annotation tags, this module generates candidates
for the three tasks (CTA, CEA, and CPA) and removes unlikely candidates. The
second module attempts to retrieve missing CEA matches based on the context of
a row and a column. Indeed, this processing only applies to the cells which have
not received any matching during the first module. Subsequently, in the third module, in the absence of candidates for the CTA task, the processing (CEA by Row) relaxes the requirements. The fourth module focuses on the selection of solutions. After a new stage of filtering of the CEA candidates using the context of each row, the authors opted to select the solutions with very high confidence values. Modules 5 and 7 attempt to fill in the gaps relating to the failure to identify potential candidates. In case new candidates are identified, modules 6 and 8 screen them again. Module 9 represents the last resort for generating solutions for features without candidate annotations. At this level, the authors assume that certain parts of the context are false, in order to re-examine each assertion. The authors reconsider all the candidates discarded in the previous stages to find the best solution among them. The LexMa system [12] starts with a preprocessing phase
cropping the text in the cell and converting the resulting strings to uppercase.
After that, the system retrieves the top 5 entities for each cell value from the
Wikidata search service. Subsequently, the lexical match is evaluated based on
cosine similarity. This similarity measure is applied to vectors coded and formed
from labels derived from cell values. Cell labels and values are tokenized, then
stop words are removed before creating input vectors. At this point, the authors
trigger an identical search on DBpedia via its dedicated search service operating
with SPARQL queries.
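As a toy illustration of this kind of lexical matching (this is not LexMa's actual code, and stop-word removal is omitted for brevity), the cosine similarity between two tokenized labels can be computed as follows in Python:

from collections import Counter
import math

def cosine(label_a, label_b):
    a, b = Counter(label_a.lower().split()), Counter(label_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(round(cosine("Paris France", "paris france city"), 2))   # 0.82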
To sum up, all of the above approaches rely on a learning strategy. Moreover, in a real-time context, applications become resource-hungry, which imposes obtaining the result as quickly as possible. This scenario requires the deployment of more logistical and technical effort. Moreover, the applicability of these solutions will strengthen semantic interoperability across all domains. In the following, we present a detailed description of our contribution, namely Kepler-aSI.
Each item is marked with a tag in Wikidata or DBpedia. This treatment allows identifying its semantics. The CTA task is performed based on the Wikidata or DBpedia APIs to look for an item according to its description. The information collected about a given entity and used in our approach is a list of instances (expressed by the instanceOf primitive and accessible by the P31 code), the subclasses (expressed by the subclassOf primitive and accessible by code P279) and the overlaps (expressed by the partOf primitive and accessible by code P361). At this point, we can process the CTA task using a SPARQL query. The SPARQL query is our means of interrogation, fed from the entity information that governs the choice of each data type, since they are a list of instances (P31), of
subclasses (P279) or a part of a class (P361). The result of the SPARQL query
may return a single type, but in some cases, the result is more than one type,
so in this case, no annotation is produced for the CTA task.
[Algorithm fragment: the final step of the CTA procedure annotates each column T.coli with the best-ranked candidate class, i.e., Annotate(T.coli, getBestRankedClass(class_annot)).]
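For illustration, a minimal Python sketch of the kind of Wikidata lookup described above could be written as follows; the entity QID, the user-agent string and the query shape are placeholders and do not reproduce Kepler-aSI's actual queries.

import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?prop ?value WHERE {
  VALUES ?prop { wdt:P31 wdt:P279 wdt:P361 }   # instance of, subclass of, part of
  wd:Q146 ?prop ?value .                       # Q146 is only a placeholder entity
}
"""

resp = requests.get(ENDPOINT,
                    params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "kepler-asi-sketch/0.1 (example)"})
for row in resp.json()["results"]["bindings"]:
    print(row["prop"]["value"], "->", row["value"]["value"])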
Our approach reuses the results of the CTA task process by introducing the
necessary modifications to the SPARQL query. If the operation returns more
than one annotation, we run a treatment based on examining the context of the
considered column, relative to what was obtained with the CTA task, to overcome
the ambiguity problem.
Indeed, the CPA task looks for annotating the relationship between two cells of a row via a property. This task is performed analogously to the CTA and CEA tasks. The only difference in the CPA task is that the SPARQL query must select both the entity and the corresponding attributes. The properties are easy to match since we have already determined them during the CEA and CTA task processing.
5.1 Round 1
In this first round of SemTab 2021, we have four tasks, namely: CTA-WD, CEA-WD,
CTA-DBP and CEA-DBP. The column type annotation (CTA-WD) assigns a Wikidata semantic type (a Wikidata entity) to a column. Cell Entity Annotation (CEA-WD) maps a cell to a KG entity. The processing carried out to search for correspondences on Wikidata is similarly carried out on DBpedia. Data for the
CTA-WD and CEA-WD tasks focus on Wikidata. As we explained in Sect. 1, Wiki-
data is structured according to the RDF formalism, i.e., subject (S), predicate
(P) and Object (O). Each element considered is marked with a label in Wiki-
data, thus guaranteeing maximum advantage of its semantics. The CTA-WD and CEA-WD task data contain 180 tables. In Table 1, we provide an input table example: the first column contains an entity label, while the other columns contain the associated attributes. (All the official experimental values obtained and presented within the framework of this study and of the challenge are available and searchable at https://www.aicrowd.com/challenges/semtab-2021; please refer to the first author's profile for a clear and detailed overview of all metrics. Note that there are three rounds.)
The column type annotation (CTA-DBP) assigns a DBpedia semantic type (a DBpedia entity) to a column. Cell Entity Annotation (CEA-DBP) matches a cell to a Knowledge Graph entity. The CTA-DBP and CEA-DBP task data also contain 180 tables. The results are summarized in Table 2.
5.2 Round 2
In Round 2, despite the distinction of the data and their grouping into two different families, both have a biological flavor. Due to advances in biological research techniques, new data are generated in the biomedical field and published in unstructured or tabular form. These data are delicate to integrate semantically due to their size and the complexity of the biological relationships maintained between the entities. A summary of the metrics for this round is given in Table 3.
Specifically, for tabular data annotation, the data representation can have
a significant impact on performance since each entity can be represented by
alphanumeric codes (e.g. chemical formulas or gene names) or even have multi-
ple synonyms. Therefore, the studied field would greatly benefit from automated methods to map entities, entity types, and properties to existing datasets in order to speed up the integration of new data across the domain. In this round,
the focus was on Wikidata through two test cases: BioTable and HardTable. The
different tasks: BioTable-CTA-WD, BioTable-CEA-WD and BioTable-CPA-WD on
the one hand, to which we add Hard-CTA-WD, Hard-CEA-WD and Hard-CPA-WD,
are all carried out on 110 tables. During Round 2, we focused on the disambiguation problem: we have to make a decision when several candidates are obtained after querying the KGs. Indeed, our approach during Round 1 was useful and allowed us to reuse certain achievements. At this stage, we affirm that the automatic disambiguation of elements remains a tedious task, given the effort of semantic analysis and interpretation it requires. Indeed, we have opted for
the use of an external resource, namely UniProt (https://www.uniprot.org) [11]. UniProt integrates, interprets and standardizes data from multiple selected resources to add biological knowledge and associated metadata to protein records and acts as a central
hub. UniProt was recognized as an ELIXIR core data resource in 2017 [6] and
received CoreTrustSeal certification in 2020. The data resource is fully Findable, Accessible, Interoperable and Reusable, thus concretizing the FAIR data principles [9].
5.3 Round 3
Round 3 has three main test families: BioDiv, represented by 50 tables; GitTables, represented by 1100 tables; and HardTables, represented by 7207 tables. Note that the stakes are the same for this round. Moreover, the evaluation is blind, i.e., the participants do not have access to the evaluation platform and its options. In other words, there is no test opportunity to adjust the approach parameters according to the characteristics of the input. In this round, we opted for UniProt to carry out treatments similar to those described in Round 2. Out of the 7 proposed tasks, Kepler-aSI managed to process 3. In the CTA-BioDiv task, we ranked first. For the GIT-DBP base test, we ranked second, and for CTA-HARD we ranked sixth. For the other cases, our method produced outputs containing duplications, and such correspondences do not allow us to obtain evaluation metrics.
References
1. Abdelmageed, N., Schindler, S.: JenTab: matching tabular data to knowledge
graphs. In: Proceedings of the Semantic Web Challenge on Tabular Data to
Knowledge Graph Matching (SemTab 2020) co-located with the 19th International
Semantic Web Conference (ISWC 2020), Virtual conference (originally planned to
be in Athens, Greece), 5 November 2020. CEUR Workshop Proceedings, vol. 2775,
pp. 40–49 (2020)
2. Chabot, Y., Labbé, T., Liu, J., Troncy, R.: DAGOBAH: an end-to-end context-
free tabular data semantic annotation system. In: Proceedings of the Semantic
Web Challenge on Tabular Data to Knowledge Graph Matching Co-located with
the 18th International Semantic Web Conference, SemTab@ISWC 2019, Auckland,
New Zealand, 30 October 2019. CEUR Workshop Proceedings, vol. 2553, pp. 41–48
(2019)
3. Chen, J., Jiménez-Ruiz, E., Horrocks, I., Sutton, C.: Learning semantic annotations
for tabular data. arXiv preprint arXiv:1906.00781 (2019)
4. Chen, S., et al.: Linkingpark: an integrated approach for semantic table inter-
pretation. In: Proceedings of the Semantic Web Challenge on Tabular Data to
Knowledge Graph Matching (SemTab 2020) co-located with the 19th International
Semantic Web Conference (ISWC 2020), Virtual conference (originally planned to
be in Athens, Greece), 5 November 2020. CEUR Workshop Proceedings, vol. 2775,
pp. 65–74 (2020)
5. Cremaschi, M., De Paoli, F., Rula, A., Spahiu, B.: A fully automated approach
to a complete semantic table interpretation. Futur. Gener. Comput. Syst. 112,
478–500 (2020)
6. Drysdale, R., et al.: The ELIXIR core data resources: fundamental infrastructure for the life sciences. Bioinformatics 38, 2636–2642 (2020)
7. Ehrlinger, L., Wöß, W.: Towards a definition of knowledge graphs. SEMANTiCS
(Posters, Demos, SuCCESS) 48, 1–4 (2016)
8. Färber, M., Bartscherer, F., Menne, C., Rettinger, A.: Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web 9(1), 77–129 (2018)
9. Garcia, L., Bolleman, J., Gehant, S., Redaschi, N., Martin, M.: Fair adoption,
assessment and challenges at UniProt. Sci. Data 6(1), 1–4 (2019)
10. Pham, M., Alse, S., Knoblock, C.A., Szekely, P.: Semantic labeling: a domain-
independent approach. In: Groth, P., et al. (eds.) ISWC 2016. LNCS, vol. 9981, pp.
446–462. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46523-4_27
11. Ruch, P., et al.: Uniprot. Tech. rep. (2021)
12. Tyagi, S., Jiménez-Ruiz, E.: LexMa: tabular data to knowledge graph matching
using lexical techniques. In: Proceedings of the Semantic Web Challenge on Tabu-
lar Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th
International Semantic Web Conference (ISWC 2020), Virtual conference (origi-
nally planned to be in Athens, Greece), 5 November 2020. CEUR Workshop Pro-
ceedings, vol. 2775, pp. 59–64 (2020)
13. Vandewiele, G., Steenwinckel, B., Turck, F.D., Ongenae, F.: CVS2KG: transform-
ing tabular data into semantic knowledge. In: Proceedings of the Semantic Web
Challenge on Tabular Data to Knowledge Graph Matching Co-located with the
18th International Semantic Web Conference, SemTab@ISWC 2019, Auckland,
New Zealand, 30 October 2019. CEUR Workshop Proceedings, vol. 2553, pp. 33–
40 (2019)
14. Zhang, Z.: Effective and efficient semantic table interpretation using tableminer+.
Seman. Web 8(6), 921–957 (2017)
τ JUpdate: A Temporal Update Language
for JSON Data
1 Introduction
The lightweight format JavaScript Object Notation (JSON) [15], which is
endorsed by the Internet Engineering Task Force (IETF), is currently being used
by various networked applications to store and exchange data. Moreover, many
of these applications running in IoT, cloud-based and mobile environments, like
Web services, online social networks, e-health, smart-city and smart-grid appli-
cations, require bookkeeping of the full history of JSON data updates so that
they can handle temporal JSON data, audit and recover past JSON document
versions, track JSON document changes over time, and answer temporal queries.
However, in the state-of-the-art of JSON data management [1,6,17,20–
22,27], there is neither a consensual nor a standard language for updating (i.e.,
inserting, modifying, and deleting) temporal JSON data, like the TSQL2 (Tem-
poral SQL2) [28] or SQL:2016 [24] language for temporal relational data. It is
worth mentioning here that the extension of the SQL language, named SQL/J-
SON [18,23,29] and standardized by ANSI to empower SQL to manage queries
and updates on JSON data, has no built-in support for updating time-varying
JSON data. In fact, even for non-temporal data, SQL/JSON is limited since it
does not support the update of a portion of a JSON document through the SQL
UPDATE statement [26].
Moreover, existing JSON-based NoSQL database management systems
(DBMSs) (e.g., MongoDB, Couchbase, CouchDB, DocumentDB, MarkLogic,
OrientDB, RethinkDB, and Riak) and both commercial DBMSs (e.g., IBM DB2
12, Oracle 19c, and Microsoft SQL Server 2019) and open-source ones (e.g.,
PostgreSQL 15, and MySQL 8.0) do not provide any support for maintaining
temporal JSON data [3,11,13].
In this context, with the aim of having an infrastructure that allows efficiently
creating and validating temporal JSON instance documents and inspired by
the τ XSchema design principles [9], we have proposed in [2] a comprehensive
framework, named τ JSchema (Temporal JSON Schema). In this environment,
temporal JSON data are produced from conventional (i.e., non temporal) JSON
data, by applying a set of temporal logical and physical characteristics that have
been already specified by the designer on the conventional JSON schema, that
is a JSON Schema [14] file that defines the structure of the conventional JSON
data:
– the temporal logical characteristics [2] allow designers to specify which com-
ponents (e.g., objects, object members, arrays, . . . ) of the conventional JSON
schema can vary over valid and/or transaction time;
– the temporal physical characteristics [2] allow designers to specify where
timestamps should be placed and how the temporal aspects should be repre-
sented.
conserving the semantics of the temporal JSON data, that is keeping the same
temporal logical characteristics); (iii) it does not require changes to existing
JSON instance/schema files nor revisions of the JSON technologies (e.g., the
IETF specification of the JSON format [15], the IETF specification of the JSON
Schema language [14], JSON-based NoSQL DBMSs, JSON editors/validators,
JSON Schema editors/generators/validators, etc.). However, there is no feature
for temporal JSON instance update in τ JSchema.
With the aim of overcoming the lack of an IETF standard or recommenda-
tion for updating JSON data, we have recently proposed a powerful SQL-like
language, named JUpdate (JSON Update) [6], to allow users to perform updates
on (non-temporal) JSON data. It provides fourteen user-friendly high-level oper-
ations (HLOs) to fulfill the different JSON update requirements of users and
applications; not only simple/atomic values but also full portions (or chunks) of JSON documents can be manipulated (i.e., inserted, modified, deleted, copied or moved). The semantics of JUpdate is based on a minimal and complete set of six primitives (i.e., low-level operations, which can be easily implemented) for updating JSON documents. The data model behind JUpdate is the IETF standard JSON data model [15]. Thus, on the one hand, it is independent of any underlying DBMS, which simplifies its use and implementation, and, on the other hand, it can be used to maintain generic JSON documents.
Taking into account the requirements mentioned above, we considered it very interesting to fill the evidenced gap and to propose a temporal JSON update lan-
guage that would help users in the non-trivial task of updating temporal JSON
data. Moreover, based on our previous work, we think that (i) the JUpdate lan-
guage [6] can be a good starting point for deriving such a temporal JSON update
language, and (ii) the τ JSchema framework can be used as a suitable environ-
ment for defining the syntax and semantics of a user-friendly temporal update
language, mainly due to its support of logical and physical data independence.
For all these reasons, we propose in this paper a temporal update language
for JSON data named τ JUpdate (Temporal JUpdate) and define it as a temporal
extension of our JUpdate language, to allow users to update (i.e., insert, modify,
and delete) JSON data in the τ JSchema environment. To this purpose, both
the syntax and the semantics of the JUpdate statements have been extended
to support temporal aspects. The τ JUpdate design allows users to specify in
a friendly manner and efficiently execute temporal JSON updates. In order to
motivate τ JUpdate and illustrate its use, we will provide a running example.
The rest of the paper is structured as follows. The next section presents
the environment of our work and motivates our proposal. Section 3 proposes
τ JUpdate, the temporal JSON instance update language for the τ JSchema
framework. Section 4 illustrates the use of some operations of τ JUpdate, by
means of a short example. Section 5 provides a summary of the paper and some
remarks about our future work.
We assume that a company uses a JSON repository for the storage of the infor-
mation about the devices that it manufactures and sells, where each device is
described by its name and cost price. For simplicity, let us consider a temporal
granularity of one day for representing the data change events (and, therefore,
for temporal data timestamping). We assume that the initial state of the device
repository, valid from February 1, 2022, can be represented as shown in Fig. 1:
it contains, in a JSON file named device1.json, data about one device called
CameraABC costing €35.
{ " device
devices ":[
{ " devic
vice ":{
"nam
name ":
":" Came
meraABC " ,
"co
costPri
rice ":3
":35 } } ] }
Fig. 1. The initial state of the device repository (file device1.json, on February 01,
2022).
Then, we assume that, effective from April 15, 2022, the company starts
producing a new device named CameraXYZ with a cost price of €42 and Cam-
eraABC’s cost price is raised by 8%. The new state of the device repository can
be represented in a JSON file named device2.json as shown in Fig. 2. Changed
parts are presented in red color.
Fig. 2. A new state of the device repository (file device2.json, on April 15, 2022). (Color
figure online)
{ " t e m p o r a l J S O N D o c u m e n t ":{
"c o n v e n t i o n a l J S O N D o c u m e n t ":{
"sliceSequence ":[
{" slice ":{
"location ":" device1 . json " ,
"begin ":"2022 -02 -01" } } ,
{" slice ":{
"location ":" device2 . json " ,
"begin ":"2022 -04 -15" } } ] } } }
Fig. 3. The temporal JSON document representing the entire history of the device
repository (file deviceTJD.json, on April 15, 2022).
Fig. 4. The squashed JSON document corresponding to the entire history of the device
repository (file deviceSJD.json, on April 15, 2022).
Notice that the squashed JSON document deviceSJD.json in Fig. 4 also corresponds to one of the manifold possible representations of our temporal JSON data [3] without the τJSchema approach.
After that, let us consider that we have to record in the device repository that
the company has stopped manufacturing the device CameraABC effective from
May 25, 2022. At the state-of-the-art of JSON technology, we could use JUpdate
HLOs to directly perform the required updates on the deviceSJD.json file in
Fig. 4. A skilled developer, expert in both temporal databases and JUpdate,
and aware of the precise structure of the squashed document, will satisfy such
requirements via the following JUpdate statement:
UPDATE deviceSJD.json
PATH $.devices[@.device[@.name="CameraABC"
&& @.VTend="Forever"].VTend]
VALUE "2022-05-24" (S1)
Although this solution is simpler than solution (S1), it requires from the devel-
oper a detailed knowledge of the specific temporal structuring of the JSON file
including version organization and timestamping. Another consequence is that
such a solution template would not be portable to another setting in which a
different temporal structuring of JSON data is adopted.
Moreover, our second contribution is to integrate the temporal JUpdate
extension into the τ JSchema framework, in order to enjoy the logical and physical
independence property. In this framework, the required update will be specified
via the following τ JUpdate DeleteValue statement:
DELETE FROM deviceSJD.json
PATH $.devices[@.device[@.name="CameraABC"]]
VALID from "2022-05-24" (S3)
The update could be applied either to the temporal JSON document (i.e.,
deviceTJD.json) or to its squashed version (i.e., deviceSJD.json); the system
using the temporal logical and physical characteristics can manage both ways
correctly. Notice that, ignoring the VALID clause, the solution (S3) represents
exactly the same way we would specify the deletion of the device CameraABC’s
data in a non-temporal environment (e.g., executing it on the device2.json file in
Fig. 2). In practice, we want to allow the developer to focus on the structuring
of data simply as defined in the conventional JSON schema and not on the tem-
poral JSON schema, leaving the implementation details and their transparent
management to the system (e.g., the mapping to a squashed JSON document,
being aware of the temporal characteristics). This means, for example, that in
order to specify a cost price update, we want τJUpdate users to be able to deal with
updates to the “device.costPrice” value instead of dealing with updates to the
“device.costPrice” array of objects, where each object represents a version of a
cost price and has three properties: “VTbegin” (the beginning of the valid-time
timestamp of the version), “VTend” (the end of the valid-time timestamp of the
version), and “value” (the value of the version).
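For instance, under the representation just described, the squashed cost price of CameraABC would look like the following array of version objects (written here as a Python literal; the exact layout depends on the chosen temporal physical characteristics, so this fragment is only an illustrative guess and is not copied from Fig. 4):

camera_abc_cost_price = [
    {"VTbegin": "2022-02-01", "VTend": "2022-04-14", "value": 35},
    {"VTbegin": "2022-04-15", "VTend": "Forever",    "value": 37.8},  # 35 raised by 8%
]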
Notice that the way in which temporal updates of JSON data are specified with our τJUpdate language corresponds exactly to the way updates of temporal relational data can be specified using a temporal query language like TSQL2 [28] or SQL:2016 [24], that is, using the same update operations that are used in a non-temporal context, augmented with a VALID clause to specify the applicability period of each update operation.
In sum, the motivation of our approach is twofold: on the one hand, (i) leveraging the logical/physical independence supported by the τJSchema framework to the JUpdate language and, on the other hand, (ii) equipping τJSchema with a user-friendly update language, which is consistent with its design philosophy.
The management of transaction time does not require any syntactic extension
to the JUpdate language: owing to the transaction time semantics, only current
data can be updated and the “applicability period” of the update is always
[Now, UntilChanged], which is implied and cannot be overridden by users. On
the contrary, the management of valid time is under the user’s responsibility.
Hence, syntactic extensions of the JUpdate language are required to allow users
to specify a valid time period representing the “applicability period” of the
update. To this purpose, the JUpdate update HLOs [6] are augmented with a
VALID clause as shown in Fig. 5.
Due to space limitations, we do not consider here other JUpdate HLOs (e.g.,
InsertMember, ReplaceMember, UpdateObjects) as they are used for specifying
complex updates; they will be investigated in a future work. The temporal expressions “from T” and “to T”, where T is a temporal value, are used as syntactic sugar for the temporal expressions “in [T, Forever]” and “in [Beginning, T]”, respectively.
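The desugaring of these shorthands is straightforward; the following small Python sketch (our own illustration, keeping Beginning and Forever as symbolic bounds) expands a "from T" or "to T" expression into the corresponding period:

def expand_valid(clause):
    kind, t = clause.split(maxsplit=1)      # e.g. 'from "2022-05-25"'
    t = t.strip('"')
    if kind == "from":
        return (t, "Forever")               # VALID from T  ==  VALID in [T, Forever]
    if kind == "to":
        return ("Beginning", t)             # VALID to T    ==  VALID in [Beginning, T]
    raise ValueError("expected 'from T' or 'to T'")

print(expand_valid('from "2022-05-25"'))    # ('2022-05-25', 'Forever')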
As far as the semantics of τJUpdate is concerned, we can define it, for the sake of simplicity, by considering JSON update operations on the temporal JSON document in its unsquashed form. Based on the well-known theory developed in the temporal database field [12,19], the operational semantics of a τJUpdate HLO, equal to a JUpdate HLO augmented with the VALID clause, can be defined as follows:
The last step aims at limiting the unnecessary proliferation of slices, which would give rise to redundant JSON files in the unsquashed setting. Two slices, jdoc_vers1 and jdoc_vers2, with timestamps VTimestamp1 and VTimestamp2, respectively, can be coalesced when jdoc_vers1 and jdoc_vers2 are equal and VTimestamp1 meets VTimestamp2 [28]. In this case, coalescing produces one slice jdoc_vers1 with timestamp VTimestamp1 ∪ VTimestamp2.
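A minimal Python sketch of this coalescing step is given below; it assumes the one-day granularity of the running example, treats each slice as a (content, period) pair, and ignores the symbolic bound Forever for simplicity (the data and helper names are our own illustration):

from datetime import date, timedelta

def meets(p1, p2):
    # p1 = (begin1, end1) meets p2 = (begin2, end2) if end1 is the day before begin2.
    return date.fromisoformat(p1[1]) + timedelta(days=1) == date.fromisoformat(p2[0])

def coalesce(slices):
    # slices: list of (content, (begin, end)) pairs ordered by begin.
    out = [slices[0]]
    for content, period in slices[1:]:
        prev_content, prev_period = out[-1]
        if content == prev_content and meets(prev_period, period):
            out[-1] = (prev_content, (prev_period[0], period[1]))  # union of the periods
        else:
            out.append((content, period))
    return out

slices = [({"costPrice": 35}, ("2022-02-01", "2022-04-14")),
          ({"costPrice": 35}, ("2022-04-15", "2022-05-24"))]
print(coalesce(slices))   # one slice covering 2022-02-01 .. 2022-05-24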
This definition of the τ JUpdate HLO semantics, which can be easily extended
to the transaction-time or bitemporal case, is in line with the τ JSchema prin-
ciples, considering a temporal JSON document as representing a sequence of
conventional JSON documents, and reuses the standard (non-temporal) JUp-
date HLOs.
Even if the temporal JSON document jdoc is physically stored in squashed
form, the above semantics can still be used to evaluate a τ JUpdate HLO after
the document has been explicitly unsquashed. The results of the evaluation can
then be squashed back to finally produce an updated temporal JSON document.
Although correct from a theoretical point of view, such a procedure could be inef-
ficient in practice, in particular when the temporal JSON document is composed
of several slices. To resolve this problem, a different method can be applied for
updating temporal JSON documents that are stored in squashed form. To this
end, the semantics of τ JUpdate HLOs can be defined in an alternative way, as
shown in the next subsection (the solution is inspired by our previous work on updates to temporal XML data [7]).
The first one is an example of InsertValue HLO that inserts CameraXYZ’s data,
while the second one is an example of UpdateValue HLO that increases Camer-
aABC’s cost price. The result of this HLO sequence corresponds to the temporal
JSON document in Fig. 3 completed by the slices in Fig. 1 and Fig. 2, and which
has been shown in squashed form in Fig. 4.
As an example of DeleteValue HLO, we can consider the τ JUpdate HLO (S3)
in Sect. 2.2, deleting CameraABC’s data effective from 2022-05-25. As an exam-
ple of RenameMember HLO, we can consider changing the name of the “devices”
object to “products”, also valid from 2022-05-25. Notice that such an operation
could be more properly considered as a conventional JSON schema change, as it
acts on metadata rather than on data and, thus, could be better effected using the
high-level JSON schema change operation RenameProperty, acting on the conven-
tional JSON schema, we previously defined in [5], which is automatically propa-
gated to extant conventional JSON data. However, as part of τ JUpdate, we can
also consider it a JSON data update that propagates indeed to the JSON schema
by means of the implicit JSON schema change mechanism that we have proposed
in [4]. The global effects in the τ JSchema framework, anyway, are exactly the
same. Such updates can be performed via the following τ JUpdate HLOs:
DELETE FROM deviceTJD.json
PATH $.devices[device.name="CameraABC"]
VALID from "2022-05-25";
ALTER DOCUMENT deviceTJD.json
OBJECT $.devices
RENAME MEMBER devices TO products
VALID from "2022-05-25"
The result of this HLO sequence is the new temporal JSON document shown in
Fig. 6 with the new slice shown in Fig. 7.
Fig. 6. The new temporal JSON document representing the whole history of the device
repository (file deviceTJD.json).
Fig. 8. The squashed JSON document (file deviceSJD1.json) corresponding to the first
conventional JSON schema version deviceCJS1.json. (Color figure online)
5 Conclusion
References
1. Bourhis, P., Reutter, J., Vrgoč, D.: JSON: data model and query languages. Inf.
Syst. 89, 101478 (2020)
2. Brahmia, S., Brahmia, Z., Grandi, F., Bouaziz, R.: τ JSchema: a framework for
managing temporal JSON-based NoSQL databases. In: Hartmann, S., Ma, H. (eds.)
DEXA 2016. LNCS, vol. 9828, pp. 167–181. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44406-2_13
3. Brahmia, S., Brahmia, Z., Grandi, F., Bouaziz, R.: A disciplined approach to tem-
poral evolution and versioning support in JSON data stores. In: Emerging Tech-
nologies and Applications in Data Processing and Management, pp. 114–133. IGI
Global (2019)
4. Brahmia, Z., Brahmia, S., Grandi, F., Bouaziz, R.: Implicit JSON schema version-
ing driven by big data evolution in the τ JSchema framework. In: Farhaoui, Y. (ed.)
BDNT 2019. LNNS, vol. 81, pp. 23–35. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-23672-4_3
5. Brahmia, Z., Brahmia, S., Grandi, F., Bouaziz, R.: Versioning schemas of JSON-
based conventional and temporal big data through high-level operations in the
τ JSchema framework. Int. J. Cloud Comput. 10(5–6), 442–479 (2021)
6. Brahmia, Z., Brahmia, S., Grandi, F., Bouaziz, R.: JUpdate: a JSON update lan-
guage. Electronics 11(4), 508 (2022)
7. Brahmia, Z., Grandi, F., Bouaziz, R.: τ XUF: a temporal extension of the XQuery
update facility language for the τ XSchema framework. In: Proceedings of the
23rd International Symposium on Temporal Representation and Reasoning (TIME
2016), Technical University of Denmark, Copenhagen, Denmark, 17–19 October
2016, pp. 140–148 (2016)
8. Burns, T., et al.: Reference model for DBMS standardization, database architecture
framework task group (DAFTG) of the ANSI/X3/SPARC database system study
group. SIGMOD Rec. 15(1), 19–58 (1986)
9. Currim, F., Currim, S., Dyreson, C., Snodgrass, R.T.: A tale of two schemas:
creating a temporal XML schema from a snapshot schema with τ XSchema. In:
Bertino, E., et al. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 348–365. Springer,
Heidelberg (2004). https://doi.org/10.1007/978-3-540-24741-8_21
10. Davoudian, A., Chen, L., Liu, M.: A survey on NoSQL stores. ACM Comput. Surv.
(CSUR) 51(2), 1–43 (2018)
11. Goyal, A., Dyreson, C.: Temporal JSON. In: 2019 IEEE 5th International Confer-
ence on Collaboration and Internet Computing (CIC 2019), pp. 135–144 (2019)
12. Grandi, F.: Temporal databases. In: Encyclopedia of Information Science and Tech-
nology, Third Edition, pp. 1914–1922. IGI Global (2015)
13. Hu, Z., Yan, L.: Modeling temporal information with JSON. In: Emerging Tech-
nologies and Applications in Data Processing and Management, pp. 134–153. IGI
Global (2019)
14. Internet Engineering Task Force: JSON Schema: A Media Type for Describing
JSON Documents, Internet-Draft, 19 March 2018. https://json-schema.org/latest/json-schema-core.html
15. Internet Engineering Task Force: The JavaScript Object Notation (JSON) Data
Interchange Format, Internet Standards Track document, December 2017. https://
tools.ietf.org/html/rfc8259
16. Irshad, L., Ma, Z., Yan, L.: A survey on JSON data stores. In: Emerging Technolo-
gies and Applications in Data Processing and Management, pp. 45–69. IGI Global
(2019)
17. Irshad, L., Yan, L., Ma, Z.: Schema-based JSON data stores in relational databases.
J. Database Manag. (JDM) 30(3), 38–70 (2019)
18. ISO/IEC, Information technology − Database languages − SQL Technical Reports
− Part 6: SQL support for JavaScript Object Notation (JSON), 1st Edition, Tech-
nical report ISO/IEC TR 19075-6:2017(E), March 2017. http://standards.iso.org/ittf/PubliclyAvailableStandards/c067367_ISO_IEC_TR_19075-6_2017.zip
19. Jensen, C., Snodgrass, R.: Temporal database. In: Liu, L., Özsu, M.T. (eds.) Ency-
clopedia of Database Systems, 2nd edn., pp. 3945–3949. Springer, New York (2018).
https://doi.org/10.1007/978-1-4614-8265-9_395
20. Liu, Z.: JSON data management in RDBMS. In: Emerging Technologies and Appli-
cations in Data Processing and Management, pp. 20–44. IGI Global (2019)
21. Liu, Z., Hammerschmidt, B., McMahon, D.: JSON data management: supporting
schema-less development in RDBMS. In: Proceedings of the 2014 ACM SIGMOD
International Conference on Management of Data (SIGMOD 2014), Snowbird, UT,
USA, 22–27 June 2014, pp. 1247–1258 (2014)
22. Lv, T., Yan, P., Yuan, H., He, W.: Linked lists storage for JSON data. In: 2021
International Conference on Intelligent Computing, Automation and Applications
(ICAA 2021), pp. 402–405 (2021)
23. Melton, J., et al.: SQL/JSON part 1, DM32.2-2014-00024R1, 6 March 2014.
https://www.wiscorp.com/pub/DM32.2-2014-00024R1_JSON-SQL-Proposal-1.pdf
24. Michels, J., et al.: The new and improved SQL:2016 standard. ACM SIGMOD
Rec. 47(2), 51–60 (2018)
25. NoSQL Databases List by Hosting Data − Updated 2020. https://hostingdata.co.uk/nosql-database/
26. Petković, D.: SQL/JSON standard: properties and deficiencies. Datenbank-
Spektrum 17(3), 277–287 (2017)
27. Petković, D.: Implementation of JSON update framework in RDBMSs. Int. J.
Comput. Appl. 177, 35–39 (2020)
28. Snodgrass, R.T., et al. (eds.): The TSQL2 Temporal Query Language. Kluwer
Academic Publishing, New York (1995)
29. Zemke, F., et al.: SQL/JSON part 2 − Querying JSON, ANSI INCITS
DM32.2-2014-00025r1, 4 March 2014. https://www.wiscorp.com/pub/DM32.2-2014-00025r1-sql-json-part-2.pdf
Author Index
Faiz, Sami 236
Fawzi, Sahar Ali 102
Ferrarotti, Flavio 72
Fetteha, Marwan A. 3
Gad, Eyad 26
Gamal, Aya 26
Ghoneim, Vidan Fathi 58
Gourari, Aya 176
Said, Lobna A. 3
Sayed, Wafaa S. 3
Sebaq, Ahmad 16
Sehili, Ines 176
Selim, Sahar 26
Sochor, Hannes 72
Yousef, Ahmed H. 162
Zaki, Nesma M. 147