QNLP
Abstract. One of the most important challenges in the field of software code audit is the presence of vulnerabilities in software source code. Every year, more and more software flaws are found, either internally in proprietary code or revealed publicly. These flaws are highly likely to be exploited and can lead to system compromise, data leakage, or denial of service. Large volumes of open-source C and C++ code are now available, making it possible to build a large-scale classical and quantum machine-learning system for function-level vulnerability identification. We assembled a sizable dataset of millions of open-source functions that point to potential exploits. We created an efficient and scalable vulnerability detection method based on a deep neural network model, Long Short-Term Memory (LSTM), and a quantum machine learning model, Quantum Long Short-Term Memory (QLSTM), that can learn features extracted from the source code. The source code is first converted into a minimal intermediate representation to remove pointless components and shorten dependencies. Previous studies lack analysis of the source-code features that enable models to recognize flaws in real-life examples. Therefore, we preserve the semantic and syntactic information using state-of-the-art word embedding algorithms such as GloVe and fastText. The embedded vectors are subsequently fed into the classical and quantum LSTM networks to classify the possible vulnerabilities. To measure performance, we used evaluation metrics such as F1 score, precision, recall, accuracy, and total execution time. We compared the results derived from the classical LSTM and the quantum LSTM using basic feature representation as well as semantic and syntactic representation. We found that the QLSTM with semantic and syntactic features detects vulnerabilities significantly more accurately and runs faster than its classical counterpart.
⋆ Supported by organization x.
Keywords: Cyber Security; Vulnerability Detection; Classical Machine
Learning; Feature Extraction; Quantum Natural Language Processing;
1 Introduction
2 Literature Review
3 Methodology
We adopted Quantum Long Short-Term Memory (QLSTM), a model from the subfield of Quantum Machine Learning (QML), for this research and applied it to the text dataset. Figure 1 demonstrates the framework representing the implementation process. First, we pre-processed the raw data before providing it as input to the QML model. We used Python, the Keras tokenizer, the scikit-learn LabelEncoder, Keras sequence padding, Keras embedding layers, and GloVe and fastText embeddings to pre-process the dataset. The Keras library was used to extract the basic input representation, while the semantic and syntactic representations were extracted using GloVe and fastText embeddings. For the experiment, we consider only the balanced portion of the dataset, which contains an almost equal number of vulnerable and non-vulnerable samples, to avoid underfitting or overfitting. Quantum machine learning models require numerical input, so we converted the text data into numerical values, and all numerical values were normalized to a similar scale. We compared results derived from the LSTM and QLSTM models with and without the semantic and syntactic representations. Results are shown in Section 4.
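As a rough illustration of the balancing step described above, the following sketch assumes the raw data is loaded into a pandas DataFrame with hypothetical "code" and "label" columns; the actual dataset schema and file name may differ.

```python
import pandas as pd

# Hypothetical file and column names; the real dataset schema may differ.
df = pd.read_csv("functions.csv")   # columns: "code", "label" (1 = vulnerable)

# Keep an (almost) equal number of vulnerable and non-vulnerable functions
# so the classifier is not trained on a heavily skewed class distribution.
vulnerable = df[df["label"] == 1]
non_vulnerable = df[df["label"] == 0].sample(n=len(vulnerable), random_state=42)
balanced = pd.concat([vulnerable, non_vulnerable]).sample(frac=1, random_state=42)
```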
We examined the tokenized dataset and found some common words with their corresponding counts: = (505570), if (151663), n (113301), == (92654), return (77438), * (71897), the (71595), int (53673), < (43703), + (41855), for (35849), char (33334), else (31358).
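Such token counts can be obtained, for example, with Python's collections.Counter; the whitespace split below is a simplification of whatever lexing was actually used.

```python
from collections import Counter

token_counts = Counter()
for source in balanced["code"]:           # `balanced` from the sketch above
    token_counts.update(source.split())   # naive whitespace tokenization

print(token_counts.most_common(13))       # e.g. [('=', 505570), ('if', 151663), ...]
```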
Tokenizer In a natural language processing project, the basic units called words are the focus of the computational process. In the NLP field, a word is also known as a token. Tokenization is the process that separates sentences into small units that can be used for several purposes. For this purpose, Keras provides a tokenizer class called Tokenizer. The Tokenizer contains two methods, tokenize() and detokenize(), which go through the plain text and separate the words. We used the Keras tokenizer for our initial data preprocessing step [kathuria2019real].
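A minimal sketch of this tokenization and padding step with the Keras preprocessing utilities; the vocabulary size and sequence length are illustrative assumptions, not values from the paper.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder

MAX_WORDS, MAX_LEN = 10000, 300   # illustrative vocabulary size and sequence length

# Empty filter list and lower=False keep C/C++ operators and identifier casing intact.
tokenizer = Tokenizer(num_words=MAX_WORDS, filters="", lower=False)
tokenizer.fit_on_texts(balanced["code"])
sequences = tokenizer.texts_to_sequences(balanced["code"])
X = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")

# Encode the class labels as integers.
y = LabelEncoder().fit_transform(balanced["label"])
```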
GloVe Semantic vector space language algorithms replace each word with a vector. These vectors are useful because a machine cannot understand words directly, only vectors. Therefore, numerous applications can use these vectors as features: question answering [28], parsing [29], document classification [30], information retrieval, and named entity recognition [31]. GloVe is a language algorithm for obtaining vector representations of words. It is an unsupervised learning algorithm whose training is performed on global word-word co-occurrence statistics from a corpus [32].
Pennington et al. [32] argue that the online scanning process followed by word2vec is inferior because it does not provide global statistical values about word co-occurrences. The model produces a vector space with a valid substructure, with 75% performance on a real-life example. GloVe builds on two concepts: the local context window and global matrix factorization. CBOW and Skip-gram are local context window methods; CBOW is better for frequent words, whereas Skip-gram works well on small datasets with rare observations. Global matrix factorization, a matrix factorization method derived from linear algebra, is responsible for reducing the large term-frequency matrices. The matrices record the co-occurrence of words.
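One common way to inject GloVe vectors into such a pipeline is to build an embedding matrix aligned with the tokenizer's word index. The sketch below assumes a pre-trained glove.6B.100d.txt file and reuses the MAX_WORDS and tokenizer names from the earlier sketches.

```python
import numpy as np

EMBED_DIM = 100   # must match the dimensionality of the GloVe file used

# Load pre-trained GloVe vectors (the file name is an assumption).
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Build an embedding matrix aligned with the Keras tokenizer's word index;
# words missing from GloVe keep a zero vector.
embedding_matrix = np.zeros((MAX_WORDS, EMBED_DIM))
for word, idx in tokenizer.word_index.items():
    if idx < MAX_WORDS and word in glove:
        embedding_matrix[idx] = glove[word]
```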
The vector representations of the software source code are fed to the LSTM and QLSTM models. The dataset is divided into training and validation portions for training the models. Finally, the test dataset is used to evaluate each trained model.
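A minimal sketch of such a split using scikit-learn; the 80/20 and 90/10 ratios are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

# Hold out a test set, then carve a validation set out of the training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
```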
The sigmoid layer determines which values should be updated, while the tanh layer generates a vector of new candidate values for the cell state $C_t$. The sigmoid (input gate) layer's equation is:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$   (3)

The tanh layer's equation is:

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$   (4)

The new cell state $C_t$ is obtained by adding $\tilde{C}_t * i_t$ to $C_{t-1} * f_t$. The updated state's equation is:

$C_t = C_{t-1} * f_t + \tilde{C}_t * i_t$   (5)

Finally, to determine which parts of the state to output, a sigmoid layer selects the output and the cell state is passed through a tanh function:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$   (6)

$h_t = o_t * \tanh(C_t)$   (7)

Here, $h_t$ is the output of the current cell and serves as input to the next hidden layer.
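The gate equations above can be summarized in a compact NumPy sketch of a single LSTM step; the forget gate $f_t$ (defined earlier in the paper) is included for completeness, and the weight shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W and b are dicts of weight matrices / bias vectors
    for the forget (f), input (i), candidate (c) and output (o) gates."""
    concat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ concat + b["i"])       # input gate, Eq. (3)
    c_tilde = np.tanh(W["c"] @ concat + b["c"])   # candidate state, Eq. (4)
    c_t = f_t * c_prev + i_t * c_tilde            # new cell state, Eq. (5)
    o_t = sigmoid(W["o"] @ concat + b["o"])       # output gate, Eq. (6)
    h_t = o_t * np.tanh(c_t)                      # hidden state, Eq. (7)
    return h_t, c_t
```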
The U(x) block represents the state preparation, which converts the classical input x into a quantum state. The subsequent variational block contains the learnable parameters that are optimized during the gradient descent process. We measure a subset of the encoded qubits to retrieve a classical bit string, for example, 0100.
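A minimal PennyLane sketch of such a block, assuming angle encoding for the U(x) state preparation and strongly entangling layers for the variational part; both are common choices, and the paper's exact circuit may differ. Measuring Pauli-Z expectation values on each wire plays the role of reading classical information back out of the encoded qubits.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits, n_layers = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def vqc(x, weights):
    # U(x): encode the classical input into a quantum state.
    qml.AngleEmbedding(x, wires=range(n_qubits))
    # Variational block with learnable parameters, optimized by gradient descent.
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    # Measure the qubits to obtain classical values.
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weights = np.random.uniform(size=(n_layers, n_qubits, 3), requires_grad=True)
print(vqc(np.array([0.1, 0.2, 0.3, 0.4]), weights))
```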
Quantum LSTM We modify the traditional LSTM architecture into a quantum version by replacing the neural networks in the LSTM cells with Variational Quantum Circuits (VQCs) [40]. The VQCs play roles in both feature extraction and data compression. The components we used for the VQCs are shown in Figure 5.
Here $x_t$ refers to the input at time t, $h_t$ to the hidden state, $c_t$ to the cell state, and $y_t$ to the output. The blocks σ and tanh represent the sigmoid and hyperbolic tangent activation functions, respectively. Finally, ⊗ and ⊕ represent element-wise multiplication and addition. The mathematical functions [40] for each state of the QLSTM model are stated below, where $v_t$ denotes the concatenation of the previous hidden state $h_{t-1}$ and the current input $x_t$:

$f_t = \sigma(VQC_1(v_t))$
$i_t = \sigma(VQC_2(v_t))$
$\tilde{C}_t = \tanh(VQC_3(v_t))$
$c_t = f_t * c_{t-1} + i_t * \tilde{C}_t$
$o_t = \sigma(VQC_4(v_t))$
$h_t = \sigma(VQC_5(o_t * \tanh(c_t)))$
$y_t = VQC_6(o_t * \tanh(c_t))$
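A schematic Python sketch of one QLSTM step that wires six VQCs together as in the equations above; `vqcs` stands for a list of callables such as the PennyLane circuit sketched earlier, and this is a simplified illustration rather than the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def qlstm_step(x_t, h_prev, c_prev, vqcs):
    """One QLSTM step. `vqcs` is a list of six callables, each a variational
    quantum circuit mapping a classical vector to a classical vector."""
    v_t = np.concatenate([h_prev, x_t])           # v_t = [h_{t-1}, x_t]
    f_t = sigmoid(np.array(vqcs[0](v_t)))         # forget gate   (VQC_1)
    i_t = sigmoid(np.array(vqcs[1](v_t)))         # input gate    (VQC_2)
    c_tilde = np.tanh(np.array(vqcs[2](v_t)))     # candidate     (VQC_3)
    c_t = f_t * c_prev + i_t * c_tilde            # new cell state
    o_t = sigmoid(np.array(vqcs[3](v_t)))         # output gate   (VQC_4)
    h_t = sigmoid(np.array(vqcs[4](o_t * np.tanh(c_t))))  # hidden state (VQC_5)
    y_t = np.array(vqcs[5](o_t * np.tanh(c_t)))   # output        (VQC_6)
    return h_t, c_t, y_t
```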
3.4 Evaluation metrics
Precision: Precision measures how many of the functions predicted as vulnerable are actually vulnerable. It can be calculated as follows:

$Precision = \frac{TP}{TP + FP}$   (8)

Here, TP refers to True Positive values and FP refers to False Positive values.
Recall: The metric recall complements precision and is used when the cost of false negatives (FN) is high. In the vulnerability detection classification problem, if the model gives low recall, many vulnerable code samples will be labeled as non-vulnerable; a high recall reduces the number of missed vulnerabilities, at the cost of more false alarms. The recall can be calculated as follows:

$Recall = \frac{TP}{TP + FN}$   (9)
F1 score: The F1 score combines precision and recall and provides an overall accuracy measurement of the model. The value of the F1 score lies between 0 and 1: if the predicted values match the expected values perfectly, the F1 score is 1, and if none of the values match, it is 0. The F1 score can be calculated as follows:

$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$   (10)

Accuracy: Accuracy measures the proportion of correctly classified samples among all samples:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$   (11)

Here, TN refers to True Negative values and FN refers to False Negative values.
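These metrics can be computed directly with scikit-learn; y_test and y_pred below are hypothetical names for the ground-truth and predicted labels.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# y_test: ground-truth labels, y_pred: predicted labels (0 = non-vulnerable, 1 = vulnerable)
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
```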
4 Result and Discussion
From the previous study, we found that the application of quantum models has not
been applied to the software security field. Since, the majority of software
companies face a surge due to software flaws, those require a system that can
provide an accurate result as well as efficiency. Before training the neural net-
works, several criteria need to be followed. One important criterion is feature
analysis. There is a huge chance that a classifier performs poorly due to the lack
of feature analysis techniques. As investigators did not consider the in-depth
feature analysis process in software security field, we have shown a step-by-step
process for extracting the semantic and syntactic features. We developed the
LSTM model with the same number of parameters for both the classical and
quantum versions to get a clear observation. We implemented the classical LSTM
architecture using TensorFlow with 50 hidden units.
It has a softmax layer to convert the output to a single target value $y_t$. The total number of parameters is 123301 in the classical LSTM. In the case of the QLSTM, we used 6 VQCs, as shown in Figure 5. Each VQC consists of 4 qubits with a depth of 2 in each variational layer. Additionally, there are 2 parameters for scaling in the final step. The total number of parameters is 122876. We chose PennyLane as our simulation environment for the quantum circuits. Through our experimental results, we found that the QLSTM learns faster than the classical LSTM with a similar number of parameters. Our comparative analysis between the classical Long Short-Term Memory model and the Quantum Long Short-Term Memory model is illustrated in Table 1 and Table 2.
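A minimal Keras sketch of the classical baseline described above, with 50 LSTM hidden units and a softmax output layer; the embedding configuration, optimizer, and training settings are illustrative assumptions rather than the authors' exact setup.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# The embedding setup is illustrative; to use the GloVe/fastText matrix built earlier, pass
# embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix) and trainable=False.
model = Sequential([
    Embedding(MAX_WORDS, EMBED_DIM),
    LSTM(50),                        # 50 hidden units, as in the paper
    Dense(2, activation="softmax"),  # softmax output over the two classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=30, batch_size=64)
model.summary()                      # prints the total parameter count
```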
5 Conclusion
Quantum computing has recently gained prominence for its prospects in the computation of machine learning algorithms that address challenging problems. This paper conducted a comparative study of Quantum Long Short-Term Memory (QLSTM) and traditional Long Short-Term Memory (LSTM) and analyzed the performance of both models on vulnerable source code. Moreover, we extracted the semantic and syntactic information using state-of-the-art word embedding algorithms such as GloVe and fastText, which yield more accurate results. The QML model was run on the open-source PennyLane simulator due to the limited availability of quantum computers. We implemented machine learning algorithms for sequence modeling tasks, such as natural language processing and vulnerable source code recognition, on a noisy intermediate-scale quantum (NISQ) device. We assessed the models' performance using accuracy and processing-time criteria. According to the experimental findings, the QLSTM with GloVe and fastText embeddings learns vulnerable source
code features noticeably better and operates more quickly than its classical counterpart. Although advances have been made in quantum machine learning over the past few decades, more work is still needed because the current generation of quantum simulators only offers a small number of qubits, making them unsuitable for processing sensitive source code. It is possible that quantum machine learning models running on a large number of qubits will have a significant impact on classification performance and computing time.

Fig. 7: Comparison of the sine-function learning results obtained from a Jupyter notebook using the LSTM + GloVe + fastText model for (a) epoch 1 and (b) epoch 30, and the QLSTM + GloVe + fastText model for (c) epoch 1 and (d) epoch 30.
Acknowledgement
References
32. J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.
33. A. Filonenko, K. Gudkov, A. Lebedev, I. Zagaynov, and N. Orlov, "Fastext: Fast and small text extractor," in 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 4, pp. 49–54, IEEE, 2019.
34. M. Busta, L. Neumann, and J. Matas, "Fastext: Efficient unconstrained scene text detector," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1206–1214, 2015.
35. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
36. J. Schmidhuber, "A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks," Neural Computation, vol. 4, no. 2, pp. 243–248, 1992.
37. H. Song, J. Dai, L. Luo, G. Sheng, and X. Jiang, "Power transformer operating state prediction method based on an LSTM network," Energies, vol. 11, no. 4, p. 914, 2018.
38. A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, "Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets," Nature, vol. 549, no. 7671, pp. 242–246, 2017.
39. S. Sim, P. D. Johnson, and A. Aspuru-Guzik, "Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms," Advanced Quantum Technologies, vol. 2, no. 12, p. 1900070, 2019.
40. S. Y.-C. Chen, S. Yoo, and Y.-L. L. Fang, "Quantum long short-term memory," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8622–8626, IEEE, 2022.
41. V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, "Supervised learning with quantum-enhanced feature spaces," Nature, vol. 567, no. 7747, pp. 209–212, 2019.
42. K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, "Quantum circuit learning," Physical Review A, vol. 98, no. 3, p. 032309, 2018.
43. P.-L. Dallaire-Demers and N. Killoran, "Quantum generative adversarial networks," Physical Review A, vol. 98, no. 1, p. 012324, 2018.
44. S. Y.-C. Chen, C.-H. H. Yang, J. Qi, P.-Y. Chen, X. Ma, and H.-S. Goan, "Variational quantum circuits for deep reinforcement learning," IEEE Access, vol. 8,