Performance Improvements in Software-Defined Vehicular Ad Hoc Networks
by
Dajun Zhang

© Copyright Dajun Zhang, 2020

The undersigned hereby recommends to the Faculty of Graduate and Postdoctoral Affairs acceptance of the dissertation

Carleton University
January, 2020
Abstract
the information received from different distributed controllers. Specifically, we give
consensus steps and theoretical analysis of the system throughput. A dueling deep
Q-learning (DDQL) approach is used in the domain control layer in order to obtain
the best policy for maximizing the system throughput under the effects of malicious
vehicles in vehicular networks.
Thirdly, we propose a novel framework of blockchain-based mobile edge computing
for a future VANET ecosystem (BMEC-FV), in which we have adopted a hierarchical
architecture. In the underlying VANET environment, we propose a trust model to
ensure the security of the communication link between vehicles. Multiple MEC servers calculate the trust between vehicles through computation offloading. Meanwhile, the
blockchain system works as an overlaid system to provide management and control
functions. We aim to optimize the system throughput and the quality of services
(QoS) of the users in the underlaid MEC system. The block size of blockchain nodes, the number of consensus nodes, the reliable features of each vehicle, and the number of blocks produced by each block producer are considered in a joint optimization
problem, which is modeled as a Markov decision process with state space, action
space, and reward function. Simulation results are presented to show the effectiveness
of the proposed BMEC-FV framework.
Acknowledgments
To pen down this section of my dissertation means I have almost arrived at the destination that I had dreamed of every night. This exciting and unbelievable journey started when I wrote my first email to my supervisor, Prof. F. Richard Yu. Thus, I would like to begin this section by thanking him. Without his invaluable support and wonderful supervision, this would have been impossible. His technical insight and ongoing encouragement have been a constant source of motivation. The innumerable discussions with him and his ideas on the research projects have been the most indispensable input for my research. His comments and suggestions on my work not only increased the quality of the research but also provided a source of thought for my career. This work is as much his contribution as mine, for the fact that he has always wholeheartedly supported every decision I have made during this wonderful study. He has better prepared me for the world, and I will always be indebted to him for anything I achieve in my future life.
I would like to thank Dr. Ruizhe Yang, for her valuable comments and suggestions.
I am grateful for the time and patience that she donated towards teaching me. I also
would like to thank Prof. Feng Ye, Prof. Amiya Nayak, Prof. Shichao Liu, and Prof.
Changcheng Huang, for serving on my dissertation committee. Special thanks go to
Dr. Matt Ma, Dr. Chao Qiu, and Dr. Fengxian Guo, who provided me with valuable and constructive suggestions on my work.
Special thanks also go to my friend Yue Shen, who has given me a memorable and often breathtakingly beautiful adventure in Canada. I also want to express my thanks to Di Liu, Zhicai Zhang, and others who have generously offered their friendship in Ottawa and made my life here amazing.
There are no words to express my gratitude to my family, and in particular, my grandmother and my parents. This achievement would not have been possible without their unconditional love, support, and encouragement. Thank you for always believing in me and making things easier, so that I didn't have to worry about anything but my research. Lastly, I want to thank my father, who is an excellent professor and gives me a lot of confidence. I believe he can see my achievement even though we are not in the same country.
Table of Contents
Abstract
Acknowledgments
List of Tables
List of Abbreviations
1 Introduction
1.1 Background
1.1.1 VANETs and Mobile Edge Computing
1.1.2 Deep Reinforcement Learning
1.1.3 Blockchain Technology
1.2 Motivation
1.3 Literature Review
1.3.1 Trust-based VANETs
1.3.2 Reinforcement Learning based VANETs
1.3.3 Blockchain based VANETs and Distributed SDN
1.3.4 MEC based VANETs
1.4 Dissertation Organization and Contributions
1.4.1 List of Publications
3.2 The Throughput Analysis of Block-SDV
3.2.1 Trust Derivation
3.2.2 The Theoretical Analysis of Network Throughput
3.2.3 The Theoretical Throughput Analysis of Blockchain System
3.3 Problem Formulation
3.3.1 System State Space
3.3.2 System Action Vector
3.3.3 Reward Function
3.4 Dueling Deep Q-Learning with Prioritized Experience Replay
3.4.1 Dueling Deep Q-Learning
3.4.2 Prioritized Experience Replay
3.4.3 Dueling Deep Q-Learning with Prioritized Experience Replay Approach
3.5 Simulation Results and Discussions
3.5.1 Simulation Setup
3.5.2 Simulation Results
4.3 Performance Analysis of BMEC-FV
4.3.1 Performance of MEC in VANETs
4.3.2 Performance Analysis of BCs
4.4 Problem Formulation
4.4.1 System State
4.4.2 System Action
4.4.3 Reward Function
4.5 Deep Compressed Neural Network for BMEC-FV
4.5.1 Background of Deep Compressed Neural Network
4.5.2 Dueling Deep Q-Learning Based on the Pruning Method For BMEC-FV
4.5.3 Complexity Analysis
4.6 Simulation Results and Discussions
4.6.1 Simulation Setup
4.6.2 Simulation Results
List of Tables
List of Figures
2.12 Comparison of the ETX delay with different learning rates.
2.13 ETX delay comparison with different data rates.
2.14 The ETX delay comparison with different numbers of vehicles.
3.1 Proposed framework of block-SDV.
3.2 The interaction between domain controllers and blockchain.
3.3 The consensus procedures inside of the blockchain.
3.4 An example of the state transition diagram for the trust features of each blockchain node.
3.5 The workflows of proposed DDRL in block-SDV.
3.6 The TensorFlow graph in TensorBoard using CNN with PER.
3.7 A simulation structure of proposed blockchain technology.
3.8 Training curves tracking the loss of block-SDV under different schemes.
3.9 Training curves tracking the throughput of block-SDV under different schemes.
3.10 Training curves tracking the throughput of block-SDV under different learning rates.
3.11 Training curves tracking the loss of block-SDV under different learning rates.
3.12 The throughput comparison versus the number of controllers.
3.13 The throughput comparison versus the number of consensus nodes.
3.14 The throughput comparison versus the batch size of each block.
4.1 Blockchain-based MEC in VANETs.
4.2 The consensus procedures inside of the blockchain.
4.3 Blockchain-based MEC in VANETs.
4.4 A simulation structure of proposed BMEC-FV.
4.5 Training curves tracking the loss of BMEC-FV under different schemes.
4.6 Training curves tracking the long-term reward of BMEC-FV under different schemes.
4.7 Training curves tracking the long-term reward of BMEC-FV under different learning rates.
4.8 The long-term reward comparison of BMEC-FV versus the average transaction size.
4.9 The long-term reward comparison of BMEC-FV versus the processing request delay.
4.10 The long-term reward comparison of BMEC-FV versus the processing request delay.
4.11 The long-term reward comparison of BMEC-FV versus the different SNR.
List of Abbreviations
DL Deep Learning
MEC Mobile Edge Computing
P2P Peer-to-Peer
Chapter 1
Introduction
1.1 Background
Internet of Vehicles and has achieved consensus among multiple agents to ensure
the security and stability of data transmission. In general, we aim to optimize the
communication security and quality of IoV nodes and improve the performance of the
system architecture through the fusion of multiple technologies.
According to the definition of the European Telecommunications Standards Institute (ETSI), mobile edge computing (MEC) focuses on providing users with IT services and cloud computing capabilities at the edge of the mobile network, close to mobile users, in order to reduce the delay in network operations and service delivery. The MEC architecture is divided into three levels: the system layer, the host layer, and the network layer [5–7]. The system architecture proposed by ETSI shows the functional elements of MEC and the reference points between the functional elements.
MEC sits between the wireless access points and the wired network. Compared with the traditional wireless access network, it offers service localization and close-range deployment, which bring high-bandwidth and low-latency transmission capabilities. In the MEC mode, by sinking network services to the wireless access side, closer to the user, the transmission delay is noticeably reduced and network congestion is significantly relieved. MEC also provides application programming interfaces (APIs) that open basic network capabilities to third parties, enabling them to carry out on-demand customization and interaction based on their business needs.
The VANET technology can sense vehicle behavior and road conditions through
various sensors, improve vehicle safety, reduce the degree of traffic congestion, and also
bring opportunities for value-added services, such as vehicle positioning and finding
parking locations. This technology has not yet reached maturity, and the delay from
the vehicle to the cloud is still between 100 ms and 1 s, which is far from meeting the latency requirements of delay-sensitive vehicular applications.
Reinforcement Learning
The main components of reinforcement learning (RL) are the agent, the environment, and actions. RL is mainly concerned with a series of tasks in which the agent interacts with the environment. Consequently, no matter what the task is, it includes a series of actions, observations, and rewards. A reward arises because the environment changes as the agent interacts with it; the reward represents the degree of that change.
Fig. 1.1 shows the process of the agent and environment interaction. In each
time-step, the agent chooses an action from a given action set. This set can be a
continuous or a discrete action set. The size of the action set strongly influences the difficulty of solving the task.
The goal of RL is to maximize the accumulated reward. In each time-step, the agent selects a new action based on the current observation. Each observation acts as the state of the agent. Consequently, the relationship between states and actions is a mapping: one state corresponds to a single action, or to a probability distribution over actions given by the model.

(Figure 1.1: The interaction between the agent and the environment — at each step the agent observes a state and a reward from the environment and responds with an action.)
Comparison of RL with supervised learning and unsupervised learning:
• Supervised learning is done from an already labeled training set. The characteristics of each sample in the training set can be thought of as a description of the situation, and its label can be considered the correct action that should be performed. However, supervised learning cannot learn from the context of interaction, because obtaining examples of the desired behavior for interaction problems is often impractical. The agent can only learn from its own experience, and the behavior taken in that experience is not necessarily optimal. RL is very appropriate in this case, because RL does not rely on correct behavior for guidance, but instead uses the available training information to evaluate behavior.
RL obtains samples while learning: it updates its own model after obtaining a sample, uses the current model to guide the next action, updates the model again after that action, and iterates until the model converges. In this process, given the current model, the next action is selected to be most beneficial to the improvement of the current model. This involves two very important concepts in RL: exploration and exploitation. Exploration refers to selecting actions that have not been executed before, in order to explore more possibilities; exploitation refers to selecting actions that have already been tried and are known to yield high rewards.
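To make the exploration–exploitation trade-off concrete, the following is a minimal Python sketch of the widely used ε-greedy selection rule (which the later SD-TDQL training also relies on); the Q-values and action indices here are purely illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the highest estimated Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

# Illustrative usage: three candidate actions with estimated Q-values.
print(epsilon_greedy([0.2, 0.8, 0.5], epsilon=0.1))                   # usually prints 1
```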
(Figure 1.2: The model architecture of DQN — input state, deep convolutional neural network, output Q-values.)
Deep Q-network (DQN): Mnih et al. [49] combined the convolutional neural net-
work with Q-learning in traditional RL and proposed a deep Q network model. This
model is used to process visual perception-based control tasks and is a groundbreaking
work in the field of DRL.
The input of the DQN model is a pre-processed image matrix that undergoes a
nonlinear transformation of the convolutional layers and the fully connected layers.
Finally, the Q-value of each action is generated at the output layer. Fig. 1.2 shows
the model architecture of DQN.
Fig. 1.3 depicts the training process of DQN. In order to alleviate the instability of representing the value function with a nonlinear network, DQN makes three main improvements to the traditional Q-learning algorithm.
(Figure 1.3: The training process of DQN. S denotes the state space and r denotes the reward — minibatches (s, a, r, s′) drawn from the experience replay pool train the evaluated Q-network, whose parameters are copied to the target Q-network every N steps.)

• DQN uses the experience replay mechanism during training to obtain samples. At each time step t, the sample obtained from the interaction between the agent and the environment is stored in the replay memory D. During training, a small batch of samples is randomly selected from D at each time step, and the network parameters are updated using a stochastic gradient descent algorithm. When training deep neural networks, it is usually required that samples be independent of each other. This random sampling method greatly reduces the correlation between samples, thus improving the stability of the algorithm.
• DQN clips the reward value and the error term to a limited interval, which ensures that the Q-values and the gradient values stay within a reasonable range and improves the stability of the algorithm. Experiments show that DQN exhibits a competitive level comparable to that of human players when solving complex problems such as Atari 2600 games [1]. When solving various types of visual perception-based DRL tasks, DQN uses the same set of network architecture and hyperparameters. Generally speaking, the DQN method has stronger adaptability and versatility compared with traditional reinforcement learning approaches. (A short sketch of the replay and reward-clipping mechanisms follows this list.)
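As a concrete illustration of the replay memory and reward clipping described above, here is a minimal Python sketch; the capacity, batch size, and clipping range are illustrative rather than the exact values used by DQN in [49].

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded buffer of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are dropped automatically

    def store(self, state, action, reward, next_state):
        reward = max(-1.0, min(1.0, reward))      # clip the reward to [-1, 1]
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between samples.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```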
(Figure: the state input s_input is processed by the network and split into a value function V(s) and an advantage function A(a), which are combined into the output Q-value.)
Blockchain technology [3, 4] uses blockchain data structures to validate and store data, distributed node consensus algorithms to generate and update data, cryptography to secure data transfer and access, and smart contracts to program and manipulate data. It is a new distributed infrastructure and computing paradigm. Briefly speaking, in a blockchain system, the transaction data generated by each participant is packaged into a data block, and the data blocks are then arranged in chronological order to form a chain of data blocks. Every participant holds the same data chain, which cannot be unilaterally falsified. Any modification of the information can only be carried out with the consent of an agreed portion of the participants, and only new information can be added; old information cannot be deleted or modified. This realizes inter-subject information sharing and consistent decision-making, and ensures that the identity of each subject and the transaction information between entities cannot be falsified and are open and transparent. Blocks, accounts, smart contracts, and consensus form the common model of current blockchain systems.
• The change history of the state is recorded by the chain structure, as sketched below.
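The chaining idea can be illustrated with a few lines of Python: each block carries the hash of its predecessor, so altering any historical block invalidates every later link. This is only a sketch of the data structure; consensus, signatures, and smart contracts are omitted, and the transaction contents are invented for illustration.

```python
import hashlib
import json
import time

def make_block(transactions, prev_hash):
    """Package transactions into a block and chain it to the previous block by hash."""
    block = {"timestamp": time.time(), "transactions": transactions, "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

genesis = make_block(["genesis record"], prev_hash="0" * 64)
block_1 = make_block(["vehicle A reports a trust value for vehicle B"], prev_hash=genesis["hash"])

# Tampering with the genesis block changes its hash, so block_1["prev_hash"] no longer
# matches and the modification is detected by every node holding the same chain.
```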
The blockchain is essentially a robust and secure distributed state machine. Typ-
ical technical components include consensus algorithms, P2P communication, cryp-
tography, database technology, and virtual machines. This also constitutes the five
core competencies that are essential to the blockchain:
• Common data: derived from the consensus algorithm, the various entities par-
ticipating in the blockchain automatically reach consensus through the agreed
decision-making mechanism, sharing the same credible data ledger.
• Tamper-proofing and privacy protection: the identity and common information of each subject are secured through cryptographic tools such as public/private key pairs and hash algorithms.
• Smart contract: derived from virtual machine technology, the generated digital
smart contract is written into the blockchain system to drive the execution of
the smart contract through preset trigger conditions.
With the current blockchain system, there are still obvious shortcomings in trans-
action concurrency, data storage capability, versatility, functional completeness, and
ease of use.
Data storage capability In terms of data storage capacity, as the blockchain data
only increases, the need for more data storage continues to increase over time, and
this trend grows even more when dealing with large amounts of data.
At present, typical blockchain systems focus on storing ledger data, and the typical implementation is based on a file system or a simple key-value database. Without a distributed storage design, there is a large gap between data storage capability and actual needs, and more efficient ways of storing big data are required.
1.2 Motivation
The authors of [19] design a distributed SDN flat control plane for OpenFlow, called HyperFlow.
It localizes decision making to an individual controller, thus minimizing the control
plane response time to data plane requests. The Onix architecture for distributed SDN is proposed in [20]. This flat control platform uses a network information base
(NIB) to aggregate and share network-wide views. Hierarchical SDN control archi-
tecture means that the network control plane is vertically partitioned into multiple
layers depending on the requested services. Kandoo [21] is designed as a hierarchical two-layer control plane, which divides the control applications into global and
local ones. Therefore, distributed SDN control plane architectures can mitigate the
issues caused by centralized SDN control architectures such as poor scalability and
performance bottlenecks.
Although distributed SDN control platforms bring many advantages compared with traditional centralized SDN control architectures, they still have an important problem: how to achieve synergy among multiple controllers when they are deployed into SDVs. The existing mechanisms have the following issues: 1) the traditional consensus mechanisms bring extra overheads to each controller; 2) it is difficult for those mechanisms to ensure safety and liveness; 3) the existing schemes
have challenges when the SDN control plane is enlarged to a large-scale scenario.
In order to address the above challenges, we consider using recent advances in
blockchain technology [22]. Blockchain can be seen as a distributed third-party system
to achieve agreement among different nodes without using centralized trust manage-
ment. In the safety and data sharing aspects of connected vehicles’ communications,
distributed SDV is a partially trusted environment. Therefore, we consider using
blockchain technology in distributed SDV. Recently, blockchain technology has at-
tracted the attention of many researchers in a wide range of industries. The main
reason for this is that, with the blockchain technique, applications can be operated
traffic management in VANETs. The trust information used in that work is intended to mitigate such attacks. An efficient VANET framework proposed in [35] depicts a
trust model called the implicit web of trust in VANET (IWOT-V) to evaluate the
trustworthiness of vehicles. Hu et al. [36] introduce a reliable trust-based platoon
service recommendation scheme, which enables the user vehicle to avoid malicious
vehicles in a VANET environment. Tan et al. [37] present a fuzzy logic approach to
establish a trust management system of VANETs. The fuzzy logic approach is used
to evaluate the trust information of vehicles. A special trust management approach
for cognitive radio vehicular ad hoc networks (CR-VANETs) is proposed by Ying et al. [38]. The authors define a novel trust model in order to enhance the security
of data transmission in CR-VANETs. Guleng [39] proposes a trust management ap-
proach for VANETs. The authors use a fuzzy logic-based trust calculation scheme to
evaluate the direct trust of each vehicle. Meanwhile, a reinforcement learning method
is also used in that work to compute each vehicle's indirect trust value. In order to solve the problem of energy stability, Aujla et al. [40] propose a unique conceptual solution using electric vehicles (EVs). The proposed solution deals with the problem of managing the surplus power or power deficit at the charging stations (CSs) by utilizing EVs-as-a-service (EVaaS). In that work, EVaaS not only provides opportunities for the owners of EVs to earn profit but also helps to balance the demand and supply at the CSs. Meanwhile, this approach also uses the SDN paradigm for
enabling faster communication between the entities involved.
As we can see, there is an emerging trend of using trust information to keep VANETs safe. Since most existing works couple the trust information with
system control, the trust management for VANETs is still a big issue for improving
the security of VANETs. Hence, the control plane needs to decouple from the data
plane in VANETs. SDVs aim to manage the VANET dynamically and ensure the
security of VANETs. However, there is a problem for current SDVs: when there are
lots of vehicles communicating with each other in a VANET environment, a massive
amount of data (i.e., trust information) will be generated. Without the extraction
of useful information, the collected data of the SDN controller holds little or no
value. Hence, the reinforcement learning approach appears as a tool to deal with
the massive data generated in a large VANET environment. Reinforcement learning
techniques provide an efficient way to analyze and extract useful information, then
make appropriate decisions.
• The traditional distributed SDN architecture ignores the security and liveness of the overall system. Since the control plane is the backbone of the SDN architecture, a security problem occurring there will bring incalculable consequences. In an open connected-vehicle environment, many cyber-attacks are more likely to affect secure communications. Therefore, a robust and reliable consensus protocol is urgently needed in SDV.
These challenges in distributed SDN have stimulated the need to explore new
consensus protocols in SDV. Inspired by the successful implementation of blockchain
and security key synchronization [54] and data sharing [27] in secure transportation
systems, we believe that blockchain technology may be a potential method to solve
the challenges in distributed SDV. Blockchain is a distributed system that provides
reliable services for a group of nodes that do not fully trust each other. It can be
regarded as a third-party system, and agreements between nodes can be reached
without a central trust agent. It is reliable and can scale with the system size. In addition, given the fact that distributed SDVs operate in
Researchers have proposed many different schemes based on MEC to address the
data transmission of VANETs. In the MEC framework, the MEC server is the core
of the entire system, and the MEC system covering the mobile terminal is composed
of one or more MEC servers. By deploying the MEC server between the radio access
network and the core network, the MEC system will be able to provide end-users with
more efficient, lower latency computing, storage, and communication services on the
wireless network side (near end of the network), and thus can improve the quality of
service (QoS) experience for end-users.
Recently, the MEC approach has been used to improve the QoS in VANETs. Liu
et al. [55] propose an SDN-enabled network architecture assisted by MEC to provide
low-latency and high-reliability communication in VANETs. The proposed SDN-
enabled heterogeneous vehicular network assisted by MEC can provide desired data
rates and reliability in V2X communication simultaneously. Huang et al. [56] propose
an idea and control scheme for offloading vehicular communication traffic in the cel-
lular network to vehicle to vehicle (V2V) paths. Luo et al. [57] propose a multi-place
multi-factor prefetching scheme to meet the rapid topology change and unbalanced
traffic. Zhang et al. [58] propose a cloud-based MEC off-loading framework in vehic-
ular networks. In this framework, they study the effectiveness of the computation
transfer strategies with vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V)
communication modes.
Although many researchers have already done much excellent work on MEC-based VANETs, there are still some problems that need to be solved. MEC aims to provide high-performance, high-bandwidth, high-storage, and low-latency services.
To achieve these technical parameters, the first problem to be solved is how to make
the service dynamic and adaptive. Therefore, in chapter 3, we propose a framework
based on deep Q-learning and blockchain to establish a secure and self-organized
MEC ecosystem for future VANETs.
• Based on the programmable area control plane originated from SDN, we propose
an integrated framework that can enable dynamic orchestration of VANET secu-
rity for communications of connected vehicles. The distributed SDN controllers
in the area control layer have the ability to collect vehicles' trust information, and each area controller can transfer its collected trust information to the
domain control layer.
• In order to achieve the goal of maximizing the system throughput and to ensure
the trust information is not tampered, we propose a blockchain-based consensus
protocol that interacts the domain control layer with the blockchain system.
We aim to use this consensus protocol to securely collect and synchronize the
information received from different distributed controllers. Specifically, we give
consensus steps and theoretical analysis of the system throughput.
• In order to solve the high dynamics of the BMEC-FV system and the time-
varying nature of the system state, we use the technique of a deep compres-
sion neural network to solve this joint optimization problem. Specifically, the
BMEC-FV system uses dueling deep Q-learning to obtain the optimal strate-
gy. The deep convolutional neural network in the BMEC-FV system uses the
pruning compression method to accelerate the convergence of the system.
• Simulation results show the effectiveness of the proposed scheme with different
parameters by comparing with other baseline methods.
• Dajun Zhang, F. Richard Yu, and Ruizhe Yang, “A Machine Learning Approach
for Software-Defined Vehicular Ad Hoc Networks with Trust Management,” in
Proc. IEEE Globecom’18, Abu Dhabi, UAE, Dec. 2018.
• Dajun Zhang, F. Richard Yu, and Ruizhe Yang and Helen Tang, “A Deep
Reinforcement Learning-based Trust Management Scheme for Software-defined
Vehicular Networks,” in Proc. ACM DIVANet’18, Montreal, Canada, Nov.
2018.
• Dajun Zhang, F. Richard Yu, Zhexiong Wei and Azzedine Boukerche, “Trust-
based Secure Routing in Software-defined Vehicular Ad-Hoc Networks”, arXiv
preprint arXiv:1611.04012, Nov. 2016.
IEEE Trans. Cognitive Comm. and Net., vol. 5, no. 4, pp. 1086-1100, Dec.
2019.
The trust evaluation is used to estimate the misbehavior of each vehicle in VANETs. In our model, the trust of each vehicle is derived from its neighbor's packet forward-
ing ratio [63]. In our trust model, the trust value of all vehicles in a link is calculated
by a method of linear aggregation. Trust application including the estimation of link
quality is described in this section.
We consider that there are $N = \{1, 2, ..., n, ..., N\}$ vehicles in a VANET environment, and each vehicle maintains its trust information to establish a connection with
their neighbors. The malicious vehicles will deeply degrade the performance of the
proposed framework, so each vehicle needs to choose a reliable neighbor to establish a
high-quality communication link. Fig. 2.1 shows our proposed SD-TDQL framework.
In this figure, if vehicle N5 is a malicious node, vehicle N1 will select another vehicle
to establish the routing path.
For many proposed trust models in VANETs, direct trust and indirect trust are two important factors. Direct trust is obtained as first-hand trust information from neighbor vehicles. Indirect trust, such as recommendation trust, is second-hand trust information obtained from a third party. However, since indirect trust may lead to additional communication cost for the trust exchange [63], we only consider the direct trust value of each vehicle in order to simplify our trust model.
(Figure 2.1: The proposed SD-TDQL framework — the SDN controller pushes flow rules to RSUs and vehicles; vehicles communicate via inter-vehicle and vehicle-to-roadside links.)
The data transfer path is decided by the control packets in the path discovery process. The
forwarding ratios of the two components determine the trust level of the vehicles in
the VANET environment. Hence, the trust definition of each vehicle is shown below:
$$T_{nn'}(t) = \omega_1 T^{C}_{nn'}(t) + \omega_2 T^{D}_{nn'}(t) \qquad (2.1)$$

where $T^{C}_{nn'}(t)$ denotes the direct trust of the control packets and $T^{D}_{nn'}(t)$ denotes the direct trust of the data packets. The direct trust of vehicle $n'$ for vehicle $n$ is represented by $T_{nn'}(t)$. $\omega_1$ and $\omega_2$ are two weighting factors ($\omega_1, \omega_2 \ge 0$, and $\omega_1 + \omega_2 = 1$) that determine the importance of the two components ($T^{C}_{nn'}(t)$ and $T^{D}_{nn'}(t)$). In our proposed framework, we assume $\omega_1 = \omega_2 = 0.5$, which means that the control packets' forwarding ratio and the data packets' forwarding ratio are both considered to determine the trust value of each vehicle.
The trust value $T^{C}_{nn'}(t)$ is determined by the control packet exchange between two neighbor nodes in a VANET communication link. The trust computation for the control packets can be defined as:

$$T^{C}_{nn'}(t) = \frac{c^{C}_{nn'}(t)}{f^{C}_{nn'}(t)}, \quad t \le W \qquad (2.2)$$

where $f^{C}_{nn'}(t)$ represents the number of control packets that should have been sent from vehicle $n$ to its neighbor $n'$, $c^{C}_{nn'}(t)$ represents the number of control packets that the vehicle correctly forwards to its neighbor node $n'$ at time slot $t$, and $W$ represents the width of the recent time window.
Similarly, the trust computation for the data packets is:

$$T^{D}_{nn'}(t) = \frac{c^{D}_{nn'}(t)}{f^{D}_{nn'}(t)}, \quad t \le W \qquad (2.3)$$

where $f^{D}_{nn'}(t)$ represents the number of data packets that should have been sent from vehicle $n$ to its neighbor $n'$ at time slot $t$.
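A small Python sketch of the direct-trust computation in (2.1)–(2.3), assuming the per-window packet counts $c^{C}_{nn'}(t)$, $f^{C}_{nn'}(t)$, $c^{D}_{nn'}(t)$, and $f^{D}_{nn'}(t)$ are already available; the counts below are invented for illustration.

```python
def forwarding_ratio(forwarded, expected):
    """Eqs. (2.2)/(2.3): correctly forwarded packets over packets that should have been sent."""
    return forwarded / expected if expected > 0 else 0.0

def direct_trust(c_ctrl, f_ctrl, c_data, f_data, w1=0.5, w2=0.5):
    """Eq. (2.1): weighted combination of control-packet and data-packet trust."""
    return w1 * forwarding_ratio(c_ctrl, f_ctrl) + w2 * forwarding_ratio(c_data, f_data)

# Example: 9 of 10 control packets and 45 of 50 data packets forwarded in the current window.
print(direct_trust(9, 10, 45, 50))   # 0.5 * 0.9 + 0.5 * 0.9 = 0.9
```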
(Figure 2.2: The DRL agent (SDN controller) interacting with the software-defined VANET — state: current vehicle position and forwarding ratio; action: selection of the next-hop node; reward: path trust value; Q-value: cumulative path trust value.)
To minimize the risk of packet transmission failure, nodes should interact with trusted
nodes whose trust value is above the threshold.
The sender puts itself in promiscuous mode [64] after sending any data packet
to listen for retransmissions from the forwarding node. Using this method, the node
can know whether the data packets that have been sent to the neighbors are actually
forwarded. Direct trust values can also be shared between neighbors using higher
layers (such as the Reputation Exchange Protocol [65]).
According to [66], several modules are deployed in our proposed SD-TDQL frame-
work. As shown in Fig. 2.3, the most important modules in SD-TDQL include a vehi-
cles’ trust information module, a storage module, a transaction management module,
and a learning module. These modules are responsible for information processing in
different layers, respectively. The vehicles' trust information module is used to collect
the trust information from each vehicle in the device layer. The storage module can
be utilized to store each vehicle and physical link information in the device layer.
The storage module aims to collect the global network link and vehicle information
coming from the device layer. The learning module aims to make some decisions, i.e.,
which vehicle can be selected as the trusted neighbor.
SDN Controller
Link information Module: This module collects the link information coming from
the device layer. The module aims to collect the information of data packets coming
from the communication of connected vehicles. Specifically, when the data packets
reach from one vehicle to its neighbor, the packets are encapsulated into Packet-In messages and transferred to the control layer, and then the controller decapsulates the data packets and gets the link information.
Topology Management Module: This module is used to manage the topology in-
formation of vehicles received from the link information module. It is responsible for adding, updating, and deleting topology information.
Routing Information Module: The routing information module is used to collect
routing information that is decided by the communications of connected vehicles in
the device layer. The routing information will be encapsulated and sent to the SDN
controller through the OpenFlow based module as shown in Fig. 2.3.
Vehicles’ Trust Information Module: This module is used to collect the trust
information from each area of the device layer. The trust definition is introduced in
this section. After collecting those values, the SDN controller will send the trust information to the learning module for the further ETX delay calculation.
Storage Module: The storage module can be utilized to store each vehicle and
physical link information in the device layer. Meanwhile, it also needs to store the
deep neural network parameters from the learning module.
Learning Module: The learning module, using a deep Q-learning approach, aims to optimize the ETX delay of SD-TDQL. The vehicles' trust information module
receives the trust information from the device management module and area topology
management module. The learning module interacts with the device layer and aims
to make some decisions, i.e., which vehicle can be selected as the trusted neighbor.
Expected Transmission Count
According to [68], the expected transmission count (ETX) of a link is the number of data transmissions and retransmissions required to send a packet over the link. The calculation of the ETX for a route uses the forward and reverse delivery ratios. Because of the signal attenuation and fading of wireless links, we endeavor to make the forwarding ratio more reliable (the trust of each vehicle) and to minimize packet retransmissions and losses. In SD-TDQL, the ETX of a route is estimated through the interaction between the current forwarding vehicle and its neighbors. The forwarding ratio of each vehicle, $T_{nn'}(t)$, is the trustworthiness of that vehicle at time slot $t$; the reverse delivery ratio, $T^{r}_{nn'}(t)$, is the probability of receiving an acknowledgement (ACK).
For calculating the ETX between the current forwarding vehicle $n$ and its candidate neighbor $n'$, vehicle $n'$ calculates the trust value and sends it to vehicle $n$, which calculates the reverse delivery ratio $T^{r}_{nn'}(t)$. In SD-TDQL, each forwarding vehicle estimates $T^{r}_{nn'}(t)$ using the Hello packet delivery ratio. Hence, $T^{r}_{nn'}(t)$ is defined as [69]:

$$T^{r}_{nn'}(t) = \frac{count(t - w,\, t)}{w / \tau} \qquad (2.4)$$
where count(t − w, t) is the number of Hello packets received by the vehicle n so far
during the time window w. w/τ is the number of Hello packets that should have been
received at that time. The reason why vehicle $n$ can detect the failed transmission of data packets is that the number of lost and successfully received data packets at vehicle $n'$ mainly depends on the packet sequence number of each transmitted Hello packet.
The ETX of the link between vehicle $n$ and its neighbor $n'$ in SD-TDQL is then:

$$ETX_{SD\text{-}TDQL} = \frac{1}{T_{nn'}(t) \times T^{r}_{nn'}(t)} \qquad (2.5)$$
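The following short Python sketch puts (2.4) and (2.5) together, assuming the trust value $T_{nn'}(t)$ and the Hello-packet counts are given; the window length, Hello interval, and numbers are illustrative.

```python
def reverse_delivery_ratio(hello_received, window, hello_interval):
    """Eq. (2.4): Hello packets received in the window over the number expected (w / tau)."""
    expected = window / hello_interval
    return hello_received / expected if expected > 0 else 0.0

def etx(trust, rev_ratio):
    """Eq. (2.5): expected transmission count of the link from vehicle n to neighbor n'."""
    product = trust * rev_ratio
    return float("inf") if product == 0 else 1.0 / product

# Example: trust 0.9 and 8 Hello packets received out of w / tau = 10 expected.
print(etx(0.9, reverse_delivery_ratio(8, window=10.0, hello_interval=1.0)))   # ~1.39
```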
State: The SDN controller (learning agent) interacts with the device layer to
sense the system state space that includes vehicle trust information and vehicle re-
verse delivery ratio. We collect the trust information and the reverse delivery ratio of each vehicle into two random variables, $\gamma_n(t)$ and $\delta_n(t)$. In order to
enhance the communication link quality, the SDN controller interacts with the net-
work environment to collect the system state space S(t) at time slot t. Accordingly,
the trust feature of each vehicle and reverse delivery information are collected by the
SDN controller. Hence, the system state space is defined as the following matrix:
$$S(t) = \begin{bmatrix} \gamma_1(t) & \gamma_2(t) & \gamma_3(t) & \cdots & \gamma_n(t) & \cdots & \gamma_N(t) \\ \delta_1(t) & \delta_2(t) & \delta_3(t) & \cdots & \delta_n(t) & \cdots & \delta_N(t) \end{bmatrix} \qquad (2.6)$$
Specifically, the trust feature and reverse delivery ratio of vehicles in SD-TDQL always keep changing because of the channel instability. In our proposed framework, there are $M = \{1, ..., M\}$ states for every node's trust feature. Let $p^{mm'}_{\gamma_n} = p\{\gamma_n(t+1) = \gamma_n^{m'} \,|\, \gamma_n(t) = \gamma_n^{m}\}$, $m, m' = 1, ..., M$, represent the transition probability of the trust feature changing from state $m$ (at the current time slot $t$) to state $m'$ (at the next time slot $t+1$). Hence, the state transition probability matrix $p_{\gamma_n}$ is set to be:

$$p_{\gamma_n}(t) = \begin{bmatrix} p^{12}_{\gamma_1}(t) & p^{13}_{\gamma_1}(t) & p^{14}_{\gamma_1}(t) & \cdots & p^{1M}_{\gamma_1}(t) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ p^{12}_{\gamma_N}(t) & p^{13}_{\gamma_N}(t) & p^{14}_{\gamma_N}(t) & \cdots & p^{1M}_{\gamma_N}(t) \end{bmatrix} \qquad (2.7)$$
Meanwhile, there are K = {1, ..., K} states for every node’s reverse delivery ratio.
Therefore, the state transition probability of pδn is set to be:
$$p_{\delta_n}(t) = \begin{bmatrix} p^{12}_{\delta_1}(t) & p^{13}_{\delta_1}(t) & p^{14}_{\delta_1}(t) & \cdots & p^{1K}_{\delta_1}(t) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ p^{12}_{\delta_N}(t) & p^{13}_{\delta_N}(t) & p^{14}_{\delta_N}(t) & \cdots & p^{1K}_{\delta_N}(t) \end{bmatrix} \qquad (2.8)$$
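To illustrate how a vehicle's trust feature evolves between time slots according to a row-stochastic matrix such as (2.7), the following NumPy sketch samples a short trajectory; the 3-state transition matrix and the random seed are invented for illustration.

```python
import numpy as np

# Illustrative transition matrix for M = 3 trust states of a single vehicle:
# row m gives the probabilities of moving to each state m' in the next time slot.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]])

rng = np.random.default_rng(0)
state = 0                                    # current trust state m
for t in range(5):
    state = rng.choice(len(P), p=P[state])   # sample the next state m' from row m
    print(f"time slot {t + 1}: trust state {state}")
```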
Action: The learning agent in the control layer needs to decide which vehicle
is a reliable node that can be selected as a trusted neighbor. Therefore, the system
action space is:
$$A(t) = \left\{A^{N}_{\gamma}(t),\; A^{N}_{\delta}(t)\right\} \qquad (2.9)$$

where $A^{N}_{\gamma}(t)$ and $A^{N}_{\delta}(t)$ represent:

1) $A^{N}_{\gamma}(t) = \{a^{1}_{\gamma}(t), a^{2}_{\gamma}(t), ..., a^{n}_{\gamma}(t), ..., a^{N}_{\gamma}(t)\}$, which indicates whether vehicle $n$ is a reliable neighbor. The value of $a^{n}_{\gamma}(t)$ satisfies $a^{n}_{\gamma}(t) \in \{0, 1\}$, where $a^{n}_{\gamma}(t) = 1$ denotes that vehicle $n$ is the most trusted neighbor vehicle; otherwise, it is not a trusted neighbor.

2) $A^{N}_{\delta}(t) = \{a^{1}_{\delta}(t), a^{2}_{\delta}(t), ..., a^{n}_{\delta}(t), ..., a^{N}_{\delta}(t)\}$, which indicates whether vehicle $n$ receives the Hello packets. For example, if $a^{n}_{\delta}(t) = 1$, it means that vehicle $n$ has received the ACK packets; otherwise, the ACK message has failed to reach vehicle $n$.

(Figure 2.4: Timeline of agent events — observation, state, action, reward, and training the DQN: at each time slot the agent makes an observation, chooses a vehicle as its neighbor or next hop, and sends the routing information.)
Reward: In SD-TDQL, the reward is decided by the trust value of each vehicle and the reverse delivery ratio in the VANET environment. In the trust model, the ETX reward of route $V$ at time slot $t$ is defined as:

$$r^{ETX}_{V}(t) = \sum_{l=1}^{L-2} \frac{1}{T_{l,l+1}(t) \times T^{r}_{l,l+1}(t)} \qquad (2.10)$$

where $T_{l,l+1}(t) = \prod_{l=1}^{L-2} \gamma_l(t)\, a^{l}_{\gamma}(t)$ and $T^{r}_{l,l+1}(t) = \sum_{l=1}^{L-2} \delta_l(t)\, a^{l}_{\delta}(t)$.
In route $V$, the learning agent gets $r^{ETX}_{V}(t)$ (the immediate reward) in time slot $t$. The SDN controller aims to optimize the long-term ETX delay. Hence, the cumulative reward of the ETX delay can be written as:

$$r^{long}_{ETX} = \max\; \mathbb{E}\!\left[\sum_{t=0}^{T-1} \beta^{t}\, r^{ETX}_{V}(t)\right] \qquad (2.11)$$
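The discounted accumulation in (2.11) is straightforward to compute once the per-slot rewards $r^{ETX}_{V}(t)$ are known; the reward sequence and discount factor below are invented for illustration.

```python
def discounted_return(rewards, beta=0.9):
    """Sum of beta^t * r(t) over one episode, as accumulated in Eq. (2.11)."""
    return sum((beta ** t) * r for t, r in enumerate(rewards))

# Illustrative per-time-slot ETX rewards r_V^{ETX}(t) for one route V.
print(discounted_return([1.2, 1.1, 1.4, 1.3], beta=0.9))   # 1.2 + 0.99 + 1.134 + 0.9477
```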
Several challenges arise in this problem formulation.
1) In SD-TDQL, the agent chooses an action related only to the current time slot. For example, when the agent chooses a vehicle as a trusted neighbor, the vehicle's trust feature may change to another trust value in the next time slot according to the transition probability. However, the selected action may not be changed because of the threshold $\xi$. Hence, the traditional optimization methods that consider the relationship between state space and action space are not suitable.
2) Considering the trust feature and reverse delivery ratio of each vehicle, the proposed SD-TDQL is high-dimensional and highly dynamic because of the transition probabilities. Traditional methods are unable to address this optimization problem.
3) In our proposed scheme, the agent aims to optimize the long-term ETX delay by step-by-step control. The SDN controller collects the state from the VANET environment at time $t$, and the next state at time slot $t+1$ will be influenced by the current state. Hence, traditional methods, which only consider the current state, are not suitable for this joint optimization problem.
Therefore, we consider using a deep Q-learning approach to obtain the optimal ETX delay policy in the next section.
In this section, we introduce the deep Q-learning method to find the optimal communication link quality policy $\pi^{*}$. Experience replay and the target Q-network are designed to reduce the correlation of the training data.
2.3.1 Q-Learning
In the Q-learning model, there are usually two entities: an agent and an environ-
ment. The interaction between the two entities is as follows: under a state s(t) of
the environment, the agent takes the action a(t), gets the reward r(t) and enters the
next state s(t + 1). The Q-learning method usually has the following characteristics: 1) different actions produce different rewards; 2) the reward is delayed; 3) the reward of an action depends on the current state. The core of Q-learning is the Q-table, in
which the rows and columns of the Q-table represent the values of the states and the
actions, respectively. The Q-learning model enables the agent to obtain an optimal
policy π, which maps with the system state space and action space, to maximize the
long-term reward.
According to [70], the state-value function $V^{\pi}(s)$ and the action-value function $Q^{\pi}(s,a)$ are the two main components of the Q-learning method. Specifically, $V^{\pi}(s)$ measures the expected total reward that can be obtained under the current policy in the current state $s$:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=1}^{\infty} \lambda^{k}\, r_{t+k+1} \;\middle|\; s_t = s\right] \qquad (2.12)$$

where $r_{t+k+1}$ denotes the immediate reward at time slot $t+k+1$, and $\lambda$ is the discount factor. Here, the discount factor means that current feedback is more important than historical feedback.
Meanwhile, the action-value function $Q^{\pi}(s,a)$ indicates the expected total reward obtained after action $a$ is performed in the current state $s$:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=1}^{\infty} \lambda^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\, a_t = a\right] \qquad (2.13)$$

The Q-value is then updated iteratively as:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[\, r + \lambda \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s, a)\,\right] \qquad (2.14)$$

where $\alpha \in (0, 1]$ is the learning rate. Each updated $Q(s, a)$ is stored in the Q-table.
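A minimal tabular sketch of the update rule (2.14); the numbers of states and actions, the hyperparameters, and the sample transition are invented for illustration.

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))   # the Q-table: one entry per (state, action) pair
alpha, lam = 0.1, 0.9                 # learning rate and discount factor

def q_update(s, a, r, s_next):
    """One application of Eq. (2.14)."""
    Q[s, a] += alpha * (r + lam * np.max(Q[s_next]) - Q[s, a])

# Illustrative transition: in state 0, action 1 yields reward 1.0 and leads to state 2.
q_update(0, 1, 1.0, 2)
print(Q[0, 1])   # 0.1 after the first update
```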
There is a problem with Q-learning: the states in a real situation may be infinite, so the Q-table would also be infinite. Meanwhile, Q-learning still has some instability, whose causes are described in [48]; thus, we can use a deep
neural network (DNN) to represent the Q-value. The emergence of deep learning
provides a method to solve the challenges of traditional Q-learning. The core idea
of deep Q-learning is that the deep neural network enables the agent to get the low-
dimensional features from high-dimensional input data by using weights and bias
of the deep neural network. This feature of the DNN attracts many researchers to
approximate $Q(s,a)$ by using a DNN instead of the traditional Q-table. For example, $Q(s,a)$ can be approximated by $Q(s,a,\omega)$, where $\omega$ represents the weights and biases of the DNN.
Meanwhile, DQL has two important improvements in order to resolve the correlation between states, which would cause instability in Q-learning: experience replay and the target Q-network [49]. The experience replay mechanism stores the data in a replay memory, and each sample is a quadruple. During training, the stored samples are randomly drawn by the experience replay mechanism, which can remove the correlation between the samples to a certain extent, thereby improving the
convergence speed of DNN. Moreover, deep Q-learning makes another improvement,
which is the target Q-network. That is, at the initial time step, the parameters of the
47
evaluation network are assigned to target Q-network, and then the evaluation network
continues to be trained in order to update the parameters, while the parameters of
target Q-network are fixed. After a period of training, the agent assigns the param-
eters of the evaluation network to target Q-network. These two features enable the
training process to be more stable than the traditional Q-learning.
In DQL, the evaluation network is designed to minimize the loss function $L(\omega)$, which can be written as:

$$L(\omega) = \left[\, r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, \omega^{-}) - Q(s_t, a_t, \omega)\,\right]^{2} \qquad (2.15)$$

where $\omega^{-}$ represents the weights and biases of the target Q-network, and $\omega$ represents the weights and biases of the evaluation Q-network.
In this chapter, we apply deep Q-learning for the following reasons: 1) In SD-TDQL, the state of the connected vehicle nodes changes dynamically with time, so traditional static optimization algorithms are not suitable for the scenario proposed in this chapter, whereas deep Q-learning, based on a Markov decision process, can effectively describe the dynamic change of the status of the connected vehicle nodes. 2) Because the status of the connected vehicle nodes changes with time, the dimension of the input data is high, and traditional optimization algorithms have difficulty handling such high-dimensional input data. In contrast, deep Q-learning can automatically learn abstract representations of large-scale input data, and use these representations as the basis for RL to optimize the problem-solving strategy.
In this chapter, the SDN controller aims to obtain an optimal ETX delay policy for the secure communication link from the source vehicle to the destination vehicle.
In general, each vehicle has a set of actions that can be chosen from. As we described before, the three-tuple $\{S(t), A(t), r^{long}_{ETX}\}$ has already been defined for the system's states, actions, and rewards. For example, the deep Q-network feeds back the optimal action $\arg\max Q^{\pi}(S(t), A(t))$ to the sender. The SDN controller will reach a new state at the next time slot after it executes a selected action, and the controller can obtain the reward value according to the reward function described in the previous section.
The reward value is an important factor for the proposed SD-TDQL since it is
directly related to the system state space and action space. Relatively speaking,
the lower the value of the system state space, the lower the probability that the
corresponding action is selected.
In order to remove correlations inside of the deep Q-network, each agent’s state-
action pair of each time slot is stored in the experience replay memory D. In the
Q-network, the agent randomly samples from D, and the parameter ω is updated
at every time instant. The protocol samples experiences from the memory D and
applies the loss function $L(\omega) = \left[\, y - Q(S(t), A(t), \omega)\,\right]^{2}$ to update the parameters. Meanwhile, for the target Q-value $y = r^{ETX}_{V}(t) + \gamma \max_{A(t+1)} Q(S(t+1), A(t+1), \omega^{-})$, the evaluation network assigns its coefficients $\omega$ to the target Q-network as the coefficients $\omega^{-}$ periodically. The $\epsilon$-greedy mechanism is used to combine the exploration and exploitation procedures, where $\epsilon$ is usually a small value used as the probability of a random action. We can change the value of $\epsilon$ to obtain different exploration and exploitation ratios. Algorithm 1 shows the training and updating procedure for the SD-TDQL framework.
In SD-TDQL, the optimal policy is achieved by the link quality estimation using the trained deep Q-network.
(Algorithm 1, continued)
    Sample a random minibatch of experiences from D.
    Compute the target Q-value: $y = r^{ETX}_{V}(t) + \gamma \max_{A(t+1)} Q(S(t+1), A(t+1), \omega^{-})$.
    Perform a gradient descent step on the loss function $L(\omega)$ with respect to the weight factor $\omega$.
    Every C steps, copy the weights into the target network periodically ($\omega^{-} \leftarrow \omega$) and update the target deep network.
  End For
End For
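To make the update step of Algorithm 1 concrete, the following is a condensed Python sketch that computes the target value with the target network, performs one gradient step on the squared loss, and copies ω into ω− every C steps. It uses TensorFlow 2 eager-mode APIs rather than the TensorFlow 1.13 toolchain reported in the simulation setup, and the fully connected network, shapes, and hyperparameters are illustrative assumptions rather than the SD-TDQL CNN and settings.

```python
import numpy as np
import tensorflow as tf

state_dim, n_actions = 16, 3
gamma, copy_every = 0.9, 200

def build_qnet():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(n_actions)])

eval_net, target_net = build_qnet(), build_qnet()
target_net.set_weights(eval_net.get_weights())                 # initialise omega^- = omega
optimizer = tf.keras.optimizers.Adam(1e-4)

def train_step(states, actions, rewards, next_states, step):
    # Target value y = r + gamma * max_a' Q(s', a'; omega^-) from the frozen target network.
    y = rewards + gamma * tf.reduce_max(target_net(next_states), axis=1)
    with tf.GradientTape() as tape:
        q = tf.reduce_sum(eval_net(states) * tf.one_hot(actions, n_actions), axis=1)
        loss = tf.reduce_mean(tf.square(y - q))                # squared error between y and Q
    grads = tape.gradient(loss, eval_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, eval_net.trainable_variables))
    if step % copy_every == 0:                                 # every C steps: omega^- <- omega
        target_net.set_weights(eval_net.get_weights())
    return float(loss)

# Illustrative minibatch of 32 random transitions.
batch = 32
loss = train_step(np.random.rand(batch, state_dim).astype("float32"),
                  np.random.randint(n_actions, size=batch),
                  np.random.rand(batch).astype("float32"),
                  np.random.rand(batch, state_dim).astype("float32"),
                  step=0)
```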
• SDN controller : SDN controller is the logical central intelligence of our pro-
posed SD-TDQL. The controller controls all of the behaviors of VANET nodes
in the entire network environment. It consists of all vehicles’ reverse delivery
information and trust values. For each node in the VANET environment, the
corresponding next-hop node is based on the selection policy that is decided
by the SDN controller. As shown in Fig. 2.1, the SDN controller controls all
the actions of underlying SDN-enabled vehicles and RSUs. Specifically, all the
actions that each SDN-enabled vehicle performs are explicitly defined by the
SDN controller, which will push down all the flow rules on how to treat the
traffic. For example, if the node trust is higher than the threshold ξ (trusted
neighbors), the selection of the controller is decided by applying an optimal
policy to the Q-value for the trusted neighbors. Finally, the agent obtains the
optimal Q-value and learns the optimal link quality policy accordingly.
• SDN-enabled vehicles: The vehicles in the device layer are controlled by the
agent of SD-TDQL. These vehicles receive control information from the agent
to perform the selected action of the SDN controller.
(Figure 2.5: The network architecture of CNN for SD-TDQL training — the network input $s_{input}$ passes through a first and a second convolution layer, each followed by max pooling, then hidden fully connected layers, and finally the output.)
In the path discovery process, each vehicle first calculates the control packet trust value and reverse delivery ratio, and then the
information will be transferred to the SDN controller through the flow rules (Packet-
In messages). The SDN controller stores those values in the storage module. After
finishing the path discovery process, the agent aims to decide the best communication
link to establish a secure data transfer path. In this phase, each vehicle calculates the
trust values of data packets and reverse delivery ratio. Specifically, each node’s trust
value and reverse delivery ratio have the possibility of changing according to pγn and
pδn in the data forwarding process, so the optimal link selected by the SDN controller
may become an unreliable communication link. Therefore, a new optimal policy needs to be decided by the agent according to the new system state space and action space in the VANET environment.
DQN architecture: The DQN architecture using a convolutional neural network (CNN) is established as shown in Fig. 2.5, where the input image is a matrix of the system state space and the final network output is $Q(S(t), A(t), \omega)$. $S_{input}$ is fed to the first convolution layer, which convolves the input image with 32 filters, a kernel size of $2 \times 2$, a stride of 1, and the 'same' padding method. The output of the first convolution layer is convolved by the second convolution layer with 48 filters and the same stride size and padding method. The rectified linear unit (ReLU) is used as the activation function for the two convolution layers. Moreover, two max-pooling layers are connected to the first and second convolution layers. Two fully connected layers of 512 units each are connected to the second max-pooling layer. Finally, the output layer of the proposed DQN outputs a final Q-value.
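A hedged tf.keras sketch of the CNN just described — two convolution layers (32 and 48 filters, 2 × 2 kernels, stride 1, 'same' padding, ReLU), each followed by max pooling, two 512-unit fully connected layers, and an output layer of Q-values. The 12 × 12 × 1 input shape and the number of actions are assumptions for illustration, since the chapter does not pin them down at this point.

```python
import tensorflow as tf

def build_cnn_qnet(input_shape=(12, 12, 1), n_actions=3):
    """CNN that maps the state image S_input to Q(S(t), A(t), omega)."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, kernel_size=2, strides=1, padding="same",
                               activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(pool_size=2),
        tf.keras.layers.Conv2D(48, kernel_size=2, strides=1, padding="same",
                               activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(n_actions)])        # one Q-value per candidate action

model = build_cnn_qnet()
model.summary()
```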
DQN training: The training algorithm is illustrated in Algorithm 1, and the detailed process of experience replay is shown in Fig. 2.7. In each time step $t$, the agent observes the state-action pair $E_t = \{S(t), A(t), r^{ETX}_{V}(t), S(t+1)\}$ from the VANET environment and stores $E_t$ in the replay memory $D = \{E_1, E_2, E_3, ..., E_t\}$. The learning agent randomly samples from $D$ to form the input image, which is trained by the proposed DQN architecture. Specifically, the experience replay memory keeps updating because new state-action pairs coming from the environment are stored in the memory, and the oldest state-action pairs are discarded because of the finite capacity of the replay memory. The SDN controller aims to obtain the optimal Q-value $Q^{\pi^{*}}(S(t), A(t))$ by training on data including the input image $S_{input}$. Simultaneously, the target Q-value is defined as $y = r^{ETX}_{V}(t) + \gamma \max_{A(t+1)} Q(S(t+1), A(t+1), \omega^{-})$. After the agent finishes the training, it obtains an optimal estimated Q-value and the best communication link quality policy.
Consequently, our proposed scheme obtains the best ETX delay policy of the communication link for the connected vehicles' communication. The simulation results show that the proposed framework performs better than the existing approach, and they are discussed in the following section.
(Figure: The TensorFlow computation graph of the proposed DQN in TensorBoard — the evaluation network, target network, loss function, gradients, and Adam optimizer.)
(Figure 2.7: The experience replay process — transitions $(s_t, a_t, r_V(t), s_{t+1})$ stored in the experience replay memory $D$ are randomly sampled into minibatches; the evaluation network (CNN) outputs $Q(s_t, a_t, \omega)$, the target network outputs $\max Q(s_{t+1}, a_{t+1}, \omega^{-})$, and the weights $\omega$ are updated.)
In our simulation, we use a computer with one Nvidia GTX 1080 Ti GPU. The CPU is an Intel(R) Core(TM) i7-6800K with 32 GB of memory. The simulation tools are TensorFlow 1.13.0 with Python 3.7.2 on Ubuntu 18.04 LTS and MATLAB R2017b on a Windows 10 64-bit operating system with an x64-based processor.
The proposed SD-TDQL framework is compared with an existing scheme, which
mainly focuses on trust-based software-defined networking for VANETs. Specifically,
the concept of software-defined networking is considered in this scheme, but the scheme cannot deal with a situation in which a trusted vehicle in the device layer becomes an untrusted node. In other words, if any vehicle in the network becomes
an untrusted node, the controller must restart the routing process.
In our simulations, we deploy different numbers of SDN-enabled vehicles that are randomly distributed in a pre-determined VANET environment. We assume that a vehicle's state can be trusted (T_{nn'}(t) ≥ ξ) or malicious (T_{nn'}(t) < ξ). Trusted vehicles can be selected as trusted neighbor nodes in order to establish the best ET X delay communication link from the source to the destination. Conversely, malicious vehicles are treated as untrusted vehicles, which will badly interfere with the network performance.
Parameter                         Value
Batch size                        32
Experience replay buffer size     2000
Learning rate                     0.00001
Discount factor                   0.9
Data rates                        1 Mbps, 2 Mbps, 5.5 Mbps, and 11 Mbps
Total training steps              40000
Number of nodes                   8, 12, 20, 24, 28, and 32
Fig. 2.8 shows the relationship between the training episodes and loss function
under different schemes. In our simulation, the loss function L(ω) is used to estimate
the degree of inconsistency between the estimated Q-value of the SD-TDQL and the target Q-value.

[Figure 2.8: Loss versus training episodes for the SD-TDRL framework with CNN and with 6, 8, and 10 hidden layers.]

As shown in Fig. 2.8, we can conclude that the traditional neural
network with 6 hidden layers has the highest loss among the four network structures. Moreover, it does not converge within the pre-determined episodes, which means that the learning agent does not obtain the optimal communication link policy in that time. When the number of hidden layers increases to 8 and 10, the learning agent successfully obtains the optimal policy because the loss function L(ω) converges within the pre-determined episodes. From this figure, we also can see that the best performance is obtained by the DQN architecture using CNN: its convergence is the best and its loss is the lowest among the four network structures, which reflects that the SD-TDQL has the best robustness.
Fig. 2.9 and Fig. 2.10 show the convergence comparison using different DQN archi-
tectures with distinct learning rates.

[Figure 2.9: Loss versus training episodes for the CNN-based architecture with different learning rates.]

The learning rate is an important hyperparameter that controls the speed at which we adjust the weights of the neural network based on
the loss gradient. The learning rate directly affects how fast the model can converge
to a local minimum (i.e., to achieve the best accuracy). In general, the greater the
learning rate, the faster the neural network learns the optimal policy. If the learning rate is too small, the network is likely to fall into a local optimum. However, if the learning rate is too large, the loss stops falling and repeatedly oscillates around a certain value. As shown in these two figures, when the learning rate is equal to 10^−6, the training curve does not converge within the given episodes and exhibits a high degree of oscillation. When the learning rate is equal to 0.00001, SD-TDQL has the best convergence performance. Meanwhile, the loss of SD-TDQL using CNN is much lower than that of the DQN architecture using the ordinary neural network. Therefore, we choose
CNN as our deep Q-network to train the SDN controller.
[Figure 2.10: Loss versus training steps for the CNN-based architecture with different learning rates.]

Fig. 2.11 shows the comparison of the ET X delay of four schemes with different network architectures. From the figure, we can see that the ET X delay when using
the CNN architecture has the best performance. This is because the CNN architecture
has better training efficiency compared with the other schemes. Meanwhile, the blue curve is more stable than those of the other three schemes. Moreover, the agent is in a stage of continuous learning at the beginning of the training process, so the four training curves keep rising for a certain number of episodes. As the training episodes increase, the reward value becomes stable, which means that the SDN controller has learned an optimal policy for the ET X delay.
Fig. 2.12 shows the convergence comparison of the ET X delay using the CNN architecture with different learning rates. From this figure, when the learning rates are 0.00001 and 0.000001, the training curves fluctuate considerably, which causes them to miss the global optimum. Comparing the yellow and purple curves, although the yellow
curve has a faster convergence speed, it is not very stable after convergence compared
with the purple curve. Therefore, we choose the learning rate of 1e−2 because of its acceptable convergence speed and stability.

[Figure 2.11: ET X reward versus training episodes for the four network architectures.]

[Figure 2.12: ET X reward versus training episodes with learning rates of 0.000005, 0.00005, 0.0001, and 0.001.]
As the number of vehicles grows, the ET X delay of all schemes increases. However, the proposed schemes are still better than the existing scheme because of the trust feature of each vehicle. This feature enables the link quality of each proposed scheme to reach a higher level than that of the existing scheme.
[Figure: ET X delay versus the data rate (1, 2, 5.5, and 11 Mbps).]

[Figure: ET X delay versus the number of vehicles (8 to 32) for the SD-TDRL framework with CNN, with 10, 8, and 6 hidden layers, and for the existing scheme.]
Chapter 3

Blockchain-based Distributed Software-defined Vehicular Networks: A Dueling Deep Q-Learning Approach
In this section, we describe the architecture of block-SDV and the manner in which multiple controllers interconnect with each other using the blockchain. We first introduce the network model, followed by the detailed consensus steps along with the theoretical analysis.
[Figure: The domain control layer of block-SDV: Domain Controllers 1, 2, and 3 interconnect through the blockchain system, which is supported by a virtualized computing server.]
a reliable manner. The consensus nodes can be described as C = {1, 2, ..., c, ..., C}
and C ∈ N . Specifically, each controller has multiple accounts that register on the
blockchain. Each account of blockchain sends transactions to the blockchain system
and is equipped with a public and private key to interact with the blockchain. Every
domain controller needs to make a smart contract, i.e., they desire to share their
neural network parameters with each other, and sign with their private key in order
to guarantee the effectiveness of the smart contract. A smart contract is a computer protocol that communicates, validates, or enforces a contract digitally. Smart contracts allow for trusted transactions without third parties, which
are traceable and irreversible. Then the signed smart contracts from different domain
controllers are transferred to the blockchain as different transactions. The verification
node is used to verify the data during transmission. These nodes determine the cor-
rectness of the data transmission by comparing the hash values of the blocks. If the
hash values match, the verification passes. If the hash values are inconsistent, the data has changed during transmission and the verification fails. The validating nodes will
validate those smart contracts and try to achieve consensus with each other. If the
consensus is achieved successfully, the event in each smart contract will be executed.
Hence, several cryptographic messages are needed to sign blocks, such as signatures
and message authentication codes (MACs). According to [73], the message can be
depicted as follows:
• ⟨k⟩_{χ_{cc'}} means that the message k is sent from node c to node c' containing a MAC.

• ⟨k⟩_{χ_c} means that the message k is signed with a public key from node c.

• ⟨k⟩_{\vec{χ}_c} means that the message k is authenticated by a vector of MACs with sender c.
In the proposed block-SDV, the local events and local OpenFlow commands are
collected as T ransaction #1, T ransaction #2,..., T ransaction #D in each domain
controller.
aim to select appropriate consensus nodes according to the trust features of all nodes in the blockchain. The consensus nodes are chosen by the preference of votes cast by the domain controllers.
2. All controllers send request messages to all the nodes.
A controller in the domain layer sends its transactions (request message) ⟨⟨transactions, m⟩_{χ_m}, m⟩_{\vec{χ}_m} to all the consensus nodes. This message contains the controller ID m and the total number of transactions D. It is signed with m's private key, and then authenticated with a MAC authenticator for all nodes in the blockchain. Once a consensus node receives the message, it first verifies the MAC. If the MAC is valid, then the signature of the message is verified by this consensus node. If the MAC is invalid, further request messages will not be processed.
3. All the nodes propagate the request message to all other replicas. In
this phase, each node propagates the receiving request message to all other consensus
nodes once the request has been verified. This process ensures that every correct
node will eventually receive a request as long as at least one correct node receives
the request before. For example, once a node c receives the PROPAGATE message ⟨PROPAGATE, ⟨transactions, m⟩_{χ_m}, c⟩_{\vec{χ}_{c'}} from node c', node c first verifies the MAC authenticator. If the MAC is valid, then c verifies the signature of the received transactions. If the signature is valid, node c sends the PROPAGATE message to all other nodes. Each replica generates (C − 1) MACs and (C − 1) signatures, and each replica receives f + 1 PROPAGATE messages from other replicas.
4. Replicas of the primary node execute three-phase actions to process the received request. As shown in Fig. 3.3, when the (C − 1) replicas receive requests, the primary sends PRE-PREPARE messages ⟨PRE-PREPARE, Pr, m, H(m)⟩_{\vec{χ}_{Pr}} authenticated by MAC for the replicas, where Pr is the primary's ID and H(m) is the hashed result of the issued block. The primary
[Figure 3.3: The message flow of the consensus process among blockchain nodes BC node 1 to BC node C over the six steps.]
validity of the incoming COMMIT message ⟨COMMIT, Pr, m, H(m), r⟩_{\vec{χ}_r}. After receiving 2f + 1 matching COMMIT messages from distinct replicas, c' will verify the smart contract. If it is valid, it will append this block to the blockchain.
5. The nodes execute the request and send a reply message to all controllers in the domain layer. After the request message operation is executed, each node sends a reply message ⟨REPLY, bl, c⟩_{χ_{c,m}} to a controller in the domain layer, where bl is the ID of a valid block and c is the node's ID. After receiving the reply messages from the blockchain system, each controller needs to verify f + 1 valid reply messages and the D transactions in the reply message in order to confirm that the smart contract is executed successfully in each replica. If the D transactions match the f + 1 reply messages, the controller accepts this block and uses the information inside the block to update its network view.
In block-SDV, we made some improvements to the traditional DPoS-BFT con-
sensus algorithm [77]. In the traditional DPoS algorithm, anyone can be chosen to participate in block production: as long as a candidate can convince the token holders (here, the domain controller accounts) to vote for it, it has the opportunity to produce blocks. Since the proposed block-SDV is a permissioned blockchain system, the traditional DPoS-BFT consensus mechanism is not suitable for block-SDV. Therefore, each domain controller selects the nodes participating in the consensus according to the trust degree of each blockchain node. Since malicious nodes in the blockchain degrade the overall system performance, the consensus mechanism in block-SDV ensures that each consensus node does not behave maliciously while participating in the consensus. In block-SDV, if the primary node of the system is malicious or its performance degrades, each domain controller selects a new primary node according to the trust degree of the nodes in each protocol instance.
In this section, we introduce a VANET trust model in the device layer. The
area controller is used to collect the vehicles’ data coming from the device layer, and
transfer the data to the domain control layer.
Regardless of which trust model is used, both direct trust and indirect trust are available. In the VANET environment, the direct trust value is first-hand routing information about neighbor nodes and is easy to obtain. The indirect trust value is second-hand routing information about nodes, such as recommended trust information from a third party. In order to simplify our trust model, we only use the history of direct interactions among vehicles to compute trust. Specifically, the direct trust assessment is evaluated from the packets transmitted among adjacent vehicles within their communication range, irrespective of the destination vehicle.
[Table: Fields of the neighbor routing table: Neighbor's ID, Vehicle's Position, Velocity, Heading, Available Throughput (AT), and Trust value T_{v_b v_b'}(t).]

We assume that the distance between the source vehicle and the destination vehicle is more than one hop, and that transmitted packets may be dropped randomly because of unexpected causes, such as black-hole attacks and a changing channel environment. In the device layer of the proposed block-SDV, trust evaluation is decided by the direct trust value, which is defined as:

T_{v_b v_b'}(t) = \frac{f^{C}_{v_b v_b'}(t)}{f_{v_b v_b'}(t)}, \quad t \le W    (3.1)
where T_{v_b v_b'}(t) denotes the direct trust of vehicle v_b towards its neighbor v_b', f_{v_b v_b'}(t) denotes the total number of packets forwarded from v_b towards node v_b', f^{C}_{v_b v_b'}(t) denotes the number of packets correctly forwarded in time period t, and W represents the width of the recent time window. Here, the packet sequence
number is used to determine the number of lost and correct forwarding packets. Ac-
cording to the sequence number of each packet, a packet is marked as either correctly
received or lost in the recent time window W . This is done by comparing the new-
ly received packet sequence number with the last received packet sequence number
in the neighbors' routing table. Therefore, each node is able to detect correct forwarding.
After each interaction, vehicle v_b checks whether vehicle v_b' forwards packets correctly at time slot t. If so, the trust value T_{v_b v_b'}(t) increases; otherwise, it decreases. In the trust model of block-SDV in the device layer, the trust value of each vehicle is limited to the range from 0 to 1 (i.e., 0 ≤ T_{v_b v_b'}(t) ≤ 1). A trust value of 0 means complete distrust, whereas a trust value of 1 implies absolute trust. If there is no interaction between two vehicles in an area, the initial trust value is set to 0.6 (a less trustworthy vehicle). Here, we introduce a threshold θ to depict the trust threshold, which is used to distinguish malicious vehicles. In other words, if the trust value of any vehicle is less than the threshold θ, it can be regarded as a malicious node. In order to minimize the possibility of transmission failure, each vehicle in an area needs to establish communication with its most trusted neighbor vehicle, whose trust value is higher than the trust requirements shown in Table 3.2 [78].
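A minimal sketch of this direct-trust bookkeeping, assuming simple per-neighbor packet counters accumulated over the recent time window; the class and method names are illustrative rather than part of the actual implementation.

class DirectTrust:
    """Tracks T = (correctly forwarded packets) / (total forwarded packets) per neighbor."""

    def __init__(self, theta=0.5, initial_trust=0.6):
        self.theta = theta                   # trust threshold used to flag malicious vehicles
        self.initial_trust = initial_trust   # default when there has been no interaction yet
        self.forwarded = {}                  # neighbor id -> total packets forwarded
        self.correct = {}                    # neighbor id -> packets correctly forwarded

    def record(self, neighbor, correctly_forwarded):
        self.forwarded[neighbor] = self.forwarded.get(neighbor, 0) + 1
        if correctly_forwarded:
            self.correct[neighbor] = self.correct.get(neighbor, 0) + 1

    def trust(self, neighbor):
        total = self.forwarded.get(neighbor, 0)
        if total == 0:
            return self.initial_trust        # no interaction: less trustworthy default of 0.6
        return self.correct.get(neighbor, 0) / total   # always within [0, 1]

    def is_malicious(self, neighbor):
        return self.trust(neighbor) < self.theta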
The available throughput between two neighboring vehicles is defined as:

AT_{v_b v_b'} = \frac{T_{v_b v_b'} \cdot f_{v_b v_b'}}{t_{trans}^{v_b v_b'} + t_{pro}^{v_b v_b'}}    (3.2)

where T_{v_b v_b'} \cdot f_{v_b v_b'} denotes the total number of packets successfully transmitted by vehicle v_b, t_{trans}^{v_b v_b'} represents the total time that vehicle v_b spends transmitting packets, and t_{pro}^{v_b v_b'} represents the propagation time of transmitting those packets. According to [69], t_{trans}^{v_b v_b'} includes two different components, the successful packet transmission time and the unsuccessful packet transmission time:

t_{trans}^{v_b v_b'} = \sum_{z=1}^{U_s} t_z^s + \sum_{z=1}^{U_f} t_z^f    (3.3)

t_z^s = CW_{U_s'} + T_f + \sum_{u=1}^{U_s'} (CW_u + T_s)    (3.4)

where CW_u denotes the average contention window size for the packet transmission attempts, and CW_{U_s'} denotes the average contention window size for the last successful attempt of transmitting packet z. According to [79], T_s represents the successful transmission time for an access category, and it can be defined as follows:
where ν denotes the propagation delay, T_{Header} and T_{packet} represent the time cost for the packet header and the data information, T_{SIFS} and T_{ACK} are the time cost for the short inter-frame space and the ACK acknowledgment, and T_{AIFS_{AC_u}} = AIFS_{AC_u} \times T_{slot} + T_{SIFS} denotes the time cost of the arbitration inter-frame space for access category AC_u.
Similarly, the unsuccessful packet transmission time of a packet, which is dropped after a maximum number of retransmissions (U_f'), is:

t_z^f = \sum_{u=1}^{U_f'} (CW_u + T_f)    (3.6)

where T_f denotes the unsuccessful transmission time for an access category, and it
where CW_{max} and CW_{min} are the maximum and minimum contention window sizes defined based on the IEEE 802.11p access categories.
Simultaneously, we need to define t_{pro}^{v_b v_b'} as:

t_{pro}^{v_b v_b'} = \frac{d_{v_b v_b'}}{\lambda}    (3.9)

where λ is the propagation speed, which depends on the physical medium of the link, and d_{v_b v_b'} represents the distance between two neighbor vehicles v_b(x_{v_b}, y_{v_b}) and v_b'(x_{v_b'}, y_{v_b'}):

d_{v_b v_b'} = \sqrt{(x_{v_b'} - x_{v_b})^2 + (y_{v_b'} - y_{v_b})^2}    (3.10)

Therefore, the available throughput AT_{v_b v_b'} between two vehicles v_b and v_b' can be obtained from (3.2).
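Under the definitions in (3.2), (3.9), and (3.10), the available throughput between two neighbors can be sketched as below; the transmission time t_trans is passed in as a precomputed value because (3.3)-(3.6) depend on the contention-window statistics, and the numbers in the example call are purely illustrative.

import math

def propagation_time(pos_a, pos_b, propagation_speed):
    # d from (3.10) divided by the propagation speed lambda, as in (3.9)
    distance = math.hypot(pos_b[0] - pos_a[0], pos_b[1] - pos_a[1])
    return distance / propagation_speed

def available_throughput(trust, forwarded_packets, t_trans, t_pro):
    # AT = (T * f) / (t_trans + t_pro), as in (3.2)
    return (trust * forwarded_packets) / (t_trans + t_pro)

# Example with illustrative values:
t_pro = propagation_time((0.0, 0.0), (30.0, 40.0), propagation_speed=3e8)
at = available_throughput(trust=0.9, forwarded_packets=200, t_trans=0.05, t_pro=t_pro)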
System
As described in Section 3.1.2, there are six steps when blockchain nodes try to
achieve consensus with each other. In each step, the cost can be divided into two
parts: the cost of the primary node and the cost of each replica, which can be defined
as follows.
1) Theoretical analysis for all the controllers sending request messages
to all the nodes: We assume that there are D transactions transferred to the
blockchain system from the domain control layer, and that a fraction g of the transactions sent by the controller are correct [80]. We consider that, at each consensus node, generating one MAC, verifying one MAC, and verifying one signature require α, α, and β cycles, respectively.
Hence, the cost of a primary node in this step can be shown as:
C_p^{req} = \frac{D}{g}(\alpha + \beta) + (\alpha + \beta)    (3.11)

C_p^{pro} = \left(\frac{D}{g} + C - 1\right)(\alpha + \beta)    (3.12)

C_r^{pro} = \left(\frac{D}{g} + 1\right)(\alpha + \beta)    (3.13)

C_r^{pre} = \alpha + \frac{D}{g}(\alpha + \beta)    (3.15)
4) Theoretical analysis for the prepare procedure: In this phase, the primary node needs to verify 2f matching PREPARE messages, and each replica generates C − 1 MACs for all other consensus nodes as well as verifies 2f MACs. Thus, the
cost of the primary node is:
C_p^{par} = 2f\alpha    (3.16)

C_r^{par} = (C - 1 + 2f)\alpha    (3.17)
5) Theoretical analysis for the commit procedure: In this phase, the primary node generates C − 1 MACs and verifies 2f + 1 MACs. Each replica also needs to generate C − 1 MACs for all other nodes and to verify 2f + 1 MACs. Therefore, the cost of the primary node is:

C_p^{com} = (C + 2f)\alpha    (3.18)

C_p^{rep} = \frac{MD}{g}\alpha    (3.19)
For the replicas, the cost for each transaction can be calculated as:
Hence, the throughput of the blockchain system is at most:

BT_{blockchain} = \min\left\{ \frac{k\eta}{C_p}, \frac{k\eta}{C_r} \right\} \text{ tx/s}    (3.22)
Here, the BTblockchain is used to estimate the number of transactions that the
blockchain system can process per second.
The learning agent (each controller in the domain control layer) needs to know
the state space S(t) at time slot t. From the above description, the state space
S(t) includes the trust feature of vehicles T_{v_b v_b'}(t), the trust feature of each node in the blockchain φ_p^c(t), and the computing capability μ_E(t). Here, since there is no centralized security service, all nodes and controllers have distinct trust features. Moreover, γ_N(t) denotes which node is a consensus node, i.e., if γ_n(t) = 1, node n can be selected as a consensus node. Hence, the system state space jointly
considers the trust feature of each node in the blockchain, the trust feature of each
vehicle in device layer, computing capabilities of each edge computing servers, and
the number of consensus nodes in the blockchain. The state transition diagram is
shown in Fig. 3.4.

[Figure 3.4: An example of the state transition diagram for the trust features of each blockchain node.]
S(t) = \{ T_{v_b v_b'}(t), φ_p^c(t), γ_N(t), μ_E(t) \}    (3.23)
Here, the trust feature of each vehicle in each area of the device layer has al-
ready been described in Section 3.2.1. Since the communication channel is always
changing, we can hardly know the trust feature of each vehicle in an area of the device layer at the next time slot. Hence, the state transition probability for T_{v_b v_b'}(t) is ζ_{T_{v_b v_b'}(t) T_{v_b v_b'}(t+1)}, and the state transition probability matrix can be represented as Υ = [ζ_{T_{v_b v_b'}(t) T_{v_b v_b'}(t+1)}]_{J×J}. We also consider the trust feature of blockchain nodes be-
cause the learning agent needs to know the trust feature of the nodes in the blockchain
system so as to select the primary node in a consensus period. Thus, the trust fea-
ture of a node c ∈ {1, 2, ..., C} can be modeled as random variable φcp (t) at time slot
t. Since the trust feature of each consensus node is always changing, the transition
probability of φ_p^c(t) is Φ_{φ_p^c(t) φ_p^c(t+1)}, and the state transition probability matrix can be defined as Ψ = [Φ_{φ_p^c(t) φ_p^c(t+1)}]_{H×H}. Meanwhile, each domain controller is able to adjust the number of consensus nodes dynamically, which can be modeled as a random variable γ_n(t) at time slot t. Because of the varying trust feature of each consensus node, we do not know the number of consensus nodes at the next time slot. Hence, the transition probability for γ_n(t) is Γ_{γ_n(t) γ_n(t+1)}, and the transition probability matrix can be defined as Ω = [Γ_{γ_n(t) γ_n(t+1)}]_{Y×Y}. Finally, we use virtual computing servers to
do computing tasks in the block-SDV. There are many different computing tasks in
the blockchain system, such as verifying signatures, generating MACs, and verifying
MACs. Let Ho = {do , qo } denote a computing task related to message o, where do
means the size of message o, and qo means the required number of CPU cycles to
complete this task. We do not know the computational resources for the block-SDV
at next time slot. Therefore, we model the computation resources of edge computing
server e ∈ {1, 2, ..., e, ..., E} as a random variable μ_e(t) at time slot t. There are T time slots; the period starts when the domain controller issues an un-validated block and terminates when the controller receives a validated block in reply.
The learning agent in the domain control layer needs to decide which node is the primary node in the blockchain system, which edge computing server is selected for computing resources, how many consensus nodes participate, and which trusted neighbor vehicles are selected. Therefore, the system action vector is:

A(t) = {A_V(t), A_p(t), A_C(t), A_E(t)}

where A_V(t), A_p(t), A_C(t), and A_E(t) are described in the following.
1) A_V(t) = {a_1(t), a_2(t), ..., a_v(t), ..., a_V(t)} represents the action vector of vehicles in each area of the device layer. The value of a_v(t) is {0, 1}. For example, if a_1(t) = 0 at time slot t, vehicle 1 is not selected as a trusted neighbor vehicle; if a_1(t) = 1, vehicle 1 can be selected by the learning agent as a trusted neighbor vehicle. Moreover, each vehicle aims to choose the most trusted node (highest trust value) as its neighbor for communication. Therefore, let a route P, consisting of L nodes, be represented as L = {1_b, ..., l_b, ..., L_b} with L ∈ V, where node l_b denotes the lth node in the route. The total trust of this routing path in an area b can be depicted as \prod_{l_b=1}^{L_b} T_{l_b, l_b+1}(t).
2) A_p(t) = {a_p^1(t), a_p^2(t), ..., a_p^c(t), ..., a_p^C(t)} represents the action vector of consensus nodes in the blockchain. The value of a_p^c(t) is {0, 1}. For example, if a_p^1(t) = 0 at time slot t, node 1 is a replica; if a_p^1(t) = 1, node 1 can be selected by the learning agent as the primary node. As we described in Section 3.1.2, the blockchain system has only one primary node at a time, so \sum_{c=1}^{C} a_p^c(t) = 1.
3) A_N(t) = {a_1(t), a_2(t), ..., a_n(t), ..., a_N(t)} represents the action vector for selecting the number of consensus nodes. The value of a_n(t) is {0, 1}. For example, if a_1(t) = 0 at time slot t, node 1 is not selected as a consensus node; if a_1(t) = 1, node 1 can be selected by the learning agent as a consensus node.
4) A_E(t) = {a_1(t), a_2(t), ..., a_e(t), ..., a_E(t)} represents the action vector of the edge computing servers. There are E edge computing servers, and the set of those computing servers is represented by E = {1, 2, ..., e, ..., E}. The value of a_e(t) is {0, 1}. When the learning agent selects an edge computing server to offload to, the value of the action is 1; otherwise, the value of the action is 0. Therefore, \sum_{e=1}^{E} a_e(t) = 1. The execution time
where f_L = \sum_{l=1}^{L} f_l, k = \sum_{c=1}^{C} a_p^c(t) φ_p^c(t), and C = \sum_{n=1}^{N} a_n(t) γ_n(t).
From the above formulation, the immediate reward of block-SDV is denoted as r(t). Specifically, the learning agent senses the system state S(t) from the
environment at time slot t, then the agent outputs a policy π that determines which
action should be executed from the system action vector A(t). The reward value will
be returned to the learning agent at time slot t. Then the system state space changes
to the next state S(t + 1), and the learning agent outputs new policy and gets a new
immediate reward r(t + 1). The learning agent aims to find an optimal policy to
maximize the long-term reward, and the cumulative reward r_{long} can be written as:

r_{long} = \max \mathbb{E}\left[ \sum_{t=1}^{T-1} \gamma^{t} r(t) \right]    (3.26)
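The cumulative reward in (3.26) can be computed from an observed per-slot reward trace as in the short sketch below, which is a direct transcription of the discounted sum for one realized trajectory; the reward values in the example are illustrative.

def long_term_reward(rewards, gamma=0.9):
    # sum over t of gamma^t * r(t), following (3.26) for one realized trajectory
    return sum((gamma ** t) * r for t, r in enumerate(rewards, start=1))

# Example: rewards observed over five time slots (illustrative values)
print(long_term_reward([1.0, 0.8, 1.2, 0.9, 1.1]))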
[Figure 3.5: The workflow of dueling deep Q-learning with prioritized experience replay.]
In this section, we consider using dueling deep Q-learning approach with prior-
itized experience replay to address the joint optimization problem as we described
in Section 3.3. There are several reasons to use this approach: 1) the system is high-dimensional and highly dynamic because of the system state defined in (3.23), and it is hard to solve using traditional optimization methods; 2) in block-SDV, which action the learning agent chooses from the action space A(t) has no relationship with what happens in the next time slot because of the transition probabilities described in Section 3.3, so a traditional optimization method that relies on the relationship between state and action is not suitable. First, we briefly introduce the mechanism of dueling deep Q-learning with prioritized experience replay, and then we present the approach used in this chapter.
A deep Q-learning (DQL) concept introduced by [49] aims to solve the instabili-
ty of traditional Q-network. There are two important improvements compared with
traditional reinforcement learning method: experience replay and target Q-network.
Experience replay stores trained data and then randomly samples from the pool.
Therefore, it reduces the correlation of data and improves the performance com-
pared with the previous reinforcement learning algorithms [81]. Meanwhile, target
Q-network is another improvement in the deep Q-learning. That is, it calculates a
target Q-value using a dedicated target Q-network, rather than directly using the
pre-updated Q-network. The purpose of this is to reduce the relevance of the target
calculation to the current value.
Although deep Q-learning achieves big improvements compared with traditional reinforcement learning methods, many researchers still make great efforts toward even greater performance and higher stability. Here, we introduce a recent improvement:
dueling deep Q-learning [2]. The core idea of this approach is that it does not need
to estimate the value of taking each available action. For some states, the choice of action has no influence on the states themselves.
of DDQL can be divided into two main components: value function and advantage
function. The value function is used to represent how good it is to be in a given
state, and advantage function can measure the relative importance of a certain ac-
tion compared with other actions. After value function and advantage function are
separately computed, their results are combined back to the final layer to calculate
the final Q-value. The mechanism of DDQL would lead to better policy evaluation
compared with DQL.
Prioritized experience replay [82] (PER) can make traditional experience replay
more efficient and effective. The core idea of prioritized experience replay is that
the learning agent can more effectively learn from some samples than others. In
experience replay pool of deep reinforcement learning, some samples may be more
or less redundant. In other words, these samples may not be useful to the learning
agent. Therefore, the core idea of PER is not random sampling but the prioritization of samples in the experience replay pool, which makes it more efficient at finding the samples from which the learning agent most needs to learn.
PER uses the temporal-difference (TD) error to evaluate the priority of the samples. If the TD-error of a sample is larger, its priority is higher than that of other samples, which means that this sample has a high priority to be learned by the agent.
According to [82], the TD-error ρj can be depicted as:
where Rj represents the reward value of sample j, and Q(Sj , a) denotes the Q-value
of sample j.
Greedy TD-error prioritization has some problems: the greedy policy is not very accurate and is prone to overfitting. In order to solve these problems, the probability of sampling j is:

P(j) = \frac{p_j^{\tau}}{\sum_k p_k^{\tau}}    (3.28)

where p_j ≥ 0 represents the priority of sample j, and the exponent τ denotes how much prioritization is used.
After we get the priority of a sample j, we can get the importance-sampling weight (IS-weight) as:

w_j = \left( \frac{1}{G} \cdot \frac{1}{P(j)} \right)^{\kappa}    (3.29)

[Figure 3.6: The TensorFlow graph in TensorBoard using CNN with PER.]
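A minimal sketch of the prioritized sampling in (3.28) and the IS-weights in (3.29), assuming that the priority p_j is the absolute TD-error plus a small constant and that G is the number of stored samples; both assumptions are illustrative, as is the final normalization of the weights.

import numpy as np

def sample_with_priority(td_errors, batch_size, tau=0.6, kappa=0.4, eps=1e-6):
    priorities = np.abs(td_errors) + eps                     # assumed p_j = |TD-error| + eps
    probs = priorities ** tau / np.sum(priorities ** tau)    # P(j), as in (3.28)
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    G = len(td_errors)                                       # assumed normalization term
    weights = (1.0 / (G * probs[idx])) ** kappa              # w_j, as in (3.29)
    weights /= weights.max()                                 # keep weights in (0, 1] for stability
    return idx, weights

# Example: draw a mini-batch of 4 indices from 10 stored TD-errors
indices, is_weights = sample_with_priority(np.random.randn(10), batch_size=4)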
Dueling Deep Q-Network with Prioritized Experience Replay Approach
In block-SDV, there are a lot of states. No matter what action the learning
agent takes, it doesn’t affect the next state transition. According to [2], dueling deep
Q-network (DDQN) has better performance than nature deep Q-network. On the
other hand, according to [1], experience replay mechanism uses a uniform sampling
method. However, in block-SDV, some samples in the experience replay pool have
a larger amount of information. For example, there is only one primary node in
block-SDV, which means that only one state in ζ_a(t) has the highest reward. With uniform sampling, the learning efficiency of the agent is not very good. Therefore, we use DDQN with PER in our proposed block-SDV. The workflow of this approach is shown in Fig. 3.5.
In the DDQN, the value function V(S(t)) and the advantage function A(S(t), A(t)) are aggregated into the final output Q(S(t), A(t)). The dueling network architecture is shown in Fig. 3.6. The final Q-value under policy π at time slot t can be denoted as:

Q^{\pi}(S(t), A(t)) = V^{\pi}(S(t)) + A^{\pi}(S(t), A(t); \omega, \iota) - \frac{1}{|A|} \sum_{A(t+1)} A^{\pi}(S(t), A(t+1); \omega, \iota)    (3.31)
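For a single state, the aggregation in (3.31) reduces to adding the centered advantages to the state value, as in the sketch below; the NumPy arrays are illustrative stand-ins for the outputs of the two streams of the dueling network.

import numpy as np

def dueling_q_values(value, advantages):
    # Q = V + (A - mean(A)): the advantage stream is centered before aggregation
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())

# Example: one state value and advantages for four candidate actions (illustrative)
print(dueling_q_values(1.5, [0.2, -0.1, 0.4, 0.0]))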
[Figure: The simulation setup: the local domain controllers run Python scripts (Localcontroller.py, Implementcontract.py, duelingDQN.py) and interact with the blockchain system through RPC port 8545.]
The simulation has been implemented in a GPU-based server with the processor
of Intel(R) Core(TM) i7-6600 CPU and 16GB memory. The software environment
that we use is TensorFlow 1.6.0 [71] with python 3.6 on Windows 10 64-bit operating
system. These two simulation tools have been widely used in industry and academia.
TensorFlow is an open source software library for machine learning, and it has been
widely used to deploy new machine learning algorithms and experiments. Therefore,
we can confirm that the system performance of our proposed block-SDV can estimate
and approximate the performance in realistic networks.
First, we introduce the throughput model in the area control layer, in which we
assume that there are six areas in the device layer. As shown in Fig. 3.1, each
area includes several SDN-enabled vehicles and one RSU, and some distributed SDN controllers are randomly deployed within the area control layer to manage the corresponding areas. We assume that each vehicle's state can be trustworthy (trust level higher than 0.6), suspect (trust level between [θ, 0.6]), or malicious (trust level lower than θ). In our simulation, we assume that the trust-level threshold θ is 0.5. Trustworthy vehicles are trusted nodes that aim to establish a secure routing path from the source to the destination. The proposed block-SDV provides a method for good entities to avoid working with suspect and malicious vehicles. Each vehicle may keep its state unchanged or change its state at the next time slot t. Therefore, we set the transition probability of each vehicle in each area. An ex-
ample of the transition probability matrix for each vehicle in an area can be set as
Υ = ((0.1, 0.3, 0.5, 0.1), (0.5, 0.2, 0.2, 0.1), (0.6, 0.2, 0.1, 0.1), (0.3, 0.6, 0.05, 0.05)).
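For example, a vehicle's state evolution under such a transition matrix can be simulated as in the sketch below, which reuses the rows of the example Υ above; the integer state labels are illustrative.

import numpy as np

# Rows of the example transition matrix: each row gives the probabilities of
# moving from the current state to each possible next state and sums to 1.
UPSILON = np.array([
    [0.1, 0.3, 0.5, 0.1],
    [0.5, 0.2, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.3, 0.6, 0.05, 0.05],
])

def next_state(current_state):
    # Sample the vehicle's state at time slot t+1 given its state at time slot t.
    return np.random.choice(len(UPSILON), p=UPSILON[current_state])

# Example: evolve one vehicle's state over five time slots starting from state 0.
state, trajectory = 0, [0]
for _ in range(5):
    state = next_state(state)
    trajectory.append(int(state))
print(trajectory)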
In our simulation, the trust feature of each node in the blockchain can be il-
lustrated as trustworthy, suspect and un-trusted. The most trusted node can
comparison:
• Proposed DDQL-based scheme with PER using CNN. In this scheme, each
learning agent selects most trusted vehicles, an appropriate number of consensus
nodes, the most trusted consensus node in the blockchain system as the primary
node for the consensus of the smart contract, and the virtualized computing
server with better computing capability. The architecture of deep Q-network
using in this scheme is CNN. Comparing with traditional deep Q-network, CNN
can perform more efficient feature extraction, which improves training efficiency.
Therefore, this scheme should have the best performance.
• Proposed DDQL-based scheme with PER using neural networks with 8, 9, and 10 hidden layers. This scheme also selects the most trusted controller, the most trusted
node as the primary node, and a better computing capability of the server. The
performance should be worse than the proposed scheme using CNN.
• Existing scheme 2 [45] with local computational capabilities, without the se-
lection of trusted vehicles, and with traditional view changes. Meanwhile, the
DDQL approach is not employed in this scheme. This scheme allows domain
[Figure 3.8: Training curves tracking the loss of block-SDV under different schemes.]

[Figure 3.9: The system throughput (tx/s) versus training episodes under different schemes.]
effectiveness of deep neural networks. Moreover, from Fig. 3.8, we can see that our proposed schemes are better than the existing scheme. This is because, as the number of deep neural network layers increases, the neural network achieves a higher degree of feature abstraction, which leads to better performance. Meanwhile, in our simulation, the importance of the samples differs: some samples are important and have high priority, so we use the prioritized experience replay method to improve the learning efficiency.
Fig. 3.9 shows the system throughput comparison under different schemes. In
our simulation, each point is the average throughput per episode, and the agent uses
the Adam optimizer [72]. As shown in this figure, we can also conclude that the system throughput of our proposed schemes is better than that of the existing scheme. There are two reasons for the higher throughput: 1) our proposed schemes use the RBFT consensus mechanism. Compared with PBFT, RBFT adds important transaction
verification steps, ensuring consensus on the order of transaction execution as well as on the results of block verification, thus improving the consensus efficiency and system throughput; 2) since the existing scheme does not consider the trust feature of each vehicle, malicious vehicles degrade its system throughput. Meanwhile, this figure shows the convergence performance of DDQL with PER. At the beginning of training, DDQL with PER goes through some trial and error. As the training episodes increase, the system throughput becomes stable, which means that the agent has learned the optimal policies to maximize the long-term reward.

[Figure 3.10: The system throughput (tx/s) versus training episodes with learning rates of 0.001, 0.0001, 0.00001, and 0.000001.]
Fig. 3.10 shows the throughput convergence performance of the proposed scheme
using the DDQL-based scheme with CNN under different learning rates. The learning rate determines the updating speed of the weight factors in the deep Q-network. If the learning rate is too large, it will cause the result to overshoot the optimal value; otherwise, it will slow the convergence of the Q-network. From this figure, when the learning rates are 0.001 and 0.0001, the training curves fluctuate considerably, which causes them to miss the global optimum. Comparing the yellow and purple curves, although the yellow curve has a faster convergence speed, it is not very stable after convergence compared with the purple curve. Therefore, we choose the learning rate of 1e−5 because of its acceptable convergence speed and better learning stability.

[Figure 3.11: Training curves tracking the loss of block-SDV under different learning rates (0.000001, 0.00001, 0.0001, and 0.001).]
Fig. 3.11 shows the loss convergence performance of the proposed scheme using the deep neural network with ten hidden layers. From this figure, we can see that the learning rate affects the convergence performance. The learning rate is the step size used to optimize the loss function: a larger learning rate means longer learning steps and larger oscillations. When the learning rate is equal to 0.001, the loss has the largest oscillation, and as the learning rate decreases, the oscillation of the training curves decreases accordingly. According to this figure, we choose the learning rate of 1e−5 in our simulation because it has the lowest oscillation and finally converges after 2500 episodes.
After the effective training using CNN and deep neural network, we use these
results for further simulation. Fig. 3.12 shows the throughput comparison versus
the number of controllers in the domain control layer.

[Figure 3.12: The throughput of block-SDV versus the number of controllers (10 to 30) for the DuelingDQL-based schemes using CNN and using 10, 9, and 8 hidden layers, and for existing schemes 1 and 2.]

Specifically, this figure also reflects the throughput performance in a large realistic SDV environment because
there are up to 30 domain controllers interacting with the area control layer and the blockchain system. According to this figure, we can see that as the number of domain controllers increases, the system throughput decreases. This is because more domain controllers require more computational resources to verify signatures and MACs. However, our proposed scheme, which uses the hierarchical SDN control platform, the RBFT consensus mechanism, and the DDQL-based approach with PER, has better performance, as shown by the blue curve. Meanwhile, we can conclude that our proposed scheme performs better in a large SDV environment.
Fig. 3.13 shows the throughput comparison versus the number of consensus nodes in the blockchain. As shown in this figure, as the number of consensus nodes increases, the system throughput decreases. This is because more signatures and MACs are generated as the number of consensus nodes grows, and these signatures and MACs need more CPU cycles to handle, resulting in decreased system throughput. However, our proposed scheme using CNN is still better than the existing schemes.

[Figure 3.13: The throughput of block-SDV versus the number of consensus nodes (6 to 20) for the proposed DDQL-based schemes with PER using CNN and using 10, 9, and 8 hidden layers, and for existing schemes 1 and 2.]
Fig. 3.14 shows the throughput comparison versus the batch size of each block.
From this figure, we can see that the system throughput increases as the block size enlarges in our simulation. This is because a larger block can contain more transactions, synchronizing more local network events among the domain controllers, which leads to an increase in system throughput. Meanwhile, our proposed scheme using CNN with PER is still better than the existing schemes.

[Figure 3.14: The throughput comparison versus the batch size of each block.]
Chapter 4
[Figure 4.1: The BMEC-FV architecture, with the blockchain system in the cloud, inter-vehicle communication, and vehicle-to-RSU communication.]
As shown in Fig. 4.1, we assume that there are R RSUs and C block producers
in our BMEC-FV system. It should be specially pointed out that the block producers
are selected by multiple RSUs, and each RSU is connected to the MEC server. Here,
the block producer can be expressed as B = {B1 , B2 , ..., Bc , ..., BC }. In our proposed
architecture, U vehicles around multiple RSUs communicate with each other, denoted
by V = {V1 , V2 , ..., Vu , ..., VU }. In order to ensure the safety of communication between
vehicles, the trust value of each vehicle is considered to be the computing task that
needs to be calculated by each RSU. In the VANET environment, the trust calcula-
tion between vehicles needs to be offloaded to the adjacent RSU. In the BMEC-FV
architecture, we do not consider the local computation of each vehicle.
The consensus mechanism is a very important part of the blockchain. In general,
the consensus mechanism ensures that all blockchain nodes comply with the protocol
rules and ensure that all transactions are performed in a reliable manner in our
by the receiving node, such as ET X [84] for collecting tree protocols. Let α denote
the estimated node reliability metric, which is defined as follows according to our
previous definition:
α = \frac{1}{(1 - A)(1 - B)} = \frac{1}{1 - γ}    (4.1)
In fact, the receiving node will need to measure α on the packet window. In this
chapter, we record the moving average of α and we expect that there will be no large
fluctuations. Therefore, we use an exponentially weighted moving average (EWMA)
to track the α value, which is defined as follows:
α_i = β \frac{1}{1 - γ_i} + (1 - β) α_{i-1}    (4.2)
where the subscript i reflects the window for calculating the packet loss rate γi , and
the coefficient β indicates the rate at which the weighting decreases. Now we can
define the reliability estimates of conventional VANET nodes and malicious nodes as
αAi and αABi , respectively.
α_{A_i} = β \frac{1}{1 - A_i} + (1 - β) α_{A_{i-1}}    (4.3)

α_{AB_i} = β \frac{1}{(1 - A_i)(1 - B_i)} + (1 - β) α_{AB_{i-1}}    (4.4)
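A small sketch of the EWMA reliability tracking in (4.2)-(4.4); the per-window loss rates passed in are assumed to have been measured by the receiving node, and the values in the example loop are illustrative.

def update_reliability(alpha_prev, loss_rate, beta):
    # alpha_i = beta * 1/(1 - gamma_i) + (1 - beta) * alpha_{i-1}, as in (4.2)
    return beta * (1.0 / (1.0 - loss_rate)) + (1.0 - beta) * alpha_prev

def update_reliability_with_attack(alpha_prev, channel_loss, malicious_drop, beta):
    # alpha_{AB_i} uses the combined loss 1 - (1 - A_i)(1 - B_i), as in (4.4)
    combined_loss = 1.0 - (1.0 - channel_loss) * (1.0 - malicious_drop)
    return update_reliability(alpha_prev, combined_loss, beta)

# Example: track alpha over three measurement windows (illustrative loss rates)
alpha = 1.0
for a_i, b_i in [(0.05, 0.0), (0.05, 0.2), (0.1, 0.3)]:
    alpha = update_reliability_with_attack(alpha, a_i, b_i, beta=0.2)
print(alpha)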
After defining the VANET communication model, we need to define the commu-
nication model between the vehicles and the RSUs. As stated earlier, the reliability of
each vehicle is transmitted to the RSUs. In this process, we need to consider how to
a constraint as follows:
\sum_{u=1}^{U} S_{V_u} + \sum_{c=1}^{C} S_{B_c} \le S    (4.5)
Similarly, the information transmission rate IBc ,Bc0 between RSUs can be expressed
as:
As we know, the data transfer rate of each RSU is the sum of the rate at which the
VANET node transmits to the RSU and the rate at which other RSUs transmit to the
RSU. We assume that there are u VANET nodes transmitting reliable information
to this RSU, and there are c' RSUs transmitting information to this RSU. The data transmission rate of this RSU should not exceed the receiving data transmission rate. Therefore, we have a second constraint as follows:

\sum_{u \in U} S_{V_u} H' \log_2(1 + δ_{V_u,B_c}) + \sum_{c' \in C} S_{B_{c'}} H' \log_2(1 + δ_{B_{c'},B_c}) \ge I_{B_c,B_{c''}}    (4.8)
In BMEC-FV, the BCs have many different computing tasks. For example, they may execute smart contracts, generate and verify block signatures, and reach consensus between blocks. When each RSU receives a message (a VANET node sends a reliable
message to the RSU), it contains some specific computing tasks. Let Co = {do , qo }
denote a computing task related to message o, where do means the size of message o,
and qo means the required number of CPU cycles to complete this task. We do not
know the computational resources for BMEC-FV at the next time slot. As mentioned
earlier, each RSU is connected to the MEC server. We assume that at time t, the
computing capacity of each MEC server is B_c(t). There are T time slots; the period starts when an RSU issues an un-validated block and terminates when the RSU receives a validated block in reply.
For simplicity, at time t, we assume that the computing capacity assigned to the
message o is customized. Since the computing capacity of each MEC server changes
over time, we need to define a transition probability matrix for each MEC server’s
computing capacity. Let gxy represent the transition probability of Bc (t) moving
from state x to next state y. Hence, the X × X transition probability matrix for gxy
is [gxy ]X×X , where gxy = P r{Bc (t + 1) = y|Bc (t) = x}.
Based on the above model, the execution time of computing task Co at RSU Bc
can be calculated as:
T_{B_c} = \frac{q_o}{B_c(t)}    (4.9)
Existing BFT algorithms (Prime, Aardvark, Spinning) do not really achieve Byzantine fault tolerance, mainly because there is a single "primary" responsible for ordering. If the primary becomes a malicious node, the performance of the entire system drops significantly without being detected. RBFT proposes a new mechanism: a multi-core machine executes multiple PBFT protocol instances in parallel, and only the results of the primary instance are actually executed. Each protocol instance is monitored for performance and compared with the primary instance. If the performance of the primary node degrades significantly, this node is considered a malicious node and RBFT initiates the replacement process. According to [75], the performance degradation of RBFT in the presence of BFT attacks is at most 3%, while the other protocols degrade by 78% (Prime), 87% (Aardvark), and 99% (Spinning). RBFT has been used in some real scenarios, such as the Hyperchain project [85], which is hosted by the Linux Foundation and supports applications with a modular architecture. Thus, we consider that RBFT can be used in real scenarios when deploying MEC servers.
As shown in Fig. 4.2, there are six steps when blockchain nodes try to achieve
consensus with each other. In each step, the cost can be divided into two parts:
the transmission latency of the primary node and the transmission latency of each
replica. We consider that, at each consensus node, executing a smart contract, generating one message authentication code (MAC), verifying one MAC, and verifying one signature need ν, ζ, ζ, and ξ cycles, respectively.
We assume that there are D transactions transferred to the blockchain system
from each vehicle, and a fraction g of transactions sent by the RSUs are correct [80].
[Figure 4.2: The message flow of the RBFT consensus process among blockchain nodes BC node 1 to BC node C over the six steps.]
latency from each vehicle to each replica in the BCs can be defined as:
T_{rel}^{L} = \max_{V_u \in V} \left\{ \frac{σ}{I_{V_u,B_r}} \right\}    (4.10)
where IVu ,Br denotes the transmission rate from the vehicle to each replica in the BCs.
Once each replica receives the message, a consensus node first verifies the MAC.
If the MAC is valid, then the signature of the message is verified by this consensus
node. If the MAC is invalid, further request messages will not be processed. Hence,
the cost of the primary node is (D/g + 1)(ζ + ξ). We assume that the computation capacity assigned to one reliable message is B_p. The computation latency of the primary node can be calculated as:

C_{rel}^{L} = \frac{(D/g + 1)(ζ + ξ)}{B_p}    (4.11)

and the computation latency of each replica is the same as C_{rel}^{L}.
T_{pro}^{L} = \max_{B_p,B_r \in B} \left\{ \frac{D \times σ}{I_{B_p,B_r}} \right\}    (4.12)

C_{pro}^{L} = \max_{B_p,B_r \in B} \left\{ \frac{(\frac{D}{g} + \frac{C+2}{3})(ζ + ξ)}{B_p, B_r} \right\}    (4.13)
is the primary’s ID and H(m) is the hashed results of the issued block. The primary
node generates (e − 1) MACs for all other replicas. In this phase, the primary node
packs the transactions into blocks according to the received time sequence, performs verification, executes the smart contract, and finally writes the sequenced transaction information together with the verification results into the PRE-PREPARE message and broadcasts it to all consensus nodes. We assume that the block size is S_b and the maximum number of transactions that can be stored in each block is S_b/σ. The transmission latency in this phase is:
T_{pre}^{L} = \max_{B_p,B_r \in B} \left\{ \frac{S_b}{I_{B_p,B_r}} \right\}    (4.14)

The cost of the primary node is \frac{2C-2}{3}ζ. After each replica receives the block, the computation cost of each replica is \frac{S_b}{σ}(ν + ζ + ξ) + ζ. Hence, the computation latency of each replica is:

C_{pre}^{L} = \max_{B_r \in B} \left\{ \frac{\frac{S_b}{σ}(ν + ζ + ξ) + ζ}{B_r} \right\}    (4.15)
T_{par}^{L} = \max_{B_r,B_{r'} \in B} \left\{ \frac{S_b}{I_{B_r,B_{r'}}} \right\}    (4.16)
In this phase, the primary node needs to verify 2f matching PREPARE messages, and each replica generates e − 1 MACs for all other consensus nodes as well as verifies 2f MACs. Thus, the cost of the primary node is \frac{2C-2}{3}ζ, and the cost of each replica can be denoted as \frac{4C-4}{3}ζ. Hence, the computation latency in this phase is:

C_{par}^{L} = \max_{B_r \in B} \left\{ \frac{\frac{4C-4}{3}ζ}{B_r} \right\}    (4.17)
T_{com}^{L} = \max_{B_r,B_{r'} \in B} \left\{ \frac{S_b}{I_{B_r,B_{r'}}} \right\}    (4.18)

Therefore, the costs of the primary node and each replica are \frac{4C-1}{3}ζ. The computation latency in this phase is:

C_{com}^{L} = \max_{B_r \in B} \left\{ \frac{\frac{4C-1}{3}ζ}{B_r} \right\}    (4.19)
6) Theoretical analysis for the reply procedure: After the request message operation is executed, each node sends a reply message ⟨REPLY, bl, c⟩_{χ_{c,m}} to the primary node, where bl is the ID of a valid block and c is the node's ID. After receiving the reply messages, the primary node needs to verify 2f valid reply messages and S_b/σ transactions in a block in order to confirm that the smart contract is executed successfully in each replica. If the S_b/σ transactions match the f + 1 reply messages, the primary node accepts this block and uses the information inside the block to update its network view. In this phase, the transmission latency can be defined as:

T_{rep}^{L} = \max_{B_r,B_p \in B} \left\{ \frac{S_b}{I_{B_r,B_p}} \right\}    (4.20)
In this phase, the primary and the replicas need to generate S_b/σ MACs for the transactions' operation. Therefore, the costs of the primary node and each replica are \frac{S_b}{σ}ζ. Hence, the computation cost in this phase is:

C_{rep}^{L} = \max_{B_r \in B} \left\{ \frac{\frac{S_b}{σ}ζ}{B_r} \right\}    (4.21)
Here, we need to point out that the variables B_r and B_p belong to the set B. Meanwhile, these variables are all time-varying. As shown in equation (4.9), they follow the transition probability matrix [g_xy]_{X×X}.
In this section, we analyze the performance of the BMEC-FV system. For the
VANET MEC system, the QoS of the VANET node depends on the delay time
that the VANET node sends the requests to RSUs until it receives the blockchain
processing result. For the performance analysis of the blockchain system, we mainly
discuss three aspects: 1) the throughput of the blockchain system; 2) the blockchain
processing request delay, and 3) decentralization. In this section, we mainly analyze
these aspects.
For the VANET MEC system, its performance mainly depends on the three stages
of message transmission: 1) the VANET node transmits the request to the RSU MEC
server; 2) the selected block producer performs the calculation offload task (smart
contract), and 3) the task processing result is sent back to the VANET node. As
shown in Section 4.2.2, the transmission latency from the VANET node to each RSU
is T_{rel}^{L}.
Based on the previous analysis, the primary node will package the offloading
request into blocks in the pre-prepare phase, and perform verification to execute the
smart contract. Therefore, the cost of the primary node handling a block request at
Sb
this stage is σ
(ν + ζ + ξ). Furthermore, the processing delay of the primary node
includes two parts: execution delay and queue delay. The execution delay can be
defined as follows:
Sb (ν + ζ + ξ)
Ted = (4.22)
Bp σ
As described before, there are L blocks in the RBFT consensus mechanism. The
queuing delay is:
T_{qu} = \frac{1}{2}(L - 1)\frac{S_b(ν + ζ + ξ)}{B_p σ}    (4.23)
In this analysis, we do not consider the sending back procedure as in [86], because
the size of the output may be much smaller than the input data. Therefore, the
processing delay of the primary node is T_{pd} = T_{rel}^{L} + T_{ed} + T_{qu}.
C_p = \left(\frac{2D}{g} + 3C + \frac{S_b}{σ}\right)ζ + \left(\frac{2D}{g} + \frac{C+5}{3} + \frac{S_b}{σ}\right)ξ + \frac{S_b}{σ}ν    (4.24)

C_r = \left(\frac{2D}{g} + 3C - 1 + \frac{2S_b}{σ}\right)ζ + \left(\frac{2D}{g} + \frac{C+5}{3} + \frac{S_b}{σ}\right)ξ + \frac{S_b}{σ}ν    (4.25)

BT_{blockchain} = \min\left\{ \frac{hη}{C_p}, \frac{hη}{C_r} \right\} \text{ tx/s}    (4.26)
Here, the BTblockchain is used to estimate the number of transactions that the
blockchain system can process per second.
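The throughput estimate in (4.24)-(4.26) translates directly into a few lines of arithmetic, as sketched below; the parameter values in the example call are purely illustrative.

def blockchain_throughput(D, g, C, S_b, sigma, zeta, xi, nu, h, eta):
    # C_p and C_r follow (4.24) and (4.25); the returned value follows (4.26).
    tx_per_block = S_b / sigma
    c_p = ((2 * D / g) + 3 * C + tx_per_block) * zeta \
          + ((2 * D / g) + (C + 5) / 3 + tx_per_block) * xi \
          + tx_per_block * nu
    c_r = ((2 * D / g) + 3 * C - 1 + 2 * tx_per_block) * zeta \
          + ((2 * D / g) + (C + 5) / 3 + tx_per_block) * xi \
          + tx_per_block * nu
    return min(h * eta / c_p, h * eta / c_r)   # transactions per second

# Example with illustrative parameters (cycle counts in CPU cycles, eta in cycles/s):
print(blockchain_throughput(D=100, g=0.9, C=10, S_b=4096, sigma=256,
                            zeta=2e3, xi=5e4, nu=1e4, h=7, eta=2.4e9))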
2) The blockchain processing request delay: In order to ensure the security of
transactions in the block, it is critical to ensure that the transaction is not arbitrarily
altered or revoked. The handling request latency is the time needed to ensure that transactions in the block are not maliciously tampered with or revoked. In an actual VANET environment, if a node's request is maliciously falsified or revoked, the entire processing request delay becomes longer, which cannot be tolerated given the low latency required in a VANET environment. Therefore, we consider the processing request delay
as an important factor in measuring the performance of the BMEC-FV system.
We assume that the timeout of transmission latency in RBFT consensus is φtl .
Hence, the transmission latency Ttl can be defined as follows:
T_{tl} = \min\{T_{rel}^{L}, φ_{tl}\} + \min\{T_{pro}^{L}, φ_{tl}\} + \min\{T_{pre}^{L}, φ_{tl}\} + \min\{T_{par}^{L}, φ_{tl}\} + \min\{T_{com}^{L}, φ_{tl}\} + \min\{T_{rep}^{L}, φ_{tl}\}    (4.27)

Similarly, we assume that the timeout of the computation delay is φ_{cl}. The computation latency T_{cl} can be depicted as follows:

T_{cl} = \min\{C_{rel}^{L}, φ_{cl}\} + \min\{C_{pro}^{L}, φ_{cl}\} + \min\{C_{pre}^{L}, φ_{cl}\} + \min\{C_{par}^{L}, φ_{cl}\} + \min\{C_{com}^{L}, φ_{cl}\} + \min\{C_{rep}^{L}, φ_{cl}\}    (4.28)
The processing request delay Tprd includes two parts: the transmission latency Ttl
and computation latency Tcl in the BCs.
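A compact sketch of (4.27)-(4.28), assuming that the six per-phase latencies are supplied as lists and that the processing request delay is simply the sum of the two capped totals.

def processing_request_delay(transmission_latencies, computation_latencies,
                             phi_tl, phi_cl):
    # Each of the six phase latencies is capped by its timeout, as in (4.27)-(4.28)
    t_tl = sum(min(t, phi_tl) for t in transmission_latencies)
    t_cl = sum(min(c, phi_cl) for c in computation_latencies)
    return t_tl + t_cl   # T_prd = T_tl + T_cl

# Example with six illustrative per-phase latencies (seconds):
print(processing_request_delay([0.01, 0.02, 0.03, 0.02, 0.02, 0.01],
                               [0.005, 0.01, 0.02, 0.01, 0.01, 0.004],
                               phi_tl=0.05, phi_cl=0.03))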
It is worth noting that the value range of G(t) is [0, 1]. The smaller the value of
G(t), the better the decentralization of the blockchain nodes. Conversely, it represents
the higher degree of centralization of the blockchain nodes. In particular, when the
Gini coefficient of the blockchain system is equal to zero, it represents that the number
of blocks produced by each block producer in the consensus phase is equal. In other
words, this situation represents absolute fairness in the blockchain system. If the Gini
coefficient is equal to 1, it means that only one block producer produces a block. In
other words, the blockchain system is now a centralized system, which is contrary to
the nature of the blockchain. Therefore, we need to avoid the Gini coefficient being
too large. In order to ensure the decentralization of the blockchain system, we define
the following restrictions:
G(t) ≤ κ (4.31)
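For intuition, the decentralization check G(t) ≤ κ can be sketched with a common mean-absolute-difference form of the Gini coefficient over the per-producer block counts; this is only an illustration and may differ from the exact definition of G(t) used in this chapter (for instance, this form reaches (n − 1)/n rather than exactly 1 when a single producer generates all blocks).

def gini(block_counts):
    # Mean-absolute-difference form of the Gini coefficient (illustrative definition).
    n, total = len(block_counts), sum(block_counts)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(x - y) for x in block_counts for y in block_counts)
    return diff_sum / (2.0 * n * total)

def satisfies_decentralization(block_counts, kappa):
    # Enforce the constraint G(t) <= kappa from (4.31)
    return gini(block_counts) <= kappa

# Equal production gives G = 0; concentrating all blocks on one producer maximizes G.
print(gini([5, 5, 5, 5]), gini([20, 0, 0, 0]))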
[Figure: The deep reinforcement learning framework for BMEC-FV: the evaluated and target Q-values feed the loss function for training, the value and advantage functions are combined into the evaluated Q-value, and samples (s, a, r, s') are drawn from an experience replay pool.]
Let S = {s(t), t ∈ T } represent the state space of BMEC-FV. Here, s(t) denotes
the system state at time slot t. The state of the system includes the reliability of each
VANET node, the SNR between the VANET node and the RSU, the trust feature of
each block producer, SNR between the RSUs, and the computing capacity assigned to
different messages by RSUs. Hence, the system state space can be defined as follows:
S(t) = \{ α_{AB_i}^{1}(t), α_{AB_i}^{2}(t), ..., α_{AB_i}^{u}(t), ..., α_{AB_i}^{U}(t);
δ_{V_1}(t), δ_{V_2}(t), ..., δ_{V_u}(t), ..., δ_{V_U}(t);
δ_{B_1}(t), δ_{B_2}(t), ..., δ_{B_c}(t), ..., δ_{B_C}(t);    (4.32)
φ_{B_1}(t), φ_{B_2}(t), ..., φ_{B_c}(t), ..., φ_{B_C}(t);
B_1(t), B_2(t), ..., B_c(t), ..., B_C(t) \}
where \delta_{V_u}(t) = \{\delta_{V_u,B_c}(t), B_c \in B\}, \delta_{B_c}(t) = \{\delta_{B_c,B_{c'}}(t), B_c \in B\}, and B_c(t) = \{B_c(t)\}. Here, we need to point out that, due to the fault tolerance of RBFT, each block producer participating in consensus may become a Byzantine node (notation h in equation (4.26)), so we consider the trust feature \phi_{B_c}(t) of each block producer.
Considering the time correlation of the trust feature of each block producer, we apply a Markov state transition matrix to model the trust of block producers. For any block producer c, let j_{xy} denote the transition probability of \phi_{B_c}(t) from state x to the next state y. The J \times J state transition matrix of \phi_{B_c}(t) can then be expressed as [j_{xy}]_{J \times J}, where j_{xy} = \Pr\{\phi_{B_c}(t+1) = y \mid \phi_{B_c}(t) = x\}.
In our proposed BMEC-FV, the action space of the system includes the selection of
trusted VANET nodes, the allocation of subchannels, the block size, and the number
of blocks successfully generated by the blockchain nodes. Let A = {A(t), t ∈ T }
denote the action space of BMEC-FV, which can be defined as follows:
A(t) = \{A_{\alpha_{AB_i}^{1}}(t), A_{\alpha_{AB_i}^{2}}(t), \ldots, A_{\alpha_{AB_i}^{u}}(t), \ldots, A_{\alpha_{AB_i}^{U}}(t);
        S_{V_1}(t), S_{V_2}(t), \ldots, S_{V_u}(t), \ldots, S_{V_U}(t);
        S_{B_1}(t), S_{B_2}(t), \ldots, S_{B_c}(t), \ldots, S_{B_C}(t); S_b(t); L(t)\}    (4.33)
where A_{\alpha_{AB_i}^{u}}(t) represents the selection of a trusted VANET node. The reliability of
each VANET node should be higher than the threshold ξ, which can ensure the trusted
communication between the VANET node and the RSUs. After establishing trusted
communication links, SVu (t) and SBc (t) represent sub-channel allocations between the
trusted VANET nodes and the RSUs, respectively. In particular, equation (4.5) states
that the sub-channels to which both are allocated cannot exceed the total channel
capacity S. Sb (t) represents the block size generated at time t, and L(t) represents the
number of successfully generated blocks at time t. In particular, the replicas generate
blocks in order, and only one primary node exists while replicas spawn blocks. In
other words, for the primary node Bp (t) at time t, the number of blocks produced by
In this research, we need to solve the joint optimization problem of RSU MEC
and the blockchain system in the BMEC-FV system. We aim to maximize the perfor-
mance of the entire system through decision making in action space and state space.
Therefore, the reward function can be defined as:
P1: \max_{A} \; Q(S, A)
s.t. \; C1: A_{\alpha_{AB_i}^{u}}(t) \ge \xi, \ \forall u \in U
     \; C2: T_{prd}^{l} \le T_{max}, \ \forall l \in L(t)
     \; C3: G(t) \le \kappa
     \; C4: \sum_{u=1}^{U} S_{V_u} + \sum_{c=1}^{C} S_{B_c} \le S
     \; C5: \sum_{u \in U} I_{V_u,B_c} + \sum_{c' \in C} I_{B_{c'},B_c} \ge I_{B_c,B_{c''}}    (4.34)
where Q(S, A) = \sum_{t=1}^{T-1} \gamma^{t} r(t) denotes the long-term reward of the BMEC-FV system. Here, \gamma \in (0, 1) is a discount factor, so \gamma^{t} approaches zero when t is large enough. The learning agent senses the system state S(t) from the environment at time slot t, and then outputs a policy \pi that determines which action should be executed from the system action vector A(t). The reward value is returned to the learning agent at time slot t. The system state then changes to the next state S(t + 1), and the learning agent outputs a new policy and obtains a new immediate reward r(t + 1).
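As a minimal sketch (assuming the per-slot rewards r(t) are already available), the long-term reward objective in (4.34) can be computed as:

```python
def long_term_reward(rewards, gamma=0.95):
    """Discounted return sum_{t=1}^{T-1} gamma^t r(t); rewards[t-1] holds r(t)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards, start=1))

# Hypothetical per-slot rewards r(1), ..., r(5)
print(long_term_reward([1.0, 0.0, 2.0, 1.5, 0.5], gamma=0.9))
```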
According to (4.34), A_{\alpha_{AB_i}^{u}}(t) in constraint C1 represents the reliability of the VANET node; RSUs prefer to select VANET nodes above the threshold \xi for channel allocation. T_{prd}^{l} in constraint C2 represents the processing request delay for the lth block, which should not exceed the timeout T_{max}. C3 determines the degree of decentralization of the blockchain nodes. C4 shows that the allocation of sub-channels should not exceed the total channel capacity S. C5 specifies the qualification of the data transmission rate. The performance of the BMEC-FV system is very low when C1–C5 are not satisfied. Therefore, we define the immediate reward
function as follows:
r(t) = \begin{cases} \Upsilon \, BT_{blockchain} + \Omega \, \dfrac{1}{T_{pd}}, & \text{C1--C5 satisfied,} \\ 0, & \text{otherwise} \end{cases}    (4.35)
where \Upsilon and \Omega are two weighting factors (\Upsilon, \Omega \in [0, 1]) corresponding to the BCs and the MEC servers in the RSUs, with \Upsilon + \Omega = 1. It is worth noting that these two weights change dynamically, which means that the BMEC-FV system dynamically balances these two components.
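A minimal sketch of the immediate reward in (4.35); the constraint check C1–C5 is abstracted into a single boolean, and all numeric values are hypothetical:

```python
def immediate_reward(bt_blockchain, t_pd, upsilon, omega, constraints_ok):
    """Immediate reward r(t) of (4.35): a weighted combination of blockchain
    throughput and MEC delay performance when C1-C5 hold, zero otherwise.
    upsilon + omega is assumed to equal 1."""
    if not constraints_ok:
        return 0.0
    return upsilon * bt_blockchain + omega * (1.0 / t_pd)

# Hypothetical example: 800 tx/s throughput, 0.2 s processing request delay
print(immediate_reward(800.0, 0.2, upsilon=0.6, omega=0.4, constraints_ok=True))
```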
In this section, we focus on the application of the deep compressed neural network
in our proposed BMEC-FV system and the complexity analysis of the algorithm. In
particular, we use the algorithm of Dueling Deep Q-learning (DDQL) to establish a
deep convolutional neural network for the BMEC-FV system.
The deep Q-learning (DQL) concept introduced in [49] aims to solve the instability of a traditional Q-network. There are two important improvements compared with traditional reinforcement learning methods: experience replay and the target Q-network. Experience replay stores experienced transitions and then randomly samples from the pool; it therefore reduces the correlation of the training data and improves performance compared with previous reinforcement learning algorithms [81]. Meanwhile, the target Q-network is another improvement in deep Q-learning: a target Q-value is calculated using a dedicated target Q-network rather than directly using the pre-updated Q-network. The purpose of this is to reduce the correlation between the target calculation and the current value.
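A generic experience-replay buffer might be sketched as follows (an illustration of the idea, not the dissertation's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size pool of transitions (s, a, r, s_next); uniform random
    sampling breaks the temporal correlation of consecutive experiences."""

    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.pool.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        return random.sample(self.pool, min(batch_size, len(self.pool)))
```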
Although deep Q-learning is a big improvement over traditional reinforcement learning methods, many researchers continue to pursue even greater performance and higher stability. Here, we introduce a recent improvement: dueling deep Q-learning [2]. The core idea of this approach is that it does not need to estimate the value of taking each available action in every state; for some states, the choice of action has little influence on the value of the state itself. Thus, the network architecture of DDQL can be divided into two main components: a value function and an advantage function. The value function represents how good it is to be in a given state, and the advantage function measures the relative importance of a certain action compared with other actions. After the value function and the advantage function are computed separately, their results are combined in the final layer to calculate the final Q-value. This mechanism leads to better policy evaluation in DDQL compared with DQL.
In the DDQN, the value function V(S(t)) and the advantage function A(S(t), A(t)) are aggregated as the final output Q(S(t), A(t)). The dueling network architecture based on a pruning method is shown in Fig. 4.3. The final Q-value under policy \pi at time slot t can be denoted as:

Q^{\pi}(S(t), A(t)) = V^{\pi}(S(t)) + \Big( A^{\pi}(S(t), A(t); \omega, \iota) - \frac{1}{|A|} \sum_{A(t+1)} A^{\pi}(S(t), A(t+1); \omega, \iota) \Big)    (4.37)
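A short numpy sketch of the aggregation in (4.37), with the value stream and the mean-centred advantage stream combined into the final Q-values (a generic illustration of the dueling idea, not the exact network of Fig. 4.3):

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Combine the scalar state value V(s) with the advantage vector A(s, .)
    as in (4.37): Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))."""
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())

# Hypothetical outputs of the two streams for one state
print(dueling_q_values(value=1.2, advantages=[0.5, -0.1, 0.3, 0.0]))
```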
reserved convolution kernel is also associated with the number of convolution kernels pruned from the previous convolutional layer, thus indirectly reducing the convolution kernels of the layer along the channel dimension. In particular, F_{x,y} represents a convolution kernel of the xth layer, and we characterize the importance of each convolution kernel by \sum |F_{x,y}| (the sum of the absolute values of the weights in the kernel, i.e., its L1 norm). In summary, the pruning method we use reduces the complexity of the algorithm. Theorem 2 is proved.
The simulation has been implemented on a GPU-based server with an Intel(R) Core(TM) i7-6600 CPU and 16 GB of memory. The software environment is TensorFlow 1.6.0 [71] with Python 3.6 on a Windows 10 64-bit operating system. These two tools have been widely used in industry and academia: TensorFlow is an open-source software library for machine learning and has been widely used to deploy new machine learning algorithms and experiments. Therefore, the system performance of our proposed BMEC-FV obtained in simulation can reasonably estimate and approximate its performance in realistic networks.
In the BMEC-FV simulation environment, there are 4 RSUs and the number of
VANET nodes is 32. Each RSU is equipped with a MEC server.

[Figure 4.4: Simulation architecture — each RSU with a MEC server runs the Python scripts localRSUMEC.py, implementcontract.py, and duelingDQNpruning.py, and communicates with the blockchain system via RPC on port 8545.]

In the simulation, we consider that these 4 RSUs are all block producers. For each VANET node, the
reliability of each node changes with time. Therefore, the reliability of each node can
be trustworthy (trust level higher than 0.6), suspect (trust level between [ξ, 0.6]), and
un-reliable (trust level lower than ξ). In our simulation, we assume that the threshold
of trust level ξ is 0.5. Trustworthy vehicles are trusted nodes that aim to establish
a secure communication among multiple RSUs and VANET nodes. The proposed
BMEC-FV provides a method for good entities to avoid working with suspected and
malicious vehicles. Each vehicle may either keep its state unchanged or change its state at the next time slot t. Therefore, we set a transition probability for each vehicle. An example of the transition probability matrix for each vehicle
in an area can be set as Λ = ((0.1, 0.3, 0.5, 0.1), (0.5, 0.2, 0.2, 0.1), (0.6, 0.2, 0.1, 0.1),
(0.3, 0.6, 0.05, 0.05)).
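For illustration, a short numpy sketch of drawing each vehicle's next trust state from a row-stochastic transition matrix such as Λ (state indices are used in place of the trust labels; the loop length is arbitrary):

```python
import numpy as np

# Transition matrix from the simulation setup; each row sums to 1.
LAMBDA = np.array([[0.10, 0.30, 0.50, 0.10],
                   [0.50, 0.20, 0.20, 0.10],
                   [0.60, 0.20, 0.10, 0.10],
                   [0.30, 0.60, 0.05, 0.05]])

def next_state(current_state, transition, rng=None):
    """Sample the next trust-state index given the current state index."""
    rng = rng or np.random.default_rng()
    return rng.choice(len(transition), p=transition[current_state])

state = 0
for _ in range(5):            # simulate a few time slots
    state = next_state(state, LAMBDA)
    print(state)
```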
The computing capability of each RSU MEC server can be set to a high, medium, or low level. The transition probability matrix of each server can be defined as Q = [g_{xy}] = ((0.7, 0.1, 0.2), (0.65, 0.15, 0.2), (0.35, 0.5, 0.15)).
In our simulation, the trust feature of each node in the blockchain can be characterized as trustworthy, suspect, or un-trusted. The most trusted node can be selected as the primary. The transition probability matrix of each node is Ψ = ((0.6, 0.1, 0.2, 0.1), (0.65, 0.1, 0.15, 0.1), (0.5, 0.25, 0.15, 0.1), (0.6, 0.1, 0.1, 0.2)).
For the channel state, the state of each channel follows a Markov decision process. In the simulation, we assume four different scenarios, each with a different level of SNR values, labeled SNR1 through SNR4.
We use three convolution layers, three max-pooling layers, and two fully connected layers in both the evaluation network and the target network. Specifically, the pruning plan for the CNN is to start pruning at step 2000, stop at step 4000, and prune once every 100 steps.
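This schedule reduces to a simple predicate over the training step (a sketch mirroring the numbers in the text):

```python
PRUNE_START = 2000   # training step at which pruning begins
PRUNE_END = 4000     # training step at which pruning stops
PRUNE_EVERY = 100    # apply a pruning pass once every 100 steps

def should_prune(step):
    """Return True on the training steps where a pruning pass is applied."""
    return PRUNE_START <= step <= PRUNE_END and step % PRUNE_EVERY == 0
```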
In our simulation, the pruning method adds a binary mask variable to each layer selected for pruning. The mask has exactly the same shape as the layer's weight tensor and determines which weights participate in the weight update. The mask update algorithm injects a special operator into the TensorFlow training graph and sorts the weights of the current layer by absolute value; for weights whose magnitude is less than a certain threshold, the corresponding mask value is set to 0. The back-propagated gradient also passes through the mask variable, so a masked weight (mask equal to 0) does not receive an update in the back-propagation step. We use this method to prune the deep Q-network parameters. In our simulations, the architecture of the evaluation network is the same as that of the target Q-network; however, only the evaluation network is trained with gradient descent, and we replace the target Q-network with the trained Q-network weights every 5 steps (after the first 200 steps).
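A numpy sketch of the magnitude-based masking logic described above (illustrative only; in the actual setup the equivalent operator is injected into the TensorFlow training graph):

```python
import numpy as np

def update_mask(weights, sparsity):
    """Build a binary mask with the same shape as the weight tensor, zeroing
    the fraction `sparsity` of weights with the smallest absolute value."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.sort(flat)[k] if k > 0 else -np.inf
    return (np.abs(weights) >= threshold).astype(weights.dtype)

def masked_step(weights, grads, mask, lr=1e-4):
    """Gradient step in which masked (zeroed) weights receive no update."""
    return weights - lr * grads * mask

w = np.random.randn(4, 4).astype(np.float32)
g = np.random.randn(4, 4).astype(np.float32)
m = update_mask(w, sparsity=0.5)
w = masked_step(w * m, g, m)    # apply the mask, then a masked update
```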
As shown in Fig. 4.4, each RSU with the MEC server will run a blockchain node
and will form the network. Meanwhile, the communication between each RSU and
blockchain system is done via RPC (remote procedure call). In our simulation, we need to deploy our smart-contract logic to the blockchain system; the contract contains a mapping from every account (inside the local MEC servers) to its messages, transactions, and tags. Specifically, implementcontract.py is a Python script for sending our contract to the blockchain system; localRSUMEC.py is used to store transactions and to
send these transactions to the blockchain system; duelingDQNpruning.py aims to use
dueling DQN with pruning method to get the BMEC-FV system state space and to
learn an optimal policy in order to maximize the system performance.
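The RPC call itself might look like the following sketch; the endpoint, the method name (eth_sendTransaction), and the payload fields assume a standard Ethereum-style JSON-RPC interface on port 8545 and are not taken from the dissertation's scripts:

```python
import json
import requests  # assumed to be available in the simulation environment

RPC_URL = "http://127.0.0.1:8545"   # RPC port shown in Fig. 4.4

def send_transaction(from_addr, to_addr, data_hex):
    """Forward one transaction to the blockchain node over JSON-RPC."""
    payload = {
        "jsonrpc": "2.0",
        "method": "eth_sendTransaction",   # assumed Ethereum-style method
        "params": [{"from": from_addr, "to": to_addr, "data": data_hex}],
        "id": 1,
    }
    resp = requests.post(RPC_URL, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    return resp.json()
```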
In our simulation, there are four schemes simulated for the performance compar-
ison:
• Proposed DDQL-based scheme with pruning method using CNN. In this scheme,
each learning agent (RSU with MEC server) selects the most reliable vehicles, an
appropriate number of sub-channels allocation, the optimal block size, and the
number of consecutively produced blocks for each replica. The deep Q-network architecture used in this scheme is a CNN with the pruning method. Compared with a traditional deep Q-network, a CNN with pruning can perform more efficient feature extraction, which improves training efficiency. Therefore, this scheme should have the best performance.
• Existing scheme 1 [73] with local computational capabilities, without the selec-
tion of reliable vehicles, random block size, and a fixed block number. Mean-
while, the DDQL approach without the pruning method using traditional deep
neural network is also employed in this scheme. This scheme allows the con-
nected vehicles to communicate with each RSU randomly without considering
the reliable features. Compared with the DDQL-based scheme with pruning
method, the advantage of using DDQL with pruning can be shown.
Figure 4.5: Training curves tracking the loss of BMEC-FV under different schemes (loss vs. training steps; curves: proposed DDQL-based scheme with 10 hidden layers, with CNN without pruning, and with CNN and pruning).
• Existing scheme 2 [89] with local computational capabilities, without the selec-
tion of reliable vehicles, random block size, and a fixed block number. Mean-
while, the DDQL approach is not employed in this scheme. This scheme uses
the random selection algorithm (RSA) [89] for resource allocation. Compared
with the DDQL-based scheme with pruning, the advantage of using DDQL with
pruning method can be shown.
[Figure 4.6: Long-term reward vs. training steps for the proposed DDQL-based schemes (CNN with pruning, CNN without pruning, 10 hidden layers) and existing scheme 1.]
training loss decreases at the same time. Such increasing and decreasing behavior of the training loss illustrates the effectiveness of the deep neural networks. Moreover, from Fig. 4.5, we can see that our proposed scheme with the pruning method performs better than the schemes without pruning. This is because the pruning method eliminates unnecessary values in the weight tensors of the CNN, reduces the number of connections between the neural network layers, and reduces the parameters involved in the calculation, thereby reducing the number of operations.
Fig. 4.6 shows the system performance comparison under different schemes. In our simulation, each point is the average long-term reward per episode, and the agent uses the Adam optimizer [72]. At the beginning of training, the proposed schemes go through some trial and error. As the number of training episodes increases, the system long-term reward becomes stable, which means that the agent has learned the optimal policies that maximize the long-term reward. As shown in this figure, we can also
conclude that the system reward of our proposed schemes is higher than that of existing scheme 1.

[Figure 4.7: Long-term reward training curves of the proposed DDQL-based scheme with pruning under learning rates of 0.01, 0.001, 0.0001, and 0.00001.]

There are two reasons for the higher system performance: 1) our proposed
schemes use the RBFT consensus mechanism. Compared with PBFT, RBFT adds an important transaction-verification step, ensuring consensus on the order of transaction execution as well as on the results of block verification, thus improving the consensus efficiency and system throughput; 2) since the existing
scheme does not consider the reliable feature of each vehicle, the malicious vehicles
will degrade the system throughput. Meanwhile, the proposed scheme uses dynamic
adaptive subchannel allocation, which can effectively reduce the communication delay.
Moreover, the proposed scheme can adaptively adjust the block size and the number
of generated blocks, thereby improving the performance of the entire system.
Fig. 4.7 shows the reward convergence performance of the proposed DDQL-based
scheme using the pruning method with different learning rates. The learning rate
determines the update speed of the weights in the deep Q-network. If the learning rate is too large, the updates may overshoot the optimal value; otherwise, if it is too small, it noticeably slows the convergence of the Q-network.

[Figure 4.8: Long-term reward of BMEC-FV vs. the average transaction size (Byte) for the DDQL-based scheme with CNN and pruning, the DDQL-based scheme with CNN without pruning, the DDQL-based scheme with 10 hidden layers, existing scheme 1, and existing scheme 2.]

From this figure, when
learning rates are 0.01 and 0.001, the training curves fluctuate heavily, which can cause the learning to miss the global optimum. Comparing the yellow and purple curves, although the yellow curve converges faster, it is less stable after convergence than the purple curve. Therefore, we choose a learning rate of 1e−4 because of its acceptable convergence speed and better learning stability.
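In TensorFlow 1.x, the chosen learning rate is simply passed to the optimizer when the evaluation network's training operation is built; the toy loss below is only a stand-in for the DDQL loss:

```python
import tensorflow as tf  # TensorFlow 1.x API, as used in the simulations

# Stand-in network: a single linear layer mapping a 4-dimensional state to Q.
s = tf.placeholder(tf.float32, [None, 4])
q_target = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.random_normal([4, 1]))
q_eval = tf.matmul(s, w)

loss = tf.reduce_mean(tf.squared_difference(q_target, q_eval))
train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)
```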
Fig. 4.8 depicts the relationship between the BMEC-FV system performance and the average transaction size. This figure shows the performance of the proposed method under different transaction sizes, which correspond to the offloading tasks in BMEC-FV systems. From this figure, we can see that as the average transaction size increases, the system performance drops significantly. There are two reasons for this: 1) When the size of transactions increases, a block can contain fewer transactions; in other words, the number of blocks in the system increases significantly, and when the system processes more blocks, its performance decreases.
[Figure 4.9: Long-term reward of BMEC-FV vs. the processing request delay for the proposed schemes, existing scheme 1, and existing scheme 2.]
2) The transmission delay from VANET nodes to RSUs increases, which reduces the performance of the MEC system. In terms of the long-term benefits of the BMEC-FV system, our proposed scheme achieves the highest long-term reward. The existing schemes have the worst performance due to their inability to ensure the security of the communication link, the uncertainty of the transaction size, and the fixed number of blocks.
Fig. 4.9 shows the relationship between the performance of the BMEC-FV and the processing request delay of the blockchain. From this figure, we can see that the long-term reward of the system increases as the processing request delay threshold increases and eventually remains stable. We specify delay constraints for the processing request delay in the constraints of the reward function; a looser threshold adds fewer penalties to the reward, which naturally leads to higher performance of the BMEC-FV system. However, when the threshold for the processing request delay is large enough, its impact on the reward is small. Meanwhile, we can see that our proposed pruning-based scheme outperforms the existing solutions in performance.

[Figure 4.10: Long-term reward of BMEC-FV vs. the number of vehicles for the proposed DDQL-based schemes with CNN (with and without pruning), existing scheme 1, and existing scheme 2.]

The reason is that the proposed scheme not only ensures secure communication and channel allocation between the connected-vehicle nodes and the RSUs, but also dynamically adjusts the block size and the
number of generated blocks, thereby adjusting the block interval so that the threshold for the processing request delay is strictly respected and block loss is avoided. In addition, we found that the existing schemes have the worst performance, which reveals the superiority of the DDQL-based method. The proposed DDQL-based scheme with CNN and pruning consistently maintains the best performance.
Fig. 4.10 plots the rewards to show the scalability of the BMEC-FV system and
the ability of the system to handle dynamic changes in the number of VANET nodes.
In our simulation, we first train the BMEC-FV framework with 32 VANET nodes and
4 RSUs in an off-line mode. This off-line training framework is then used to process
the VANET environment online, where the number of VANET nodes U changes from 2 to 32 and the number of RSUs C is fixed at 4; in other words, the curves in this figure are obtained online.

[Figure 4.11: Long-term reward of BMEC-FV vs. training steps for the DDQL-based scheme with CNN and pruning under SNR1, SNR2, SNR3, and SNR4.]

From this figure, we can conclude that even with a large number of VANET nodes, the proposed methods obtain the highest long-term reward, while
RSA obtains the lowest return. The performance of existing scheme 1 is better than RSA but still lower than that of our proposed methods. This figure shows the superiority of the proposed schemes and also demonstrates the scalability of the proposed framework. Meanwhile, as the number of VANET nodes increases, the reward of these schemes is reduced. The reasons are: 1) the average resource per node is reduced, resulting in higher latency, and 2) more blocks may be ignored, resulting in smaller throughput for the entire system. Therefore, the long-term rewards are reduced. Finally, this figure shows that the off-line trained model can indeed be used online, so it can be applied to actual VANET scenarios.
In Fig. 4.11, the performance of the proposed scheme is verified by comparing it with existing algorithm 1 in several cases with different SNR state settings. In the simulation, whenever the SNR changes dynamically, we retrain the proposed BMEC-FV architecture in order to verify the adaptability of the proposed scheme to SNR dynamics. First, the scenario with SNR4 obtains the best return.
This is because in this case, the system can obtain a better SNR state, which leads
to greater throughput and lower latency, resulting in a greater long-term reward.
Second, it can be seen that as the number of SNR states increases, the proposed scheme exhibits different convergence behavior; the reason is that, as the state-space dimension rises, the agent needs a longer learning process to obtain the optimal policy. Third, the convergence speed of these algorithms differs. This is because the numbers of states, actions, and steps in an episode are much larger than the dimension of the state set, so this effect can be ignored by the law of large numbers. Fourth, due to the superiority of DRL, the proposed scheme achieves the best performance, which also indicates its ability to adapt to different network dynamics. It should be pointed out that, since existing scheme 2 does not apply the DDQL method, its performance is not compared here.
Chapter 5
5.1 Conclusions
In summary, we first applied the DQL method to the centralized SDV, aiming to obtain the best ETX link strategy. However, due to the shortcomings of the centralized SDV (the specific defects were discussed earlier in this dissertation), we further adopted a distributed SDV approach to make up for the shortcomings of the centralized SDV. At the same time, we introduced blockchain technology to secure the interaction of distributed SDV controllers. Finally, we added MEC on top of the distributed SDV and blockchain to complete more complex tasks and provided a new, more reliable trust-calculation method through computation offloading to improve system performance.
A number of interesting research problems arise during the course of the investi-
gations reported in this dissertation.
• In our research, we have introduced blockchain technology into the current connected-vehicle system. However, due to open problems in blockchain technology itself, there are still many directions worth studying in the future. The first is the storage of blockchain data. The internet era relies on cloud storage; where will the massive data of the blockchain era be stored? At present, the IPFS protocol and the FileCoin project, which address the blockchain storage problem, have attracted much attention. The second is the transactions-per-second (TPS) problem of the blockchain. Traditional centralized systems can process tens of thousands of transactions per second, whereas blockchains are far slower: Bitcoin handles single-digit and Ethereum double-digit TPS. Consensus security and transaction processing speed are generally in tension, so one is often sacrificed for the other. The DPOS mechanism appears to solve the transaction-speed problem, but whether it can run securely over the long term and gain public acceptance will take a long time to verify. Faster consensus resolution of transaction processing is another direction, but its feasibility also remains to be validated by the market. Therefore, in future research, storing and sharing Internet of Vehicles data in a blockchain system will still require higher TPS and better blockchain data-storage technology.
List of References
[3] Z. Zheng, S. Xie, H.-N. Dai, X. Chen, and H. Wang, “Blockchain Challenges
and Opportunities: A Survey,” International Journal of Web and Grid Services,
vol. 14, no. 4, pp. 352–375, Jan. 2018.
[4] R. Yang, F. R. Yu, P. Si, Z. Yang, and Y. Zhang, “Integrated Blockchain and
Edge Computing Systems: A Survey, Some Research Issues and Challenges,”
IEEE Comm. Survey and Tutorials, vol. 21, no. 2, pp. 1508–1532, Jan. 2019.
[12] H. Shafiq, R. A. Rehman, and B.-S. Kim, “Services and Security Threats in SDN
Based VANETs: A Survey,” Wireless Comm. and Mobile Computing, vol. 124,
pp. 1–14, Apr. 2018.
[13] A. Alnasser, H. Sun, and J. Jiang, “Cyber Security Challenges and Solutions
for V2X Communications: A Survey,” Elsevier Computer Networks, vol. 151,
pp. 52–67, Mar. 2019.
[14] Z. Lu, G. Qu, and Z. Liu, “A Survey on Recent Advances in Vehicular Network
Security, Trust, and Privacy,” IEEE Trans. Intell. Transp. Sys., no. 99, pp. 1–17,
Apr. 2018.
[15] Q. Yan, F. R. Yu, Q. Gong, and J. Li, “Software-Defined Networking (SDN) and
Distributed Denial of Service (DDoS) Attacks in Cloud Computing Environ-
ments: A Survey, Some Research Issues, and Challenges,” IEEE Comm. Survey
and Tutorials, vol. 18, no. 1, pp. 602–622, Oct. 2015.
[21] S. Hassas Yeganeh and Y. Ganjali, “Kandoo: A Framework for Efficient and Scal-
able Offloading of Control Applications,” in Proc. ACM HotSDN’12, (Helsinki,
Finland), Aug. 2012.
[23] P. K. Sharma, M.-Y. Chen, and J. H. Park, “A Software Defined Fog Node
Based Distributed Blockchain Cloud Architecture for IoT,” IEEE Access, vol. 6,
pp. 115–124, Sept. 2018.
[27] M. Singh and S. Kim, “Blockchain Based Intelligent Vehicle Data Sharing Frame-
work,” arXiv preprint arXiv:1708.09721, June 2017.
[29] X. Tao, K. Ota, M. Dong, H. Qi, and K. Li, “Performance Guaranteed Com-
putation Offloading for Mobile-Edge Cloud Computing,” IEEE Wireless Comm.
Letters, vol. 6, no. 6, pp. 774–777, Dec. 2017.
Reinforcement Learning Approach,” IEEE Comm. Mag., vol. 55, no. 12, pp. 31–
37, Dec. 2017.
[37] S. Tan, X. Li, and Q. Dong, “A Trust Management System for Securing Data
Plane of Ad-Hoc Networks,” IEEE Trans. Veh. Tech., vol. 65, no. 9, pp. 7579–
7592, Sept. 2016.
[38] Y. He, F. R. Yu, Z. Wei, and V. Leung, “Trust Management for Secure Cognitive
Radio Vehicular Ad Hoc Networks,” Elsevier Ad Hoc Networks, vol. 86, pp. 154–
165, Apr. 2019.
[43] S. Chettibi and S. Chikhi, “Dynamic Fuzzy Logic and Reinforcement Learning
for Adaptive Energy Efficient Routing in Mobile Ad-Hoc Networks,” Applied Soft
Computing, vol. 38, pp. 321–328, Jan. 2016.
[45] G. M. Borkar and A. Mahajan, “A Secure and Trust Based On-Demand Mul-
tipath Routing Scheme for Self-Organized Mobile Ad-Hoc Networks,” Wireless
Networks, vol. 23, no. 8, pp. 2455–2472, May. 2017.
[46] L. Xiao, X. Lu, D. Xu, Y. Tang, L. Wang, and W. Zhuang, “UAV Relay in
VANETs Against Smart Jamming with Reinforcement Learning,” IEEE Trans.
Veh. Tech., vol. 67, no. 5, pp. 4087–4097, May. 2018.
[48] S. S. Mousavi, M. Schukat, and E. Howley, “Traffic Light Control Using Deep
Policy-Gradient and Value-Function-Based Reinforcement Learning,” IET In-
tell. Transp. Sys., vol. 11, no. 7, pp. 417–423, Sept. 2017.
[50] T. Y. He, N. Zhao, and H. Yin, “Integrated Networking, Caching and Comput-
ing for Connected Vehicles: A Deep Reinforcement Learning Approach,” IEEE
Trans. Veh. Tech., vol. 67, no. 1, pp. 44–55, Jan. 2018.
[53] H. Yao, C. Qiu, C. Zhao, and L. Shi, “A Multicontroller Load Balancing Ap-
proach in Software-Defined Wireless Networks,” International Journal of Dis-
tributed Sensor Net., vol. 11, no. 10, p. 454159, Oct. 2015.
[55] J. Liu, J. Wan, B. Zeng, Q. Wang, H. Song, and M. Qiu, “A Scalable and
Quick-Response Software Defined Vehicular Network Assisted by Mobile Edge
Computing,” IEEE Comm. Mag., vol. 55, no. 7, pp. 94–100, Jul. 2017.
[56] C.-M. Huang, M.-S. Chiang, D.-T. Dao, W.-L. Su, S. Xu, and H. Zhou, “V2V
Data Offloading for Cellular Network Based on the Software Defined Network (S-
DN) Inside Mobile Edge Computing (MEC) Architecture,” IEEE Access, vol. 6,
pp. 17741–17755, Mar. 2018.
[57] G. Luo, Q. Yuan, H. Zhou, N. Cheng, Z. Liu, F. Yang, and X. S. Shen, “Cooper-
ative Vehicular Content Distribution in Edge Computing Assisted 5G-VANET,”
China Communications, vol. 15, no. 7, pp. 1–17, Jul. 2018.
[66] Y. Fu, J. Bi, Z. Chen, K. Gao, B. Zhang, G. Chen, and J. Wu, “A Hybrid Hier-
archical Control Plane for Flow-Based Large-Scale Software-Defined Networks,”
IEEE Trans. Network and Service Management, vol. 12, no. 2, pp. 117–131, June.
2015.
[70] C. J. Watkins and P. Dayan, “Q-Learning,” Machine learning, vol. 8, no. 3-4,
pp. 279–292, May. 1992.
[74] F. R. Yu, J. Liu, Y. He, P. Si, and Y. Zhang, “Virtualization for Distributed
Ledger Technology (vDLT),” IEEE Access, vol. 6, pp. 25019–25028, Apr. 2018.
[76] M. Castro and B. Liskov, "Practical Byzantine Fault Tolerance and Proactive Recovery," ACM Transactions on Computer Systems (TOCS), vol. 20, no. 4, pp. 398–461, Nov. 2002.
[79] H. Wu, X. Wang, Q. Zhang, and X. Shen, “IEEE 802.11e Enhanced Distributed
Channel Access (EDCA) Throughput Analysis,” in Proc. IEEE ICC’06, (Istan-
bul, Turkey), June. 2006.
[83] B. Cui and S. J. Yang, “NRE: Suppress Selective Forwarding Attacks in Wireless
Sensor Networks,” in Proc. IEEE Communications and Network Security, (San
Francisco, USA), Oct. 2014.
[86] X. Chen, L. Jiao, W. Li, and X. Fu, “Efficient Multi-User Computation Offload-
ing for Mobile-Edge Cloud Computing,” IEEE/ACM Trans. Networking, vol. 24,
no. 5, pp. 2795–2808, Oct. 2015.