
Blockchain and Federated Learning For Privacy-Preserved Data Sharing in Industrial IoT

This document discusses a blockchain-enabled architecture for secure data sharing in the Industrial Internet of Things (IIoT) that integrates federated learning to preserve privacy. It addresses the challenges of data leakage and privacy concerns by allowing data owners to share models instead of raw data, ensuring that sensitive information remains protected. The proposed method demonstrates good accuracy and efficiency while maintaining data security through a permissioned blockchain framework.


IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 16, NO. 6, JUNE 2020 4177

Blockchain and Federated Learning for Privacy-Preserved Data Sharing in Industrial IoT

Yunlong Lu, Student Member, IEEE, Xiaohong Huang, Member, IEEE, Yueyue Dai, Student Member, IEEE, Sabita Maharjan, Member, IEEE, and Yan Zhang, Senior Member, IEEE

Abstract: The rapid increase in the volume of data generated from connected devices in the industrial Internet of Things paradigm opens up new possibilities for enhancing the quality of service for emerging applications through data sharing. However, security and privacy concerns (e.g., data leakage) are major obstacles for data providers to share their data in wireless networks. The leakage of private data can lead to serious issues beyond financial loss for the providers. In this article, we first design a blockchain empowered secure data sharing architecture for distributed multiple parties. Then, we formulate the data sharing problem into a machine-learning problem by incorporating privacy-preserved federated learning. The privacy of data is well-maintained by sharing the data model instead of revealing the actual data. Finally, we integrate federated learning in the consensus process of permissioned blockchain, so that the computing work for consensus can also be used for federated training. Numerical results derived from real-world datasets show that the proposed data sharing scheme achieves good accuracy, high efficiency, and enhanced security.

Index Terms: Data sharing, federated learning, industrial Internet of Things (IIoT), permissioned blockchain, privacy-preserved.

Manuscript received May 31, 2019; revised August 13, 2019 and September 3, 2019; accepted September 3, 2019. Date of publication September 18, 2019; date of current version February 28, 2020. This work was supported in part by the Joint Funds of National Natural Science Foundation of China and Xinjiang under Project U1603261, and in part by the National Natural Science Foundation of China under Project 61602055. Paper no. TII-19-2282. (Corresponding author: Yan Zhang.)
Y. Lu and X. Huang are with the Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: [email protected]; [email protected]).
Y. Dai is with the University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]).
S. Maharjan is with the Simula Metropolitan Center for Digital Engineering, Norway, and also with the University of Oslo, 1325 Oslo, Norway (e-mail: [email protected]).
Y. Zhang is with the Department of Informatics, University of Oslo, 1325 Oslo, Norway (e-mail: [email protected]).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TII.2019.2942190
1551-3203 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: East China Univ of Science and Tech. Downloaded on April 16,2021 at 03:05:13 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

The amount of data generated by the connected devices in the industrial Internet of Things (IIoT) paradigm has witnessed a massive growth in Industry 4.0. Along with the value the data brings come serious concerns about data privacy [1]. Data leakage may take place during data storage, data transmission, and data sharing, which may lead to serious issues for both owners and providers. In this regard, existing work mainly focuses on utilizing aggregate information about the data without breaking the privacy of the participants. These works address the problem by making some modifications to the key contributions of original data, such as k-anonymity [2] and l-diversity [3]. But most of these methods assume that the attackers have only limited background knowledge, so the data are still vulnerable to algorithm-based attacks or background knowledge attacks. Differential privacy [4] provides the most reliable privacy guarantee, which is generally considered strong enough to protect data from privacy attacks. A differentially private machine learning method [5] was proposed to publish data structures instead of publishing queries and answers directly, under the constraint of differential privacy.

Data from IIoT applications may include sensitive information. In this regard, protecting data privacy is a key issue. In [6], the authors proposed a protection method that satisfies differential privacy to protect location data privacy without reducing much utility of data in IIoT. There are also some works exploring the use of blockchain to enhance data security in IIoT. In [7], the authors also integrated blockchain into edge intelligence for resource allocation in IIoT. Though the combination is promising, the machine-learning methods can be further improved. Therefore, some works exploited Markov models, which can illustrate activity transactions without having knowledge of the problem at hand [8], for resource allocation. For example, in [9], the authors leveraged deep reinforcement learning (DRL) for task offloading and transmission scheduling. The use of new machine-learning technology also brings new security threats such as cyber stealth attacks [10], which impose new security requirements [11] for protecting data privacy in the sharing process. In [12], the authors implemented a protocol that turns a blockchain into an automated access-control manager, to ensure that users can own and control their data. In [13], the authors proposed a blockchain-enabled efficient data collection and secure sharing scheme combining Ethereum blockchain and DRL to create a reliable and safe environment. Among these works, consensus protocols are a core technical component to achieve consensus among all participating nodes. In proof-of-work (PoW) [14], the miner that solves a mathematical puzzle first wins the right to produce a block. However, the heavy resource requirements for solving those puzzles limit the applicability of PoW-based consensus mechanisms.
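The resource cost of such puzzles can be seen in a toy example. The following is a minimal sketch, not the protocol of [14] or of this article: the function names `mine` and `verify`, the SHA-256 puzzle, and the hex-prefix difficulty rule are all assumptions made for illustration.

```python
import hashlib


def verify(block_data: str, nonce: int, difficulty: int) -> bool:
    """Check one candidate: a single hash, cheap for any verifier."""
    digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def mine(block_data: str, difficulty: int) -> int:
    """Brute-force search for the smallest nonce that solves the puzzle.

    Assumed puzzle: the SHA-256 hex digest of data+nonce must start with
    `difficulty` zero characters, so about 16**difficulty hashes are
    expected before a solution is found.
    """
    nonce = 0
    while not verify(block_data, nonce, difficulty):
        nonce += 1
    return nonce


# Hypothetical block payload; difficulty 3 needs roughly 4096 attempts.
nonce = mine("block-1", 3)
```

Verification takes one hash while mining takes exponentially many in the difficulty, which is the asymmetry that makes PoW costly for resource-constrained devices.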
The concept of private multiparty data sharing has drawn much attention recently as a promising approach to address the issue of computing and storage resource constraints. Several works on data sharing applications with multiple distributed data owners have been published recently, including the sharing of horizontally partitioned data [15] and monitoring over distributed data streams [16]. There is a wide range of applications on the collaborative use of data in distributed scenarios. For instance, the authors in [17] proposed LDPGen, a multiphase technique, to ensure local differential privacy in decentralized social graphs. Mobile edge computing (MEC) is a powerful technique for such resource constrained applications on distributed data. In [18], to allocate resources efficiently and fairly in industrial IoT-based applications, the authors proposed a forward central dynamic and available approach by adapting the running time of sensing and transmission processes in IoT devices. In [19], a method for conserving position confidentiality of roaming position-based services users was proposed based on MEC techniques in real-time industrial informatics. Recently, federated learning [20] has emerged for multiple data owners to train a global model collaboratively without sharing their raw data, respecting the privacy concerns of sharing data. In [21], the authors proposed an algorithm for client-side differential privacy-preserving federated optimization, to hide clients' contributions during the training process. Based on a hierarchical architecture in which the server aggregates the users' training updates, the authors in [22] proposed a federated learning based proactive content caching scheme.

Yet, the presence of a centralized curator in most of the existing data sharing schemes increases the risk of data leakage, especially in the application of distributed multiple parties. There are mainly two obstacles. First, there can be a high volume of aggregated data from different parties to be processed by the curator, including some unknown fresh data. Second, none of these parties fully trusts the others (including the curator), so each fears data leakage.

To this end, the application of collaborative data sharing over distributed multiple parties in IIoT faces several challenges. New collaborative mechanisms for distributed data sharing among multiple untrusted parties are therefore needed for IIoT applications.

In this article, we propose a differentially private multiparty data model sharing method based on permissioned blockchain. Instead of sharing raw data directly, we incorporate federated learning algorithms to map raw data into corresponding data models, which addresses privacy concerns in the learning phase through distributed training by local users. We also design a distributed architecture for data sharing between multiple parties based on blockchain, in which blockchain enables secure data retrieval and ensures accurate model training. Our main contributions in this article are as follows.
1) We transform the data sharing problem into a machine learning problem by leveraging federated learning to build data models and sharing the data models instead of raw data.
2) We propose a new blockchain empowered collaborative architecture to share data over distributed multiple parties to reduce the risk of data leakage, through which data owners can further control the access to shared data.
3) We integrate differential privacy into federated learning to further protect data privacy. We also evaluate the effectiveness of our proposed model with benchmark, open real-world datasets for data categorization.

The rest of this article is organized as follows. In Section II, we present our system model. In Section III, we describe permissioned blockchain and federated learning for data sharing in detail. In Section IV, we present security analysis for our proposed scheme, and provide illustrative numerical results. Finally, Section V concludes this article.

II. SYSTEM MODEL

In this article, we consider a common distributed data sharing scenario with multiple parties involved. Each participant owns its own data and is willing to share it. Contributors can make better use of their data by combining them to implement a collaborative task. For example, traffic prediction can make greater progress by utilizing data from multiple sensors. An illustration of data sharing among various devices is shown in Fig. 1. Our goal is to design a secure data sharing mechanism, which can share data among distributed multiple users intelligently while also maintaining data privacy effectively.

Fig. 1. Example of distributed data sharing.

We consider $N$ parties (data holders) and a union dataset $D$. Each party $P_i$ holds a local dataset $D_i \subseteq D$. Each of the $N$ parties agrees on sharing its data without revealing any private information. Let $R = \{r_1, r_2, \dots, r_m\}$ be the data sharing requests, with queries $r_i$, submitted by a requester. Instead of returning the raw data, we provide the computed results for these queries. All the participants related to the request then work together with the corresponding learning algorithm to train a global model $M$, without leaking any private data. Finally, the trained global data model $M$ is returned to the data recipient. Leveraging the received model, data recipients can get answers $R(M)$ toward their data sharing requests locally.

LU et al.: BLOCKCHAIN AND FEDERATED LEARNING FOR PRIVACY-PRESERVED DATA SHARING IN INDUSTRIAL IoT 4179

Fig. 2. Architecture of secure data sharing scheme.


Fig. 3. Working mechanism of proposed method.

A. Threat Model

We focus on collaborative data sharing, where $K$ data providers (owners) and one data requester work together to accomplish a data sharing task. The data providers and the data requester are considered dishonest. The proposed mechanism is vulnerable to three types of threats. The first is the quality of the provided data. Dishonest providers may provide biased and inaccurate results to the requester, reducing the usability of the entire shared data. The second is data privacy. Providers and receivers may try to infer the private data of others from shared data, which may lead to unwanted sensitive data leakage from data providers. The threat of collusion also exists if a group of participants try to infer the data of other participants. The third is data authority management. Once the raw data is shared, the data owner loses control over these data, and the data may be shared with other unauthorized entities by a dishonest participant.

B. Our Proposed Architecture

Our proposed data sharing architecture is shown in Fig. 2. The proposed system consists of two modules: a permissioned blockchain module and a federated learning module. The permissioned blockchain establishes secure connections among all the end IoT devices through its encrypted records. It is maintained by entities equipped with computing and storage resources, named super nodes, such as base stations and roadside units. There are two types of transactions in our permissioned blockchain: retrieval transactions and data sharing transactions. Because of privacy concerns and storage limitations, we use the permissioned blockchain only to retrieve the related data and manage the accessibility of data, instead of recording the raw data. Moreover, the permissioned blockchain records all the sharing events of data, which makes it possible to trace the use of data for further audit.

Suppose all the parties that agreed to data sharing have been registered in the permissioned blockchain by uploading retrieval records to the blocks. A data requester launches a data sharing request $Req$ containing a set of queries $F_x = \{f_1, f_2, \dots, f_x\}$ to its nearby super node $SN_{req}$. Fig. 3 shows the working mechanism of our scheme. The nearby super node $SN_{req}$ first searches the permissioned blockchain to check whether the request has been processed before. If there is a hit, the request is forwarded to the node that has cached the results toward request $Req$. The cached results are then sent to the requester as a reply. Otherwise, for a new data sharing request, the multiparty data retrieval process is executed to find the related parties according to the registration records. We regard these parties as committee nodes, which are responsible for driving the consensus in the permissioned blockchain. Then, the committee nodes jointly train a global data model $M$ by federated learning. Once the model is trained, the data requester $r$ uses $Req = \{f_1, f_2, \dots, f_x\}$ as the input of $M$ and gets the corresponding sharing results $M(Req)$. Data model $M$ can accept any query $f_x$ in the query set $F_x$ and provide a result $M(f_x)$ for the query. In addition, as a machine learning model, $M$ can also make predictions on a fresh query $f_y \notin F_x$.

III. BLOCKCHAIN AND FEDERATED LEARNING FOR SECURE DATA SHARING

In this article, we consider the problem of privacy-preserving data sharing in decentralized multiple parties. Due to the constrained resources of edge users and their privacy concerns, we share the federated data model learned over decentralized multiple parties instead of the original data. The data model contains valid information toward the requests and minimized private data of participants.

A. Normalized Weighted Graph

It is challenging for the IIoT end devices to output and maintain structured data due to limited computing and storage resources. Instead, they are more likely to generate unstructured data, e.g., in the form of text files. Study regarding such data is limited. To fill this gap, we focus on unstructured data, specifically textual data, in our data sharing scenarios. We define a two-step distance metric learning scheme for retrieval of the textual data, which can quantify the similarity of specified data.

To improve the computing and storage efficiency with constrained resources, we leverage graphs to represent original data for further processing, as defined in Definition 1, which can retain more structure and context information.

Definition 1 (Weighted Context Graphs): A weighted context graph $G = \{V, E\}$ comprises a set of nodes (key terms) $V$ and a set of edges $E \subseteq V \times V$. Each node $n_i$ contains a text term $t_{n_i}$ and its weight $w_{n_i}$, written $(n_i, w_{n_i})$. Each edge $e_{ij}$ connects


node $n_i$ and $n_j$ with a weight $w_{e_{ij}}$ denoting the correlation degree.

We use the weight matrix $A = [a_{ij}]$ to represent the graph, where $a_{ij} = w_{n_i}$ if $i = j$ and $a_{ij} = w_{e_{ij}}$ if $i \neq j$. We leverage term frequency-inverse document frequency to construct our graphs. Thus, all the textual files are transferred into weighted context graphs $\{G_1, G_2, \dots, G_n\}$.

In the second step, we serialize the graphs. Although graphs keep much context information, they are difficult to process further as input to machine learning algorithms. We map the graphs into linear vectors by serializing them into a sequenced vector. The graphs are first merged into a global graph $G = G_1 \cup G_2 \cup \dots \cup G_n$. For the global graph $G = \{V, E\}$, let $k$ be the number of representative vertices. Then, the size of normalized attributes for nodes will be $k$ and the size of normalized attributes for edges will be $k \times (k - 1)/2$. Thus, the normalized vector is $Seq = V \cup E = \{V_1, \dots, V_k\} \cup \{E_1, E_2, \dots, E_{k(k-1)/2}\}$. We leverage Jaccard similarity [23] as the distance function to cluster documents with the k-means algorithm. With the assistance of the normalized weighted graph and the defined distance metric, we cluster the dataset $\{D_1, \dots, D_n\}$ into various categories according to textual similarity. We also divide the participating users into different groups according to their data.

B. Multiparty Data Retrieval

Since most of the data are sensitive and their amount is large, it is a resource intensive and risky task to put data on the blockchain with its limited storage space. Thus, we utilize the blockchain to retrieve data, while the real data is stored locally by its owners. When a new data provider joins, its unique identity (ID) is recorded as a transaction in the blockchain, together with the profiles of its data, including data categories, data types, and data size. All the profiles of data from multiple participants are recorded in the form of transactions and are verified by the blockchain nodes by adopting a Merkle tree [24]. Each data sharing event is also stored in the blockchain as a transaction. The detailed forms of the two transactions, namely the retrieval transactions recording data profiles of permissioned participants and the data sharing transactions recording all the data sharing events, are shown in Fig. 4.

Fig. 4. Records of blockchain.

The retrieval of related participants on the blockchain toward a data sharing request is a fundamental problem to be addressed in the proposed model. Since there are many participants, those who possess data related to the request should participate in data sharing to increase the accuracy of response results. Nonetheless, the retrieval process should not break the privacy of each participant. A distributed retrieval scheme is needed to quickly locate the requested data distributed among participants, which can collaboratively respond to the request.

Inspired by the previous work of Kademlia [25], we design a multiparty retrieval mechanism in blockchain. All participants $P$ are partitioned into various communities according to their data categories, that is, members of a community hold similar categories of data. Each community maintains a local retrieval table of $\log(n)$ records toward $\log(n)$ different communities. Each node in the community stores the IDs of all its community members, together with $\log(\log(n))$ nodes for each of its $\log(n)$ closest (most related in data categories) communities. In this way, the most related participants are kept locally in the local retrieval table of $P_i$, as shown in Fig. 5.

Fig. 5. Local retrieval table.

We extract a list of key terms from the data of every participant as the representative features in the form of hash values. Furthermore, due to the limited communication resource of IIoT devices, the physical distance between two nodes $d_p(P_i, P_j)$ should also be considered in the retrieval process. Then, based on Jaccard distance, the logic distance between their key terms will be

$$d_i(P_i, P_j) = \frac{\sum_{m,n \in \{P_i \cup P_j - P_i \cap P_j\}} \left(a^{P_i}_{mn} + a^{P_j}_{mn}\right)}{\sum_{m,n \in P_i \cup P_j} \left(a^{P_i}_{mn} + a^{P_j}_{mn}\right)} \cdot \log\left(d_p(P_i, P_j)\right) \tag{1}$$

where $a^{P_i}_{mn}$ and $a^{P_j}_{mn}$ are the elements of the weighted matrix for node $P_i$ and node $P_j$, respectively. The ID of each participant (device) is generated according to the logic distance. That is, the more related two nodes are, the longer their common ID prefix will be. Given two nodes $P_i$ and $P_j$ with IDs $P_i(id)$ and $P_j(id)$, respectively, the relevance distance between them is defined as

$$d(P_i, P_j) = P_i(id) \oplus P_j(id). \tag{2}$$

When a user submits a data sharing request to its nearby node $P_i$, all nodes in the same community with $P_i$ send the request to the nodes in their local routing table with a certain distance


to start the retrieving process. This process is carried out recursively until all nodes within the relevant distance have been traversed. At the end of retrieval, we get the related subset of nodes toward the request, $P_s \subseteq P$, which are also the committee nodes for running a consensus process to approve the data sharing results.

C. Data Sharing Process

Existing methods use encryption for data security. However, in data sharing scenarios, it is still risky for data holders to share the original data due to various attacks on the encryption. A more secure method is to share the answers toward the requests, which can provide the requesters with valid information and protect the privacy of data holders. The data providers share learned data models with the requesters instead of the original data.

When the data requester $r$ initiates a sharing request $Req$, it submits the request to its nearby super node $SN_{req}$ of the permissioned blockchain. $SN_{req}$ first searches the blockchain to find out whether the request has been processed before. If there is a lookup hit, then the previously computed and cached data model $M$ is returned to the requester directly. Otherwise, node $SN_{req}$ looks up the blockchain for related nodes (the committee nodes) toward the sharing request, through the aforementioned multiparty data retrieval process. The committee nodes are responsible for executing the consensus process and for learning the federated data model $M$ collaboratively. A committee node $P_i$ learns a local data model $m_i$ for the requests from requesters, then sends model $m_i$ to other related participants according to the local retrieval table of $P_i$. This process is repeated jointly on various related parties until all related parties are traversed. The trained data model $M$ is returned to requester $r$ as the answer to its data sharing requests.

The detailed steps of our data sharing scheme are as follows.
1) Initialization: Before a data provider $P_i$ joins, a local clustering based on Jaccard similarity is executed to cluster its textual data into various categories. We serialize the key terms as vectors in a certain order to represent a category. The more similar two datasets are, the closer their distance is. Then, the nearby super nodes, to which $P_i$ belongs, search the blockchain to find the records that are logically close (according to XOR distance) to it. For each participant, we generate its ID based on the hashed vectors to ensure that participants holding similar datasets have similar IDs. In addition, to enhance computing efficiency, we divide the participants in advance by running a community partition process, where all nodes are partitioned into various communities according to their distances toward each other.
2) Registering Retrieval Records: Once $P_i$ joins, it first sends its public key $PK_r$ and its data profiles to the nearby super node for registration. Then, a data retrieval record for $P_i$ is generated and broadcast by the node to other nodes in the permissioned blockchain for verification. Other nodes collect all received records and verify them before writing them into the permissioned blockchain.
3) Launching Data Sharing Requests: Data requester $r$ posts a data sharing request $Req = \{f_1, f_2, \dots, f_x\}$ to its nearby super node $SN_{req}$. Request $Req$ contains the ID of $r$, the requested data category, and the timestamp, and is signed by $r$ with its private key $SK_r$.
4) Data Retrieval: Once a nearby node receives the data sharing request, it verifies the ID of requester $r$. Then, it searches the permissioned blockchain to confirm whether the request has been processed before. If there is a record, the cached model is returned as the reply. Otherwise, it runs the multiparty retrieval process to find the related parties.
5) Data Model Learning: The related parties work collaboratively to respond to the sharing request. They run federated learning to train a global data model $M$ toward the request $Req$. The training set is generated based on local data $D$ and the corresponding query results $f_x(D)$, that is, $D_T = \langle f_x, f_x(D) \rangle$. The learned global model $M$ is then returned to the requester as a reply and is cached locally by a node for future requests.
6) Generating Data Sharing Records: The data sharing events between data requesters and data providers are generated as transactions and broadcast in the permissioned blockchain. All the records are collected into blocks, which are encrypted and signed by the collecting node.
7) Carrying Out Consensus: The consensus process is executed by the related nodes selected for data retrieval. Each node competes for the opportunity to write blocks to the blockchain through the PoW protocol. The node that wins the competition broadcasts its block to other nodes for verification. Once the verification is passed, the block is added to the permissioned blockchain, which is tamper-proof.

Combining federated learning with permissioned blockchain, the requested data can be retrieved and shared securely in industrial IoT scenarios with distributed multiple data providers, which can improve the scale and quality of shared data. However, the PoW consensus protocol incurs both high energy consumption and computation overhead, thus making it less practical for IoT devices to adopt. To address this issue, we further propose a new consensus to improve the utility and efficiency of computing work in the consensus protocol.

D. Consensus: Proof of Training Quality (PoQ)

Transferring the data sharing problem into model sharing brings many benefits in data sharing. Sharing only the data model instead of the original data helps protect the privacy of data owners. In addition, the machine learning data models are more effective at providing the required information for new sharing requests.

Directly using an existing consensus such as PoW for data sharing either brings a high cost of computing and communication resources, or makes limited additional contribution to data sharing. To address this problem, we propose a federated learning empowered consensus, the PoQ protocol. PoQ combines


data model training with the consensus process, which makes better use of the nodes’ computing resources.

For a specific data sharing request, we select the members of the consensus committee by retrieving the nodes related to the request from the blockchain. The committee is responsible for driving the consensus process, as well as for learning the data models for the requested data. The objective of federated learning is to train a global data model $M$ that provides a valid response $M(Req)$ to a data sharing request $Req$. Model $M$ can be trained with a range of machine learning algorithms, e.g., random tree, random forest, and gradient boosting decision tree (GBDT). Once constructed, model $M$ can answer data queries, even if the queries are fresh.

1) Differentially Private Federated Learning: To protect data privacy during multiparty decentralized learning, we incorporate a differential privacy preserving mechanism into federated learning. For two neighboring datasets $D$ and $D'$ that differ in at most one record, and a set of outcomes $S$, a randomized algorithm $A$ achieves $\epsilon$-differential privacy if

$$\Pr[A(D) \in S] \le \exp(\epsilon) \cdot \Pr[A(D') \in S] \tag{3}$$

where $\epsilon$ is the privacy budget. The training process over distributed data owners is shown in Fig. 6. The related parties $\{P_1, P_2, \ldots, P_n\}$ are selected through multiparty retrieval in the blockchain. The textual data they hold is transformed into normalized graph vectors $Vec_g = v_1, v_2, \ldots, v_k, e_{11}, e_{12}, \ldots, e_{kk}$. When there is a data sharing request $R$ (including a series of queries), $P_i$ first trains a local data model $m_i$ based on $Vec_{g_i}$ for the request. The vector $Vec_{g_i}$, composed of sensitive terms and weights, is used to train a local model in the federated learning process. Since the local model will be shared with other participants, to protect the privacy of $Vec_{g_i}$, we incorporate differential privacy in the learning phase to train a model $\hat{m}_i$ from noised data. Then, $P_i$ sends model $\hat{m}_i$ to other participants. Once $\hat{m}_i$ is received, $P_{i+1}$ trains a new local data model $\hat{m}_{i+1}$ based on the received $\hat{m}_i$ and its local data, then broadcasts $\hat{m}_{i+1}$ to other participants. The data models are trained iteratively among participants. Finally, the global data model $M$ is generated, where $M = \{\hat{m}_1 \cup \hat{m}_2 \cup \cdots \cup \hat{m}_n\}$.

Fig. 6. Overview of training in distributed scenario.

For any participant $P_i$, there are the following three steps to learn the local model $\hat{m}_i$.

1) Selecting training samples: The data owner selects the data $D$ related to the request set $R$ and transforms it into normalized graph vectors $Vec$.
2) Differential private local model training: The noise calibrated by sensitivity $s$ is added to local data $Vec_i$. The local data model $\hat{m}_i$ is trained locally at $P_i$ by using a machine learning algorithm on the selected noisy data $Vec_i$.
3) Collaborative multiparty learning: The Laplace mechanism is applied on local data model $m_i$ to achieve differential privacy

$$\hat{m}_i = m_i + \mathrm{Laplace}(s/\epsilon) \tag{4}$$

where $s$ is the value of sensitivity, as shown in (5)

$$s = \max_{D, D'} \| f(D) - f(D') \|_1. \tag{5}$$

Then, the noise-added model $\hat{m}_i$ is broadcast as a transaction of the blockchain to other participants for federated learning. This process is repeated iteratively until the performance of the federated model reaches the threshold or the training time runs out.

Algorithm 1 illustrates the overall process of our proposed scheme.

Algorithm 1: Differentially Private Federated Learning.
Input: data request Req, related participants P, iteration counter iter = 0
Output: data model M
 1: for each participant p_i ∈ P do
 2:   while accuracy ≤ Threshold do
 3:     if iter = 0 then
 4:       Construct a new differential data model m̂_i based on its noise-added vector data Vec_i
 5:       Broadcast m̂_i to other related participants according to local retrieval tables
 6:     else
 7:       Construct differential private data model m̂_i with previously received models
 8:       Broadcast m̂_i to the other participants who are engaged in the data sharing process
 9:       iter = iter + 1
10:     end if
11:     $M = \frac{1}{k} \sum_{i=1}^{k} \hat{m}_i$
12:   end while
13: end for
14: return M to the requester

2) Training Quality Based Consensus: The consensus process is executed by the selected committee based on the work of collaborative training. The committee nodes are a subset of all the participants. The communication overhead is reduced by sending consensus messages only to the committee nodes instead of all the nodes. However, the reduction in the number of nodes also makes it more challenging to achieve consensus. In order to balance the overhead and the security, we provide the proof of training work for consensus in data sharing.
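The noise-adding step in (4), the sensitivity in (5), and the averaging in line 11 of Algorithm 1 can be illustrated with a minimal sketch. It assumes that each local model is reduced to a flat list of numeric parameters, which is a simplification for illustration and not the tree-based models used in this article. The function names, the example vectors, and the choice of sampling Laplace noise as a difference of two exponential draws are all hypothetical choices of the sketch.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def l1_sensitivity(f, neighbouring_pairs):
    """Equation (5), evaluated only over an explicit list of neighbouring pairs."""
    return max(
        sum(abs(a - b) for a, b in zip(f(d1), f(d2)))
        for d1, d2 in neighbouring_pairs
    )


def privatize(model, sensitivity, epsilon, rng):
    """Equation (4): add Laplace(s / epsilon) noise to every parameter."""
    scale = sensitivity / epsilon
    return [w + laplace_noise(scale, rng) for w in model]


def aggregate(noisy_models):
    """Line 11 of Algorithm 1: element-wise mean of the k noisy local models."""
    k = len(noisy_models)
    return [sum(ws) / k for ws in zip(*noisy_models)]


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
local_models = [[0.2, 1.0], [0.4, 0.8], [0.6, 1.2]]  # hypothetical parameters
noisy = [privatize(m, sensitivity=1.0, epsilon=0.5, rng=rng) for m in local_models]
global_model = aggregate(noisy)
```

A smaller privacy budget gives a larger noise scale s/ε, so the averaged model is more private but less accurate. Averaging over more participants reduces the variance of the aggregated noise.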


The committee leader is selected based on the quality of the trained model. Since each committee node trains a local data model, the quality of the model should be verified and measured during the consensus process. We leverage prediction accuracy to quantify the performance of the trained local model. More specifically, for classification tasks, the accuracy is the fraction of correctly classified records, while for regression tasks, the accuracy is measured by the mean absolute error (MAE)

$$\mathrm{MAE}(m_i) = \frac{1}{n} \sum_{i=1}^{n} |y_i - f(x_i)| \tag{6}$$

where $f(x_i)$ is the prediction value of model $m_i$ and $y_i$ is the real value of the records. The lower the MAE of model $m_i$ is, the higher the accuracy of $m_i$ will be.

After the differentially private collaborative training, we get the trained global data model $M$ and the local model $m_i$ for each committee node. The consensus process is executed by the committee. While responding to a data sharing request, a committee node $P_i$ transmits its trained model $m_i$ to the next committee node. The transmissions are recorded as model transactions $t_{m_i}$, together with $\mathrm{MAE}(m_i)$. The record tuple is shown in Table I. The model transactions in the training process are illustrated in Fig. 7. To encrypt and sign the messages, $P_i$ has a pair of public and private keys $(PK_i, SK_i)$. Then, $P_i$ broadcasts the encrypted transmission $E(SK_i(t_m), PK_i)$ to other committee nodes. A committee node $P_j$ collects all model transactions and stores them locally as candidate blocks. As a proof of training work, $P_j$ verifies all transactions it receives by calculating the MAE defined in (6). The MAE for $P_j$, $\mathrm{MAE}_u(P_j)$, is calculated as

$$\mathrm{MAE}_u(P_j) = \gamma \cdot \mathrm{MAE}(m_j) + \frac{1}{n} \sum_{i} \mathrm{MAE}(m_i) \tag{7}$$

where $\mathrm{MAE}(m_j)$ is the MAE of the locally trained model $m_j$, and $\gamma$ is the weight parameter denoting the contribution of $P_j$ to the global model, decided by the training data size of $P_j$ and the other participants: $\gamma = 1 + |d_j| / \sum_i |d_i|$.

TABLE I
RECORD TUPLE OF A MODEL TRANSACTION

Fig. 7. Illustration of model transactions.

When the consensus process starts, the committee node with the lowest $\mathrm{MAE}_u$ at that time is elected as the committee leader through MAE-based voting. The leader is responsible for driving the consensus process among the participating nodes. As mentioned earlier, the leader gathers all the transactions it received at the beginning, including the final data model $M$, to form a block $B_k = (H_k, t_{m_i}, M)$, where $H_k$ is the header of $B_k$. Then, the leader broadcasts $B_k$ to all members of the committee for approval. In addition to the regular verification of a block (e.g., the header format, block size, and timestamp), the committee nodes also audit the block by verifying the model transaction track, much as the verification nodes verify a transaction’s amount in Bitcoin. Each verifying node calculates $\mathrm{MAE}(m_i)$ for each model transaction and $\mathrm{MAE}(M)$. If the calculated MAE is within a certain range, an approval is sent to the leader. If the block containing all transactions is approved by every committee node, the leader sends the block data signed with its signature to all nodes. Then, the records are stored in the blockchain, where they are tamper-proof. The process for training work based consensus is illustrated in Fig. 8.

Fig. 8. Consensus process of PoQ.

IV. SECURITY ANALYSIS AND NUMERICAL RESULTS

A. Security Analysis

The use of permissioned blockchain establishes a secure mechanism for multiple parties without mutual trust. We integrate federated learning into the consensus process of permissioned blockchain to address the aforementioned security threats.

1) Achieving Differential Privacy: According to the definition of differential privacy, if each step of data processing follows the requirement of differential privacy, i.e., (3), the final results will satisfy differential privacy [26]. Algorithm 1 shows that the privacy budget is only consumed in step 4, where noise is added to data vectors. The other steps, which use the training methods (e.g., trained tree) for calculating residuals, are only mapping operations that are data-independent and will not disclose any private information. Therefore, Algorithm 1 satisfies $\epsilon$-differential privacy.
2) Removing Centralized Trust: The permissioned blockchain takes the place of a trusted curator to connect each participant through multiparty data retrieval. The centralized trust, which incurs a high risk of data leakage, is no longer required in the proposed blockchain empowered data sharing scheme.


3) Guaranteeing the Quality of Shared Data: To prevent the dishonest provider from sharing invalid data, the PoQ consensus process validates the quality of learned data models by other data providers, and only the qualified models are preserved.
4) Secure Data Management: Only the data retrieval is uploaded to the permissioned blockchain while real data is stored locally by each data provider. Data owners can control the authority of their own data. Moreover, permissioned blockchain uses a series of cryptographic algorithms such as elliptic curve digital signature algorithm and asymmetric cryptography to guarantee the security of data.

Fig. 9. AUC in various datasets (20News).
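The quality-based leader election and block approval described in (6) and (7) can be illustrated with a minimal sketch. Several details are assumptions of the sketch, since the text leaves them open: the sum in (7) and the denominator of γ are taken over all committee nodes including $P_j$, and the "certain range" used for approval is modeled as an absolute tolerance between the reported and the recomputed MAE. All function and parameter names are hypothetical.

```python
def mae(y_true, y_pred):
    """Equation (6): mean absolute error over n records."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n


def gamma(j, data_sizes):
    """Weight of node j: 1 + |d_j| / sum_i |d_i| (sum assumed to include j)."""
    return 1 + data_sizes[j] / sum(data_sizes)


def mae_u(j, local_maes, data_sizes):
    """Equation (7): weighted own MAE plus the mean MAE of all local models."""
    n = len(local_maes)
    return gamma(j, data_sizes) * local_maes[j] + sum(local_maes) / n


def elect_leader(local_maes, data_sizes):
    """Return the index of the committee node with the lowest MAE_u."""
    scores = [mae_u(j, local_maes, data_sizes) for j in range(len(local_maes))]
    return min(range(len(scores)), key=scores.__getitem__)


def approves(reported_mae, recomputed_mae, tolerance):
    """Assumed approval rule: recomputed MAE is close to the reported one."""
    return abs(reported_mae - recomputed_mae) <= tolerance


# Hypothetical committee of three nodes.
local_maes = [0.3, 0.1, 0.2]
data_sizes = [100, 100, 200]
leader = elect_leader(local_maes, data_sizes)
```

With these example values the second node (index 1) has the lowest $\mathrm{MAE}_u$ and becomes the leader. A verifying node would then call `approves` once per model transaction and send its approval only if every check passes.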

B. Evaluation Setup

We evaluate the proposed secure data sharing scheme on two real-world datasets, which are widely used for evaluating text-related machine learning algorithms. The first one is the Reuters dataset [27], a benchmark dataset for classification tasks. The dataset consists of a series of short files on various topics that appeared on the Reuters newswire. It contains a total of 15 732 files in 116 categories. The second one is the 20 newsgroups dataset [28], which is a collection of approximately 20 000 newsgroup files. The data are partitioned into 20 different groups, where each group is related to one topic. The data in the two datasets is unstructured short text, which is quite different from the structured data in databases. We use the two datasets to simulate the large amount of unstructured short data pieces in IIoT, such as the configuration files generated from various vehicular applications and the status log files.

We divide the sorted dataset into shards and recombine the shards into subsets to simulate the distributed multiple participants in our data sharing scheme. Classification analysis is used to simulate the data sharing tasks. We implement our improved distributed GBDT on textual data to execute the federated learning process over distributed datasets.

Fig. 10. AUC in various datasets (Reuters).

Fig. 11. Running time in various datasets (20News).
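The shard-and-recombine partitioning described in the evaluation setup can be illustrated with a minimal sketch. The sort key (the class label), the number of shards per participant, and the random recombination are assumptions of the sketch, as the text does not specify them. The function name and the toy records are hypothetical.

```python
import random


def make_participants(records, num_participants, shards_per_participant, seed=0):
    """Split (text, label) records into per-participant subsets.

    Records are sorted by label (assumed sort key), cut into equal shards,
    and the shuffled shards are dealt out to the participants. Records that
    do not fill a complete shard are dropped.
    """
    ordered = sorted(records, key=lambda r: r[1])
    num_shards = num_participants * shards_per_participant
    shard_size = len(ordered) // num_shards
    shards = [
        ordered[i * shard_size:(i + 1) * shard_size] for i in range(num_shards)
    ]
    random.Random(seed).shuffle(shards)
    return [
        [
            rec
            for shard in shards[p * shards_per_participant:
                                (p + 1) * shards_per_participant]
            for rec in shard
        ]
        for p in range(num_participants)
    ]


# Toy stand-in for a labelled text corpus: 40 documents in 4 classes.
records = [(f"doc{i}", i % 4) for i in range(40)]
parts = make_participants(records, num_participants=4, shards_per_participant=2)
```

Because each shard is cut from label-sorted data, every participant sees only a few classes, which mimics data owners that hold different kinds of files.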
C. Numerical Results

The receiver operating characteristic (ROC) curve is widely used to illustrate the diagnostic ability of a classifier as its discrimination threshold is varied. We use the area under the ROC curve (AUC) to evaluate the accuracy of the model. The performance of our proposed mechanism in various distributed datasets, compared with a benchmark method, text graph convolutional networks [29], is shown in Figs. 9 and 10. From Fig. 9, we can conclude that compared with the benchmark method, most of our testing groups obtain high accuracy with an average AUC value of 0.918, which indicates that our proposed federated learning mechanism achieves high diagnostic ability. We can also observe that the accuracy does not change smoothly. The reason is that the number of files in each subset is a fixed discrete value, and the accuracy is also related to the characteristics of the files in different subsets, so it cannot be a continuously changing value as the size of the data increases. Fig. 10 shows the AUC results with various numbers of data providers (3, 6, and 9). It can be seen that the AUC results change little as the number of data providers increases, which indicates that our proposed model is scalable. Since the global model in federated learning is aggregated from all local models, and each local model is trained on local data, the number of data providers has little effect on the performance of the aggregated model. Fig. 11 shows that the running time of our proposed mechanism varies from milliseconds to seconds with an average of 880 ms in different subdatasets. From Fig. 12, we can see that, for the same dataset, the running time increases with the number of data providers involved. The reason is that the more data providers there are, the more time it takes to carry out the collaborative working process. Moreover, the results of running time show that the performance of the proposed federated learning based data sharing scheme is near real time.

Through the above evaluation, we can observe that the increase in data providers has little effect on the accuracy of our
proposed scheme, while the running time increases evidently. The accuracy is stable because our scheme executes local training in parallel, where the federated mode ensures stable learning accuracy. Yet the increase in data providers brings more local models to be updated and computed, which increases the time for training and update transmission. Thus, the running time increases with the number of data providers. Despite the slightly increased running time, the participation of multiple data providers enlarges the scale of data for computing results, which enables the sharing of data to improve the quality of vehicular applications.

Fig. 12. Running time in various datasets (Reuters).

V. CONCLUSION

In this article, we proposed a privacy-preserving data sharing mechanism for distributed multiple parties in IIoT applications, which incorporates federated learning into permissioned blockchain. The illustrative numerical results showed that our blockchain empowered data sharing scheme enhances the security of the sharing process without requiring centralized trust. Moreover, by integrating federated learning into the consensus process of permissioned blockchain, we not only improved the utilization of computing resources but also increased the efficiency of the data sharing scheme. Numerical results on two benchmark real-world datasets corroborated that our proposed mechanism can enable secure data sharing with high efficiency and utility.

The combination of blockchain and federated learning is a promising way to enable secure and intelligent data sharing in IIoT. However, how to efficiently guarantee data privacy by applying blockchain techniques is still an open issue, which needs to be further explored by analyzing more security threats and developing more effective solutions. Moreover, how to improve the utility of data models mapped from raw data, regardless of the specific computing tasks and machine learning algorithms, is a critical problem to be addressed in data sharing. New intelligent mechanisms are required to improve data utility. In addition, the limited resources of devices impose new challenges for improving the efficiency of data sharing in IIoT.

REFERENCES
[1] E. Sisinni, A. Saifullah, S. Han, U. Jennehag, and M. Gidlund, “Industrial Internet of Things: Challenges, opportunities, and directions,” IEEE Trans. Ind. Informat., vol. 14, no. 11, pp. 4724–4734, Nov. 2018.
[2] L. Sweeney, “k-anonymity: A model for protecting privacy,” Int. J. Uncertainty, Fuzziness Knowl.-Based Syst., vol. 10, no. 05, pp. 557–570, 2002.
[3] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” in Proc. 22nd Int. Conf. Data Eng., 2006, pp. 24–24.
[4] C. Dwork, “Differential privacy in new settings,” in Proc. 21st Annu. ACM-SIAM Symp. Discrete Algorithms, 2010, pp. 174–183.
[5] T. Zhu, G. Li, W. Zhou, and S. Y. Philip, “Differentially private data publishing and analysis: A survey,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 8, pp. 1619–1638, Aug. 2017.
[6] C. Yin, J. Xi, R. Sun, and J. Wang, “Location privacy protection based on differential privacy strategy for big data in industrial Internet of Things,” IEEE Trans. Ind. Informat., vol. 14, no. 8, pp. 3628–3636, Aug. 2018.
[7] K. Zhang, Y. Zhu, S. Maharjan, and Y. Zhang, “Edge intelligence and blockchain empowered 5G beyond for industrial Internet of Things,” IEEE Netw. Mag., to be published.
[8] C. Alcaraz, L. Cazorla, and G. Fernandez, “Context-awareness using anomaly-based detectors for smart grid domains,” in Proc. Int. Conf. Risks Sec. Internet Syst., 2014, pp. 17–34.
[9] K. Zhang, S. Leng, X. Peng, P. Li, S. Maharjan, and Y. Zhang, “Artificial intelligence inspired transmission scheduling in cognitive vehicular communications and networks,” IEEE Internet Things J., vol. 6, no. 2, pp. 1987–1997, Apr. 2019.
[10] L. Cazorla, C. Alcaraz, and J. Lopez, “Cyber stealth attacks in critical information infrastructures,” IEEE Syst. J., vol. 12, no. 2, pp. 1778–1792, Jun. 2018.
[11] C. Alcaraz and J. Lopez, “Analysis of requirements for critical control systems,” Int. J. Crit. Infrastruct. Protection, vol. 5, nos. 3–4, pp. 137–145, 2012.
[12] G. Zyskind, O. Nathan, and A. Pentland, “Decentralizing privacy: Using blockchain to protect personal data,” in Proc. IEEE Sec. Privacy Workshops, May 2015, pp. 180–184.
[13] C. H. Liu, Q. Lin, and S. Wen, “Blockchain-enabled data collection and sharing for industrial IoT with deep reinforcement learning,” IEEE Trans. Ind. Informat., vol. 15, no. 6, pp. 3516–3526, Jun. 2019.
[14] S. Nakamoto, Bitcoin: A Peer-to-Peer Electronic Cash System, 2008. [Online]. Available: https://bitcoin.org/bitcoin.pdf
[15] S. Goryczka, L. Xiong, and B. C. Fung, “m-privacy for collaborative data publishing,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 10, pp. 2520–2533, Oct. 2014.
[16] A. Friedman, I. Sharfman, D. Keren, and A. Schuster, “Privacy-preserving distributed stream monitoring,” in Proc. 35th Annu. IEEE Netw. Distrib. Syst. Secur. Symp., 2014, pp. 1–9.
[17] Z. Qin, T. Yu, Y. Yang, I. Khalil, X. Xiao, and K. Ren, “Generating synthetic decentralized social graphs with local differential privacy,” in Proc. ACM SIGSAC Conf. Comput. Commun. Sec., 2017, pp. 425–438.
[18] A. H. Sodhro, S. Pirbhulal, and V. H. C. de Albuquerque, “Artificial intelligence-driven mechanism for edge computing-based industrial applications,” IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 4235–4243, Jul. 2019.
[19] A. K. Sangaiah, D. V. Medhane, T. Han, M. S. Hossain, and G. Muhammad, “Enforcing position-based confidentiality with machine learning paradigm through mobile edge computing in real-time industrial informatics,” IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 4189–4196, Jul. 2019.
[20] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” 2016. [Online]. Available: https://arxiv.org/abs/1610.05492
[21] R. C. Geyer, T. Klein, and M. Nabi, “Differentially private federated learning: A client level perspective,” 2017, arXiv:1712.07557. [Online]. Available: https://arxiv.org/abs/1712.07557
[22] Z. Yu et al., “Federated learning based proactive content caching in edge computing,” in Proc. IEEE Global Commun. Conf., 2018, pp. 1–6.
[23] S. Niwattanakul, J. Singthongchai, E. Naenudorn, and S. Wanapu, “Using of jaccard coefficient for keywords similarity,” in Proc. Int. MultiConf. Eng. Comput. Scientists, 2013, pp. 380–384.
[24] A. Kosba, A. Miller, E. Shi, Z. Wen, and C. Papamanthou, “Hawk: The blockchain model of cryptography and privacy-preserving smart contracts,” in Proc. IEEE Symp. Sec. Privacy, 2016, pp. 839–858.

[25] P. Maymounkov and D. Mazières, “Kademlia: A peer-to-peer information system based on the XOR metric,” in Peer-to-Peer Systems, P. Druschel, F. Kaashoek, and A. Rowstron, Eds., Berlin, Germany: Springer, 2002, pp. 53–65.
[26] C. Dwork, “A firm foundation for private data analysis,” Commun. ACM, vol. 54, no. 1, pp. 86–95, 2011.
[27] D. D. Lewis, “Reuters dataset,” 2019. [Online]. Available: http://www.daviddlewis.com/resources/testcollections/
[28] “20newsgroups,” 2019. [Online]. Available: http://qwone.com/jason/20Newsgroups/
[29] L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for text classification,” in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 7370–7377.

Yunlong Lu (S’18) received the B.S. degree in electronic information science and technology from Beijing Forestry University, Beijing, China, in 2012 and the M.S. degree in computer technology from the School of Computer Science, Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2015. He is currently working toward the Ph.D. degree in computer science and technology with the Institute of Network Technology, BUPT.
He is currently a Visiting Ph.D. Student with the University of Oslo, Oslo, Norway. His current research interests include blockchain, wireless networks, and privacy-preserving machine learning.

Xiaohong Huang (M’17) received the B.E. degree in mechanical and electronic engineering from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2000, and the Ph.D. degree in mechanical and electronic engineering from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 2005.
Since 2005, she has been with BUPT and she is currently a Full Professor and the Director of Network and Information Center, Institute of Network Technology of BUPT. She has authored or coauthored more than 50 academic papers in the area of wavelength division multiplexing (WDM) optical networks, IP networks, and other related fields. Her current interests include Internet architecture, software-defined networking, and network function virtualization.

Yueyue Dai (S’17) received the B.Sc. degree in communication and information engineering in 2014, from the University of Electronic Science and Technology of China, Chengdu, China, where she is currently working toward the Ph.D. degree in communication and information engineering.
She is currently a Visiting Ph.D. Student with the University of Oslo, Oslo, Norway. Her research interests include wireless network, mobile edge computing, Internet of Vehicles, blockchain, and deep reinforcement learning.

Sabita Maharjan (M’09) received the Ph.D. degree in networks and distributed systems from Simula Research Laboratory and University of Oslo, Oslo, Norway, in 2013.
She is currently a Senior Research Scientist with the Simula Metropolitan Center for Digital Engineering, Oslo, and an Associate Professor with the Department of Informatics, University of Oslo. Her research interests include wireless networks, network security and resilience, smart grid communications, Internet of Things, machine-to-machine communication, software-defined wireless networking, and the Internet of Vehicles.

Yan Zhang (M’05–SM’10) received the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2006.
He is currently a Full Professor with the Department of Informatics, University of Oslo, Oslo, Norway. His research interests include next-generation wireless networks leading to 5G Beyond, green and secure cyber-physical systems (e.g., smart grid and transport).
Dr. Zhang is an Editor of several IEEE publications, including the IEEE COMMUNICATIONS MAGAZINE, IEEE NETWORK, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING, IEEE COMMUNICATIONS SURVEYS & TUTORIALS, IEEE INTERNET OF THINGS, IEEE SYSTEMS JOURNAL, and IEEE VEHICULAR TECHNOLOGY MAGAZINE. He has held Chair positions at a number of conferences, including the IEEE Global Communications Conference 2017, IEEE International Symposium on Personal, Indoor and Mobile Radio Communications 2016, and IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids 2015. He is the IEEE Vehicular Technology Society Distinguished Lecturer. He is a Fellow of Institution of Engineering and Technology. He was the Chair of IEEE Communications Society Technical Committee on Green Communications & Computing. He is recognized as a “Highly Cited Researcher” in 2018 according to Web of Science.
