
Privacy Preserving Federated Machine Learning on Recycling Data

Bachelor Semester Project S5 (Academic Year 2024/25), University of Luxembourg

Sven Kuffer, BICS Student, University of Luxembourg, [email protected]
Dr. Jeff Mangers, External Advisor, CRAB, [email protected]
Prof. Thomas Engel, Supervisor, University of Luxembourg, [email protected]

Abstract—Data privacy and secure data sharing are essential in today's data-driven economy, particularly when multiple stakeholders require sensitive information to achieve mutually beneficial outcomes. This proposal outlines a framework for secure, distributed, and sandboxed multi-party computation (SMPC) for data processing between two distinct entities: a product-designing Beginning-of-Life (BOL) company and an End-of-Life (EOL) recycling and sorting company. These companies aim to securely share data to analyze and improve product recyclability, with each retaining ownership of their proprietary information. The proposed solution involves a secure computation environment leveraging sandboxed virtual machines and MPC protocols, ensuring that sensitive data is processed in situ (on site) within each company's environment without compromising confidentiality. Using open-source tools such as ZeroVM, a distributed framework that processes each company's data locally is proposed, producing insights that guide the BOL company in optimizing product design for recyclability based on the EOL company's operational metrics. By enabling secure, private, and decentralized data analysis, this approach has potential applications beyond recycling, offering a scalable solution for collaborative data analysis in supply chains and product life cycle management.

Index Terms—distributed processing, data, security, sandbox, virtual machine (VM)

I. INTRODUCTION

In 2016, the EU released new legislation, the General Data Protection Regulation [2] (GDPR), for data protection. This regulation was meant to protect sensitive user data from third-party companies processing it. As the regulation is written, users have to consent to their data being processed and stored, which accounts for a potential future use of fully homomorphic encryption [12] as a possible solution to the problem of third-party data processors. Since Fully Homomorphic Encryption (FHE) schemes are still in their early days, come with overhead that grows the more the data is processed, and require data to adhere to very specific formats, one has to resort to other methods for the time being.

Although this regulation was tailored to sensitive consumer data, companies also want to protect their own data from third parties accessing it. Data is a very valuable asset for these companies.

In the current industry, companies often need to collaborate and share data to optimize processes and improve workflows. In the automotive industry, for example, autonomous vehicles with AI technology often need to share image data to enhance their training on object and obstacle recognition. The focus of research there has been to train YOLO models locally per vehicle and compute a combined model from all actors, to prevent sensitive location and image data from being shared among them.

In the scenario of this paper, however, recycling facilities create their own image datasets of recycling streams, with the goal of training an image recognition model to help improve sortability and predictions. While they can all train their own models locally, the end goal is a superior model built from the combination of all datasets and the combined computational power. The following two research questions highlight the extent of the described problem:

(1) This scenario introduces a critical challenge: How can the companies involved reach their combined computational goal without revealing their data and, by that, their company secrets?

(2) How can a semi-trusted third party reconstruct the output, obtain the averaged model and verify the claims about each dataset, without revealing any sensitive information?

Different secure multi-party computation (SMPC) techniques can be considered to reach the goal of privacy-preserving federated machine learning, some of which are discussed in the next section.
II. RELATED WORK

A. Machine Learning

Machine Learning (ML) is an umbrella term for many different implementations of algorithm classes that learn and optimize from the given data instead of following a rigid function; machine learning can be seen as a set of converging optimization problems. We generally consider three categories of machine learning: Supervised Learning, Unsupervised Learning and Reinforcement Learning. In this paper we mainly focus on supervised learning, where our data is labeled and the algorithm learns to predict those labels. More precisely, we focus on computer vision algorithms, where the input is a labeled dataset of images. The labels in this case are class-labeled boxes with coordinates and size parameters.

1) Computer Vision: Computer vision describes the identification of certain features in image, video or LIDAR data. Mostly, it describes the use of machine learning and AI to achieve these goals.

a) You only look once: You only look once (YOLO) describes a family of computer vision machine learning algorithms that is prominent for fast training on image data. It achieves this speed and efficiency by looking over each image only once. The YOLO models used in our testing are based on the Ultralytics [14] models, which support object detection, segmentation, classification and pose estimation. Object detection is the focus of this paper.

2) Federated Learning: While there are many ways to achieve federated learning [4] [11], achieving privacy requires cryptographic methods such as homomorphic encryption or secure multi-party computation schemes, as described in the next part. To achieve this, one would typically need to alter the functions used in training to work with MPC [13]; however, this is not always feasible. For one, it can add significant overhead to the computation, and the increased need for computational power cannot be carried by all actors, in addition to significantly longer training times. Moreover, one cannot easily change the architecture of closed-source models. To avoid opting for a potentially inferior architecture, and to reduce the added computational cost, a different method can be applied [5].

In a slow offline phase, the actors perform their normal model training on their own datasets. This is followed by a faster online phase, using cryptographic protocols to keep the data protected, in which the actors calculate a combined model by averaging their parameter weights, weighted by their input dataset sizes [10] [9], to increase fairness and ensure a robust, superior model output.
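For illustration, the plaintext version of this dataset-size-weighted averaging could look as follows (a minimal sketch with names of our own choosing; in the actual protocol this computation is carried out under MPC rather than in the clear):

Listing 1: Dataset-size-weighted parameter averaging (plaintext sketch)

import torch

def weighted_average(state_dicts, dataset_sizes):
    # state_dicts:   one model state_dict per actor
    # dataset_sizes: local dataset sizes d_j (treated as public here)
    total = float(sum(dataset_sizes))
    return {
        name: sum(sd[name].float() * (d / total)
                  for sd, d in zip(state_dicts, dataset_sizes))
        for name in state_dicts[0]
    }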
B. Secure Multi-Party Computation (MPC)

Secure Multi-Party Computation (MPC) [7] is a term that describes multiple parties trying to achieve a combined computational goal while using cryptographic schemes to protect each party from malicious intervention by the others. There are many different schemes for different applications, and one also has to consider the problem statement and setup to derive a security model and choose the correct scheme wisely.

1) Adversaries: First, consider two different types of adversaries that result in different security models [8].

a) Semi-honest adversaries: are not interested in harming the computation and its output; they do, however, have an interest in gaining information about the inputs of other parties.

b) Malicious adversaries: on the other hand, may also want to harm the computation itself, which requires a higher level of security.

2) Passive and Active Security: In a semi-honest environment, passive security is sufficient, as the adversaries do not want to harm the resulting computation output but might seek insight into the input data. In a malicious environment, however, active security is required. Fortunately, the problem statement of this paper only requires passive security under a semi-honest model.

3) Malicious Majority: A malicious majority describes a state where the majority of parties act maliciously. One has to make sure that, in this case, no majority can reconstruct the secret without the involvement of the other parties.

4) Secret Sharing: While there are many different schemes for SMPC, such as garbled circuits [5] or homomorphic encryption [12], the function to be computed here only requires a summation of parameters. For this, secret sharing schemes such as Shamir secret sharing [3] come in handy. The idea is to place the secret on a polynomial of degree t − 1, where the secret sits at f(0), and to give each party a random point on this polynomial. To reconstruct the secret, at least t shares are needed. The reconstruction happens using Lagrange interpolation: for each share, one computes a basis polynomial g that equals 1 at that share's point and 0 at all other shared points; summing these basis polynomials, each weighted by its share value, recovers the secret at g(0) = f(0). This is the basis of the implementation discussed later. Usually this happens in a finite field, using modulo operations, to prevent over- and underflow. This, however, means that it only works on positive integers, while the parameters in a machine learning model are floats. Normalization therefore has to be applied to split off the sign, flag zeros and obtain a fixed-precision representation of the float using mantissa and exponent, which can then be scaled to an integer. Alternatively, the finite-field constraint can be omitted if one can be sure that over- or underflow cannot happen.
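The following is a minimal sketch of (n, t)-Shamir sharing over a prime field with Lagrange reconstruction (our own illustration, not the project's code). Because shares can be summed pointwise before reconstruction, addition of secrets, the only operation the averaging protocol needs, comes for free:

Listing 2: (n, t)-Shamir secret sharing with Lagrange reconstruction (sketch)

import random

PRIME = 2**61 - 1  # field modulus; must exceed any value being shared

def make_shares(secret, n, t):
    # random polynomial f of degree t-1 with f(0) = secret
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, k, PRIME) for k, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation evaluated at x = 0
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME  # modular inverse (Python 3.8+)
    return secret

# shares of two secrets can be summed pointwise; any t shares recover the sum
a, b = make_shares(12, n=3, t=2), make_shares(30, n=3, t=2)
summed = [(x, (ya + yb) % PRIME) for (x, ya), (_, yb) in zip(a, b)]
assert reconstruct(summed[:2]) == 42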
a) Floating-Point Arithmetic in Secret Sharing: is not an easy task. Usually, we perform secret sharing over a finite set of integers using modulo operations. For this, a large enough prime number P is selected to create the field F_P. However, this only lets us compute over positive integers, whereas in machine learning our parameters are mostly small floating-point numbers that can also be negative. One solution is to scale the floating-point numbers to integers and reverse the scaling later. A framework such as CrypTen [6] by Meta can be used; it encodes a floating-point number as an integer for federated machine learning. However, this alone is not secure, as it only scales the number by a large factor, leaving the need for secret sharing on top. To pre-process the parameters, one can extract from the float the sign, a zero flag, and the mantissa and exponent, as a tuple ⟨s, z, m, e⟩. The mantissa and exponent give a fixed-point representation of the magnitude, f = m · 2^e, and the float can be written as u = (1 − 2s)(1 − z) · m · 2^e [1]. With this extraction, we obtain a number that we can scale to an integer, with the sign kept separate. Finally, all operations need to work on tensors.
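A minimal sketch of this ⟨s, z, m, e⟩ decomposition, following the representation in [1] (our own illustration, not the project's code; the fixed precision of 24 bits is an arbitrary choice):

Listing 3: Splitting a float into sign, zero flag, mantissa and exponent (sketch)

import math

def float_to_tuple(u, precision=24):
    # u == (1 - 2s) * (1 - z) * m * 2**e, with m a non-negative integer
    s = 1 if u < 0 else 0
    z = 1 if u == 0 else 0
    frac, exp = math.frexp(abs(u))   # abs(u) == frac * 2**exp, 0.5 <= frac < 1
    m = int(frac * 2**precision)     # integer mantissa at fixed precision
    e = exp - precision
    return s, z, m, e

def tuple_to_float(s, z, m, e):
    return (1 - 2 * s) * (1 - z) * m * 2.0 ** e

assert tuple_to_float(*float_to_tuple(-3.5)) == -3.5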
5) Homomorphic Encryption: Homomorphic encryption [12] is another technique that can be used to secure a secret and perform computations on said secret. Fully homomorphic encryption is still largely impractical and is so far achieved by extending somewhat-homomorphic encryption schemes, with noise accumulating on each multiplication or addition. Partially homomorphic encryption schemes, on the other hand, have been known for a while; RSA, co-invented by Shamir, was among the first. These schemes, however, only enable one type of operation. That would be enough for this use case, as only addition is needed during the secure phase. In this paper, secret sharing was nonetheless chosen over homomorphic encryption.
III. DESIGN

The goal is to implement a privacy-preserving federated learning scheme for a You-Only-Look-Once convolutional neural network. In combination with MPC, each input is processed to produce a combined output that satisfies the security constraints of all entities.

In the offline phase, each actor trains their own model locally. They add a hash H of the dataset and a commitment C to the claimed dataset size as metadata within the model's checkpoint. Using a zero-knowledge-proof scheme, the verifier can then check that the claims were in fact correct.

In the online phase, the actors extract the model weights and average them using MPC protocols, weighted by their claimed dataset sizes, to prioritize superior models. The output is given to all participating actors without revealing sensitive data. Finally, the semi-trusted third party can verify the claims and validate each model's performance on a validation set.

A. Computational goal

The goal is to compute a weighted average of all learned models. Normally, the average of a parameter P_i over n parties can be calculated as P_i = (Σ_{j=1}^{n} P_{ij}) / n. To weight this average, so as to increase the importance of better-performing models, we chose to weigh it by the local dataset size d_j, with the global dataset size being D = Σ_{j=1}^{n} d_j. The weights can be varied; an extra weight α, derived from a performance test of each model on a validation set, could be introduced, for example. The resulting weighted-average equation is P_i = (Σ_{j=1}^{n} P_{ij} · d_j) / D. This equation involves both addition and multiplication; however, multiplication complicates things for secret sharing, as it would involve multiple rounds of communication between nodes and the sharing of triples. To eliminate the need for both multiplication steps, one can first perform the weighting by the local dataset size locally, in the offline pre-processing step. To deal with the multiplication by the inverse of the global dataset size, the output can first be reconstructed by the mediator: the local dataset sizes can be sent to the mediator, as they are not considered sensitive data, and the mediator, after reconstruction, can divide by the summed sizes without any need for further encryption and communication. This leaves a simple summation, using only addition, for the secret-sharing online phase.
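As a small worked sketch of this trick (names are our own): each node multiplies its parameters by its d_j before sharing, the online phase only adds shares, and the mediator divides by the public D after reconstruction.

Listing 4: Eliminating multiplication from the weighted average (sketch)

def preprocess(params, d_j):
    # offline, at node j: weight each parameter by the local dataset size d_j;
    # the scaled values are what get secret-shared
    return {name: value * d_j for name, value in params.items()}

def finalize(summed, dataset_sizes):
    # at the mediator, after reconstructing the share sum:
    # divide by the public global dataset size D
    D = sum(dataset_sizes)
    return {name: value / D for name, value in summed.items()}

# two nodes with scalar "parameters": (1.0 * 10 + 2.0 * 30) / 40 == 1.75
nodes = [({"w": 1.0}, 10), ({"w": 2.0}, 30)]
summed = {"w": sum(preprocess(p, d)["w"] for p, d in nodes)}  # done under sharing
assert finalize(summed, [d for _, d in nodes]) == {"w": 1.75}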
B. Actors

This section describes all actors involved in this problem.
• Dataset owners: The owner of a dataset has an incentive to keep their valuable data private while taking part in the combined training of a superior model.
• Semi-trusted third party: The semi-trusted third party may or may not have their own dataset and/or participate in the online phase of the federated learning process. Most importantly, their goal is to verify the claims made by each party and to validate and rate the trained model checkpoints.
• Processing company: A processing company is an analytics entity, participating only in the online phase of the federated learning process.

C. Input

This section describes the parameters involved in the computation. The computer vision task involving a YOLO CNN is a supervised learning problem, which means that the input consists of both the image data and the corresponding labels. The label for each object in an image is a tuple L = ⟨c, x, y, w, h⟩, where c is the class index, x and y are the Cartesian coordinates, and w and h are the width and height. In addition, a base model needs to be shared as a foundation to start the training on. The metadata of this model again lists the class names, as well as some hyperparameters for the model to use in training. In summary, the inputs are (see the example after this list):
• A labeled dataset of recycling stream images or videos that fits the YOLO format.
• A base model with predefined classes to start the offline training process.
• A hash of the dataset and a commitment to the claimed dataset size.
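For illustration, in the standard YOLO label convention each line of a label .txt file encodes one object as ⟨c, x, y, w, h⟩, with coordinates normalized to [0, 1]; a minimal parser could look like this (the example line and class are hypothetical):

Listing 5: Reading one YOLO-format object label (sketch)

def parse_yolo_label(line):
    # one line per object: "class x_center y_center width height"
    c, x, y, w, h = line.split()
    return int(c), float(x), float(y), float(w), float(h)

# hypothetical object: class 2, centered at (0.41, 0.63), 12% wide, 30% tall
print(parse_yolo_label("2 0.41 0.63 0.12 0.30"))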
IV. MPCAST FRAMEWORK

From the problem statement, the requirements and the proposed design follows the implementation of a peer-to-peer framework for training local models and averaging the learned parameters in a secure environment: meet MPCast.

A. Architecture

The architecture for such a framework can be derived from the design section.

The framework consists of two different parties with specific roles. The first party is the mediator, a (semi-)trusted party that initiates and orchestrates communication, shares the initial base model for training, receives the end result, reconstructs it and averages it. The second party could potentially be split into two sub-parties, a data holder and a computation node, but for the sake of this implementation both are fused into one single node type. This party receives the base model and performs a training session on the model. The training can be implemented independently, in a straightforward way or even as complex batch processing with checkpoints, fail-safes and orchestration; the implementation is left to the user. After the training has finished, a completion statement is issued back to the mediator, who then initiates the next phase. Still in the offline phase, the nodes prepare their data. The model is preprocessed by extracting its state, scaling it by the dataset size, and normalizing it to prepare it for generating shares. Normalization in this case means splitting off the sign and ensuring the tensor values are all integers, so that they can live in a field F_P, where P is a prime large enough for all potential values. The nodes then generate shares using an (n, t)-Shamir sharing scheme, where n is the number of nodes and at least t shares are required to reconstruct a secret. Moving to the online phase, the nodes distribute their shares peer-to-peer. Once all shares have been received, each node sums the shares it holds and reports the output back to the mediator. The mediator receives all these sums and can now reconstruct the summed secret. This sum can then be averaged to reveal the newly learned model, which can be validated and, if needed, learning can continue.

Fig. 1: Architecture

B. Implementation

When it comes to the implementation, Python was chosen as the main language, as it is the standard in machine learning applications. While we use a YOLO model based on torch, any model could potentially work. The peer-to-peer communication between nodes was also implemented in Python, using flask. While flask is mainly a web server framework, modifying it a bit with multi-threading results in a p2p network over HTTP(S).

Fig. 2: P2P Network

Some race conditions arise when working with multi-threading, file reads and requests arriving in parallel. These are handled with thread locking, status flags and redundancy, such as retries when sending files.
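A minimal sketch of what such a flask-based share exchange could look like (the /share route, payload shape and helper names are our assumptions for illustration, not the project's actual interface):

Listing 6: Flask-based share exchange between peers (sketch)

import threading
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
lock = threading.Lock()
received_shares = []  # shares POSTed to this node by its peers

@app.route("/share", methods=["POST"])
def receive_share():
    with lock:  # guard shared state against parallel incoming requests
        received_shares.append(request.get_json())
    return jsonify(status="ok")

def send_share(peer_url, share, retries=3):
    # send one share to a peer, retrying on transient failures
    for _ in range(retries):
        try:
            requests.post(f"{peer_url}/share", json=share, timeout=5)
            return True
        except requests.RequestException:
            pass
    return False

if __name__ == "__main__":
    app.run(port=5001)  # development server; use WSGI + HTTPS in production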
Flask comes with a development server and is not production-ready; for that, a Web Server Gateway Interface (WSGI) server is needed, along with HTTPS certificates to secure the communication channels. If the communication is not secure, the whole SMPC calculation is pointless. The choice of the communication framework is reflected upon again in the conclusion of the report. To help with the secret-sharing part, CrypTen by Meta is used. It is a machine-learning-focused, torch-friendly, small encryption library that helps with the normalization of our tensors. The rest of the Shamir secret sharing scheme is implemented natively in Python, notably the generation of shares, their distribution, the summation, and the reconstruction using Lagrange interpolation. The averaging has been simplified to remove multiplication and division from the equation, as these would have required multiple communication rounds between nodes. It is nonetheless the same calculation, just that the first multiplication happens locally in the pre-processing step and the division by the sum of all dataset sizes happens after the reconstruction. This is achievable because we do not consider the dataset size to be sensitive data to share.
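As an aside, the fixed-point encoding such a library performs can be pictured as follows (a simplified sketch in the spirit of CrypTen's encoder; the power-of-two scale with 16 fractional bits is our assumption, and, as noted above, scaling alone provides no secrecy without secret sharing on top):

Listing 7: Fixed-point encoding of float tensors (sketch)

import torch

SCALE = 2**16  # assumed fixed-point precision of 16 fractional bits

def encode(t: torch.Tensor) -> torch.Tensor:
    # scale floats to integers so they can be shared over a finite field
    return (t * SCALE).long()

def decode(t: torch.Tensor) -> torch.Tensor:
    return t.float() / SCALE

x = torch.tensor([0.5, -1.25])
assert torch.equal(decode(encode(x)), x)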
The project includes two node types, mediator.py and node.py. Each node can be started with a port as an argument. The IP of the mediator is supposed to be known beforehand, as a way to access the network. For testing, all nodes can be auto-started by running the launch script, which first starts the mediator and then all other nodes with auto-assigned port numbers. For deployment, HTTPS would be needed for secure communication channels; without it, the project is not secure and does not work properly. The nodes spawn worker threads for certain tasks, which use the files trainer.py, distributor.py and calculator.py, in that order. The trainer is just a simulation of a training environment and can be replaced by any training protocol, as is or using parallelization and containerization for optimization. For the sharing, the tool uses the SecureSharedModel class, which itself uses the SecureSharedTensor class. The SecureSharedModel performs the operations on the entire model while delegating each tensor operation to the SecureSharedFloat class, which implements a tensor-friendly calculation for the sharing. A base model in the models folder is needed to run the program as a mediator. Some additional files are generated that were used for debugging purposes; these can be streamlined away in production. A testing and an inspector file are included to investigate the models.
V. CONCLUSION

To conclude, we first identify which goals from the proposal have been reached. We then discuss what was good, what was bad and what could have been improved. Lastly, we propose some topics that may be interesting for further research.

The scenario in the first research question was later redefined and refined to better reflect the use case. The goal of this question has been reached: the computation has been securely achieved using secret sharing.

The second research question redefined the role of the mediating party from a trusted party to a semi-trusted party; the goal, however, remained the same. This goal has been partially reached: reconstruction of the output is possible with Lagrange interpolation, and the last step of calculating the average can then be done by the mediator, as no sensitive data is involved there. However, while it would be possible to verify the dataset-size claim, it would be hard to prove that the claimed size was actually used in the weighting of the parameters. For this reason, the claim verification is not performed, as it serves no purpose.

While the initially proposed solution used a sandboxed environment to reach security, the final implemented solution took a different approach, using secret sharing, as is done in many federated machine learning papers.

The final solution provides an easy way to reach the goal, and it is relatively fast compared to more complex solutions that alter the model architectures to share parameters between layers. The solution generally works, but it might not perform well if the datasets and classes used by the parties differ largely; this, and how to make the models more robust, could be examined in further research. The peer-to-peer communication was achieved using Python and flask with some additional threads to handle asynchronous tasks; this generally works, but performance could be increased by using a compiled language and establishing a proper peer-to-peer network. What this paper did not answer is whether the data is actually the sensitive asset that requires protection. Further research could investigate whether the data or the learned parameters are more valuable. There are many different ways to protect parameters, labels or data; this paper explored only one of many. Future research might look into homomorphism, data or label obfuscation, or the effects of noise on the output parameters.

VI. ACKNOWLEDGMENTS

I would like to thank Prof. Dr. Thomas Engel for their guidance, expertise and feedback, and Dr. Jeff Mangers for the opportunity to research this topic in the first place. I would like to thank both of them for their invested time and for enabling me to pursue this project.

REFERENCES

[1] Mehrdad Aliasgari, Marina Blanton, Yihua Zhang, and Aaron Steele. Secure computation on floating point numbers.
[2] European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council.
[3] Timothy Finamore. Shamir's secret sharing scheme using floating point arithmetic.
[4] Deepthi Jallepalli, Navya Chennagiri Ravikumar, Poojitha Vurtur Badarinath, Shravya Uchil, and Mahima Agumbe Suresh. Federated learning for object detection in autonomous vehicles. In 2021 IEEE Seventh International Conference on Big Data Computing Service and Applications (BigDataService), pages 107–114. IEEE.
[5] Marcel Keller. MP-SPDZ: A versatile framework for multi-party computation. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pages 1575–1590. ACM.
[6] Brian Knott, Shobha Venkataraman, Awni Hannun, Shubho Sengupta, Mark Ibrahim, and Laurens van der Maaten. CrypTen: Secure multi-party computation meets machine learning.
[7] Fengxia Liu, Zhiming Zheng, Yexuan Shi, Yongxin Tong, and Yi Zhang. A survey on federated learning: a perspective from multi-party computation. 18(1):181336.
[8] Yun Luo, Yuling Chen, Tao Li, Yilei Wang, Yixian Yang, and Xiaomei Yu. An entropy-view secure multiparty computation protocol based on semi-honest model. 34(10):1–17.
[9] Michael Matena and Colin Raffel. Merging models with Fisher-weighted averaging.
[10] Vaikkunth Mugunthan, Antigoni Polychroniadou, David Byrd, and Tucker Hybinette Balch. SMPAI: Secure multi-party computation for federated learning.
[11] Gaith Rjoub, Omar Abdel Wahab, Jamal Bentahar, and Ahmed Saleh Bataineh. Improving autonomous vehicles safety in snow weather using federated YOLO CNN learning. In Jamal Bentahar, Irfan Awan, Muhammad Younas, and Tor-Morten Grønli, editors, Mobile Web and Intelligent Information Systems, volume 12814 of Lecture Notes in Computer Science, pages 121–134. Springer International Publishing.
[12] Shefer Roper and Plessner Bar. Secure computing protocols without revealing the inputs to each of the various participants.
[13] Ajith Suresh. MPCLeague: Robust MPC platform for privacy-preserving machine learning.
[14] Ultralytics. YOLO11.
