
Hardening Functions for
Large Scale Distributed
Computations
Doug Szajda
Barry Lawson
Jason Owen

Large Scale Distributed
Computations

• Easily parallelizable, compute-intensive
applications
• Divided into independent tasks to be
executed on participant PCs
• Significant results collected by the
supervisor
• Participants may receive credits
– Money, e-cash, ISP fees, fame and glory
Application Domains
• DNA sequencing
• Protein folding
• Graphics rendering
• Exhaustive regression
• Monte Carlo simulation
• Data mining
• Genetic algorithms
• Finding Martians

Examples
• seti@home
• folding@home
• GIMPS (Entropia)
• Distributed.net
• SmallPox study
– United Devices
– IBM
– Department of Defense
– Sloan-Kettering Cancer Center
– University of Western Ontario
The Problem(s)
• Code is executing in untrusted
environments
– Results may be corrupted either
intentionally or unintentionally
– Significant results may be withheld
– Cheating: credit for work not performed
• Code may be operating on confidential
data
An Easy(?) Solution
• Assign Tasks Redundantly
• Collusion may seem unlikely but…
– Firms solicit participants from groups
such as alumni associations and large
corporations
• Processor cycles are the primary
resource of the firm providing the
platform, so redundant assignment
directly increases its costs
Related Work
• Golle and Mironov (2001)
• Golle and Stubblebine (2001)
• Monrose, Wyckoff, Rubin (1999)
• Body of literature on protecting
mobile agents from malicious hosts
– Sander and Tschudin, Vigna, Hohl, and
others
Why Isn’t This the Same as the
Mobile Agent Problem?
• Mobile agent task execution often
cannot be performed a priori
– Required data is typically lacking
– E.g., gathering airfare information
• Metacomputation tasks can be
executed a priori
• Key issue is verification
Adversary
• Assumed to be intelligent
– Can decompile, analyze, modify code
– Has knowledge of the algorithms, the
assigned task, and the measures used
to prevent corruption
• Motivation may not be rational…
– Gaining credits may not be important
– E.g., a business competitor
• But does not wish to be caught
Our Approach
• Hardening functions
– Verb, not noun
• Does not guarantee resulting
computation returns correct results
• Does not prevent an adversary from
disrupting a computation
• Significantly increases likelihood that
abnormal activity will be detected
Formally
• Computation is evaluation of
algorithm f : D -> R for every input
value x in D
• Tasks created by partitioning D into
subsets Di
• Each task is assigned a filter function Gi
with domain the power set of R and
range the power set of f(Di)
Formally
• For x in Di, f(x) is significant if and
only if f(x) is in Gi(f(Di))
• Generality required for cases where
significance of computed value is
relative to values of f at other
elements of Di
– E.g. Return your ten best results
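A minimal sketch of this formalism in Python may help; the quadratic f and the "ten best" filter below are illustrative assumptions, not taken from any real computation:

```python
# Sketch: f : D -> R, one task's subset D_i, and a filter G_i deciding
# which computed values are significant. All names here are illustrative.

def f(x):
    # Stand-in for the real algorithm f : D -> R
    return (x * x) % 101

def G_i(computed_values):
    # "Return your ten best results": significance is relative to the
    # values of f at the other elements of D_i
    return set(sorted(computed_values, reverse=True)[:10])

D_i = range(1000, 2000)              # one task's portion of the data space
results = {x: f(x) for x in D_i}     # participant evaluates f on all of D_i
best = G_i(results.values())
significant = {x for x, y in results.items() if y in best}
```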
Two General Classes
• Non-sequential
– Computed values of f in task are
independent of one another
• Sequential
– Participant given a single value x_0 and
asked to compute the first m elements of
the sequence x_n = f(x_{n-1})
Hardening Non-sequentials

• IOWF (inversion of a one-way function):
given f and y, find x s.t. f(x) = y
• Extend Golle and Mironov’s ringer
strategy for IOWF
• Before transmitting a task, the supervisor
chooses n random values xj from Di and
precomputes each f(xj)
Hardening Non-sequentials
• Participant given all the f(xj) along with
y and instructed to return any x that
maps to any of these values
• Works because IOWF has some nice
properties
– Limited ways it can be disrupted
– Results are easily verified
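A hedged sketch of the ringer idea (the hash-based f, the value counts, and all names below are assumptions for illustration, not the authors' implementation):

```python
import hashlib
import random

def f(x):
    # Illustrative one-way function: truncated SHA-256 of the input
    return hashlib.sha256(str(x).encode()).hexdigest()[:8]

def supervisor_prepare(D_i, y, n=3):
    # Supervisor secretly picks n ringers x_j from D_i and precomputes f(x_j);
    # the participant sees only the images, mixed in with the real target y
    ringers = random.sample(list(D_i), n)
    targets = {f(x) for x in ringers} | {y}
    return ringers, targets

def participant_task(D_i, targets):
    # Honest participant: return every x in D_i whose image is a target
    return [x for x in D_i if f(x) in targets]

def supervisor_verify(ringers, returned):
    # Every planted ringer must come back; a missing ringer indicates a
    # participant that skipped part of its task
    return all(r in returned for r in ringers)

D_i = range(50_000)
ringers, targets = supervisor_prepare(D_i, y="deadbeef")
assert supervisor_verify(ringers, participant_task(D_i, targets))
```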
Our Extension
Plant each task’s data set with values ri
such that the following hold:
1. Supervisor knows f(ri) for each i
2. Participant cannot distinguish ri
from other data values regardless
of number of tasks a participant
completes
Our Extension
3. Participants do not know number of
ri in data space
4. For some known proportion of ri,
f(ri) is a significant result
5. Modifying the computation is no more
costly than redundantly assigning
tasks
Our Extension
Nice, but not necessary:
• The same set of ri can be used for many
different partitions of the data
space, so the effort of computing the
ri is amortized
Difficulties
• ri are indistinguishable only if they
generate truly significant results
• What is indistinguishable in theory may
not be in practice
– E.g. DES key search: Tasks given
ciphertext C and subset Ki of key space,
told to decrypt C with each ki and return
any key that generates plausible plaintext
Even the Filter Function Can Be
Revealing...
• E.g. Traveling Salesperson with five
precomputed circuits of length 100,
105, 102, 113, 104
– Return any circuit whose length is any of
the above or less than 100
– Return the ten best circuits found
– Return any circuit with length less than
120
Optimization Problems
• Designate small proportion of tasks
as initial distribution
• Distribute each of these tasks
redundantly
• Check returned values; handle
non-matches appropriately
• Retain the k best results and use them as
ringers for remaining tasks (see the
sketch below)
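A sketch of this bootstrap with placeholder names (compute, the mismatch handling, and the "smaller is better" ordering are all assumptions):

```python
import random

def initial_distribution(tasks, compute, p=0.02, redundancy=2):
    # Redundantly assign a small proportion p of the tasks and cross-check
    # the returned values (compute() stands in for a remote participant)
    chosen = random.sample(tasks, max(1, int(p * len(tasks))))
    verified = []
    for task in chosen:
        copies = [compute(task) for _ in range(redundancy)]
        if len(set(copies)) != 1:
            raise RuntimeError(f"mismatch on {task!r}: investigate participants")
        verified.append(copies[0])
    return verified

def pick_ringers(verified, k):
    # Retain the k best verified results (here "best" = smallest, e.g. the
    # shortest tour lengths) as ringers for the remaining tasks
    return sorted(verified)[:k]
```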
Collusion
• If task in initial distribution is
assigned to colluding adversaries,
supervisor will initially miss this
• Honest participants not in initial
distribution will eventually return
results that do not match
• Supervisor can then determine which
participants have been dishonest
Size of Initial Distribution
• Probability that at least k of n best
results are in proportion p of space is

 n pi (1  p)n i
ik  k 
n

n k p prob For exhaustive


regression on 32
50 8 0.25 0.9547 variables, best 105
results represents
150 5 0.1 0.9967 better than 99.99999th
percentile
105 100 0.02 ≈1
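This probability is a standard binomial tail, so the first table row can be checked directly (plain math, not code from the paper):

```python
from math import comb

def p_at_least_k(n, k, p):
    # P(at least k of the n best results fall in a proportion p of the space)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(round(p_at_least_k(50, 8, 0.25), 4))  # 0.9547, the first row above
```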
Caveat
• Previous figures assume:
– n, k much less than size of data space
– proportion of incorrect results is small
• Probability should really be adjusted
to reflect expected number of
incorrect results returned in initial
distribution
The Good
• No precomputing required
• Hardening is achieved at fraction of
cost of simple redundancy
• Ringers can be used for multiple tasks
• As additional good results are
obtained, these can be used as ringers
• Collusion resistant since ringers can be
combined in many ways
The Bad
• Time cost of an individual compute
job is at least doubled, assuming each
task requires same time to complete
• But, by running multiple projects
concurrently, the overall cost can be
reduced to a factor of 1 + p times that
of the unmodified job
The Ugly
• In some cases, implementation details can
give away identities of ringers (or require
significant changes to app)
• E.g. Exhaustive Regression
– Specify variables on which to regress using bit
strings
– Would like to be able to specify a start and end
bit string as opposed to listing all of them. This
implies some regularity
– It is difficult to hide ringers in any systematic
distribution scheme
Sequential Computations
• Seeding the data is impractical
• Often the validity of returned results
can only be checked by performing
the entire task
• Ex: Mersenne Primes
– The nth Mersenne number, Mn, is 2^n - 1
– Mn is prime ⇔ S(n-2) ≡ 0 (mod Mn), where
S(0) = 4 and S(k+1) = S(k)^2 - 2
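This is the Lucas-Lehmer test, and a direct transcription of the recurrence is short (standard algorithm, not the paper's code):

```python
def is_mersenne_prime(n):
    # Lucas-Lehmer: with S(0) = 4 and S(k+1) = S(k)^2 - 2, the Mersenne
    # number M_n = 2^n - 1 (n an odd prime) is prime iff S(n-2) == 0 mod M_n
    M = (1 << n) - 1
    s = 4
    for _ in range(n - 2):   # compute S(n-2) modulo M_n
        s = (s * s - 2) % M
    return s == 0

# Prints [3, 5, 7, 13, 17, 19, 31]; M_11, M_23, M_29 are composite
print([n for n in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31) if is_mersenne_prime(n)])
```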
The Strategy
• Share the work of computing N tasks
among K participants
• K > N, where K is a very small proportion
of the total number of participants
• Assume:
– Each task requires roughly m iterations
– K/N < 2, else simple redundancy is cheaper
The Algorithm
1. Divide tasks into S segments, each
containing roughly J = m/S
iterations
2. Each participant in group is given an
initial value and computes first J
iterations using this value
3. When J iterations complete, results
returned to supervisor
The Algorithm
4. Supervisor checks correctness of
redundantly assigned subtasks
5. Supervisor permutes N values and
assigns these values to K
participants as initial value for next
segment
6. Repeat until all S segments are
completed (see the sketch below)
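A sketch of this loop with all participants simulated locally (the assignment policy, f, and parameter values are assumptions; real participants are remote and untrusted):

```python
import random

def shared_sequential_run(f, x0_values, K, S, J):
    # Share N sequential tasks among K > N "participants" over S segments
    N = len(x0_values)
    assert N < K < 2 * N             # K/N < 2, each task duplicated at most once
    state = list(x0_values)
    for _ in range(S):
        # Each task is assigned once; K - N tasks also go to a second
        # participant so the supervisor can cross-check results
        assignment = list(range(N)) + random.sample(range(N), K - N)
        random.shuffle(assignment)   # supervisor permutes values each segment
        results = {}
        for task in assignment:      # stand-in for a participant's J iterations
            x = state[task]
            for _ in range(J):
                x = f(x)
            results.setdefault(task, []).append(x)
        for task, values in results.items():
            if len(set(values)) != 1:          # redundant copies disagree
                raise RuntimeError(f"cheating detected on task {task}")
        state = [results[t][0] for t in range(N)]
    return state

final = shared_sequential_run(lambda x: (x * x + 1) % ((1 << 61) - 1),
                              x0_values=[2, 3, 5, 7], K=5, S=10, J=100)
```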
The Numbers
• If K/N < 2 and each task assigned to no
more than two participants, in absence of
collusion, probability of catching cheater
in a segment is 2(K-N)/K
• If the adversary cheats in L of the S
segments, then

$$P(\text{nabbed}) = 1 - \left(\frac{2N - K}{K}\right)^{L}$$
More Numbers
• Last equation is not independent of S because
S is upper bound for L
• In order to have at least probability P of
catching a cheater, need

$$K \geq \frac{2N}{1 + (1 - P)^{1/L}}$$

• So a small value of S (limiting L) means more
redundancy is required for a given security level
Probabilities
K    N    S    L    P(caught)
5    4    5    5    0.9222
5    4    10   10   0.9939
5    4    10   2    0.64
5    4    20   4    0.8704
10   9    10   10   0.8926
10   9    10   2    0.36
Redundancy vs. P values
L = 1
P    0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
K/N  1.05  1.11  1.18  1.25  1.33  1.43  1.54  1.67  1.82  2.0

L = 2
P    0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
K/N  1.03  1.06  1.09  1.13  1.17  1.23  1.29  1.38  1.52  2.0
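Both tables follow directly from the two formulas above; a quick arithmetic check (illustrative only):

```python
def p_caught(K, N, L):
    # Probability a cheater who cheats in L segments is caught
    return 1 - ((2 * N - K) / K) ** L

def min_ratio(P, L):
    # Smallest K/N giving detection probability P: K >= 2N / (1 + (1-P)^(1/L))
    return 2 / (1 + (1 - P) ** (1 / L))

print(round(p_caught(5, 4, 5), 4))   # 0.9222, first row of the previous table
print(round(p_caught(10, 9, 2), 2))  # 0.36, last row
print(round(min_ratio(0.9, 2), 2))   # 1.52, from the L = 2 row above
```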
Advantages
• Far fewer task compute cycles than simple
redundancy
• Values need not be precomputed
• Method is relatively collusion resistant
(unless supervisor picks an entire group of
colluding participants)
• Method is tunable
• Can also be applied to non-sequential case
Disadvantages
• Increases coordination and
communication costs for supervisor
• Need for synchronization increases
time cost of job
– Especially problematic if participants are
connected via dial-up lines or run only
sporadically (e.g., when PC owners are
using their machines)
Disadvantages
• Strategy does not protect well
against adversary who cheats once
• Cheating damage can be magnified
because undetected incorrect results
become inputs to subsequent stages
of calculation
Conclusions
• Presented two strategies for hardening
distributed metacomputations
• Non-sequential: Seed data with ringers
• Sequential: Share N tasks among K > N
participants
• Small increase in average execution time of
modified task
• Overall computing costs significantly less
than redundantly assigning every task
