Cloud Scheduling
Papers:
- M. Zaharia et al., "Improving MapReduce Performance in Heterogeneous Environments", OSDI 2008
- M. Isard et al., "Quincy: Fair Scheduling for Distributed Computing Clusters", SOSP 2009
- A. Batsakis et al., "CA-NFS: A Congestion-Aware Network File System", FAST 2009
CS 525
Motivation - MapReduce
Straggler Task
- Performs poorly due to faulty hardware or misconfiguration
- To minimize job response time, Hadoop runs a speculative copy (backup task) of a straggler

Speculative Execution: Assumptions
- Cluster nodes are homogeneous
- Tasks progress at a constant rate
- The copy, sort, and reduce phases each take the same amount (1/3) of a reduce task's work
- Tasks finish in waves, so a low progress score indicates a straggler

Problems
- Multiple VMs run on the same physical host, across multiple hardware generations
- The copy phase of a reduce task is slowest due to network communication
- Tasks from different generations run concurrently
- Too many speculative tasks can run (up to 80% of reducers)
- The wrong (fast, new) task can be selected as a straggler
- A speculative task can be assigned to a slow node
LATE Scheduler
Speculatively execute (back up) the task with the largest estimated time left
Progress Rate = Progress Score / Execution Time
Estimated Time Left = (1 - Progress Score) / Progress Rate
[Figure: task progress over time (min). Figures are borrowed from the authors' slides.]
LATE Example
- A job with 5 tasks runs on Node 1, Node 2, and Node 3
- After 2 minutes:
  - Task at Progress = 66%: estimated time left = (1 - 0.66) / (1/3) = 1 min
  - Task at Progress = 5.3%: estimated time left = (1 - 0.05) / (1/1.9) = 1.8 min
- (The arithmetic is reproduced in the sketch below)
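To make the estimate concrete, here is a minimal Python sketch of LATE's heuristic applied to the numbers above; the function and task names are illustrative, not Hadoop's API.

```python
def progress_rate(progress_score, execution_time):
    # Progress Rate = Progress Score / Execution Time
    return progress_score / execution_time

def estimated_time_left(progress_score, rate):
    # Estimated Time Left = (1 - Progress Score) / Progress Rate
    return (1.0 - progress_score) / rate

# The two running tasks from the example above (progress rates as given):
tasks = {
    "task_A": estimated_time_left(0.66, 1 / 3),    # (1 - 0.66) / (1/3) ~= 1.0 min
    "task_B": estimated_time_left(0.05, 1 / 1.9),  # (1 - 0.05) / (1/1.9) ~= 1.8 min
}

# LATE backs up the task with the LARGEST estimated time left, even though
# its progress score alone would not mark it as the worst straggler.
straggler = max(tasks, key=tasks.get)
print(straggler, tasks[straggler])  # task_B ~1.8
```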
Looking Forward
- Assuming a constant progress rate may be incorrect
- The impact of heterogeneity already appears in a task's past progress
Thresholds
- SlowTaskThreshold (25th percentile): speculate only the tasks that hurt the response time the most
- SlowNodeThreshold (25th percentile): based on total work performed, do not launch speculative tasks on slow nodes
- SpeculativeCap (20%): avoid unnecessary speculation, limiting contention that hurts throughput (see the sketch below)
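A hedged sketch of how these three thresholds might gate a speculation decision; the percentile and cap values come from the slide, while the function and its inputs are illustrative assumptions.

```python
import numpy as np

SLOW_TASK_PCTL = 25     # SlowTaskThreshold: 25th percentile of progress rates
SLOW_NODE_PCTL = 25     # SlowNodeThreshold: 25th percentile of total work done
SPECULATIVE_CAP = 0.20  # at most 20% of task slots run speculative copies

def may_speculate(task_rate, all_task_rates, node_work, all_node_work,
                  running_speculative, total_slots):
    # Cap the number of concurrently running speculative tasks.
    if running_speculative >= SPECULATIVE_CAP * total_slots:
        return False
    # Never launch a speculative copy on a node that has performed little work.
    if node_work < np.percentile(all_node_work, SLOW_NODE_PCTL):
        return False
    # Speculate only tasks progressing slower than the 25th percentile;
    # among those, LATE picks the one with the largest estimated time left.
    return task_rate < np.percentile(all_task_rates, SLOW_TASK_PCTL)
```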
Experimental Environments
Environments
- Amazon EC2 (200-250 nodes)
- 2 replicas of each chunk in the Hadoop Distributed File System
- Up to 2 mappers and 2 reducers per node (Hadoop default)
Heterogeneity Setup
Heterogeneity is created through contention on resources (multiple VMs sharing a physical host)
[Figures: best-case and average-case results]
Remarks
Contributions
- Analysis of how heterogeneity makes Hadoop's speculation worse than in its native (homogeneous) setting
- LATE speculatively executes, on fast nodes, the tasks that hurt the response time the most
- Considers heterogeneity
Limitations
- No consideration of data locality or fairness
- Tasks may require different amounts of computation
- Speculation reduces the overall throughput of the cloud
Discussion Points
Quincy: Fair Scheduling for Distributed Computing Clusters
Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, Andrew Goldberg (Microsoft Research)
Motivation
- Scheduling concurrent jobs on clusters
- Sharing the cluster among short jobs
- High bandwidth between computers is expensive
- Computations are placed close to their input data
Cluster Architecture
Queue-based scheduler
Min-cost flow
Flow network: a directed graph in which each edge e has a non-negative integer capacity y_e and a cost, and each node has an integer supply
Feasible flow: assigns a non-negative integer flow f_e <= y_e to every edge e such that, for every node v, the incoming flow plus v's supply equals the outgoing flow (a toy instance follows)
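As a concrete illustration, here is a toy scheduling instance expressed as min-cost flow and solved with networkx; the graph shape loosely follows Quincy's (tasks, computers, an unscheduled node U, a sink S), but the costs and sizes are made up.

```python
import networkx as nx

G = nx.DiGraph()
tasks, computers = ["w1", "w2"], ["C1", "C2"]

for t in tasks:
    G.add_node(t, demand=-1)                   # each task emits 1 unit of flow
    G.add_edge(t, "U", capacity=1, weight=10)  # leaving a task unscheduled is costly
    for c in computers:
        # Cheap edge = data-local placement; expensive edge = remote placement.
        cost = 1 if (t, c) in {("w1", "C1"), ("w2", "C2")} else 5
        G.add_edge(t, c, capacity=1, weight=cost)

for c in computers:
    G.add_edge(c, "S", capacity=1, weight=0)   # each computer runs one task
G.add_edge("U", "S", capacity=len(tasks), weight=0)
G.add_node("S", demand=len(tasks))             # the sink absorbs all task flow

flow = nx.min_cost_flow(G)  # flow["w1"]["C1"] == 1 means: schedule w1 on C1
```

Changing an edge cost (e.g., raising the penalty on task-to-U edges) changes the schedule globally, which is how Quincy trades off locality, fairness, and preemption in one optimization.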
Data Locality
Application data is stored on the computing nodes, so scheduling computations close to their data is crucial for performance. Hadoop's placement preference, in order (sketched below):
1. A computer storing one of the replicas
2. A computer on the same rack
3. A random computer
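A minimal sketch of this three-step fallback, assuming hypothetical data structures (a Task record with replica_hosts, and a rack_of map); Hadoop's real scheduler works per heartbeat, but the preference order is the same.

```python
import random
from collections import namedtuple

Task = namedtuple("Task", "replica_hosts")  # hypothetical task record

def pick_computer(task, free_computers, rack_of):
    # 1. Prefer a free computer storing one of the input replicas.
    for c in free_computers:
        if c in task.replica_hosts:
            return c
    # 2. Otherwise a free computer on the same rack as a replica.
    replica_racks = {rack_of[h] for h in task.replica_hosts}
    for c in free_computers:
        if rack_of[c] in replica_racks:
            return c
    # 3. Otherwise any free computer (remote read).
    return random.choice(free_computers)

rack_of = {"c1": "r1", "c2": "r1", "c3": "r2"}
print(pick_computer(Task({"c3"}), ["c1", "c2", "c3"], rack_of))  # c3 (local)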
Coarse-grain sharing: with N computers and J jobs, each job gets at least N/J computers
- Cannot adjust to changes in a varying workload
- Low system throughput and resource utilization under nonuniform workloads

Fine-grain sharing:
- Multiplex all computers in the cluster between jobs
- When a task completes, its computer may be assigned to another job
- A job uses N/J computers at a time, but the set in use varies over its lifetime
[Figure: Quincy flow network, with worker tasks w1-w6, root task r1, unscheduled node U1, cluster aggregator X, racks R1-R2, computers C1-C6, and sink S]
Fairness Policies
- Q: Quincy, unfair sharing, without preemption
- QF: Quincy with fairness, without preemption
- QP: Quincy with preemption, unfair sharing
- QFP: Quincy with fairness and preemption
Evaluation
- Cluster of 240 computers
- Runs the Dryad distributed execution engine
- Applications:
Running Times
Makespan: total time taken by an experiment until the last job completes.
Data transfer
Discussion
CA-NFS: A Congestion-Aware Network File System

Problems in NFS
Congestion of Resources
- It is difficult to represent the congestion of multiple resources as a unified metric

False Assumptions
- Selfish clients want to maximize their own throughput
- All client requests have the same priority
- A client's benefit increases by maximizing throughput, even under congestion
CA-NFS Overview
Congestion-Aware NFS
- Resource usage is monitored and expressed as a price
- Asynchronous operations can be deferred depending on server and client states (i.e., on the price)
- Asynchronous operations should not interfere with on-demand synchronous operations
Pricing Mechanism
Congestion Price
- P_i: price of resource i
- P_max: maximum price, representing a bottlenecked resource
- u_i: utilization of resource i (0 < u_i < 1)
- k_i: performance degradation parameter of the congested resource
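A sketch of an exponential congestion-pricing function consistent with the symbols above; the exact functional form CA-NFS uses is in the paper, so treat this particular formula as an illustrative assumption.

```python
def price(u_i, k_i, p_max):
    """Price of resource i at utilization u_i (0 < u_i < 1).

    Grows slowly while the resource is lightly used and sharply as
    u_i -> 1; k_i controls how quickly performance degrades under
    congestion. price(0) == 0 and price(1) == p_max for any k_i > 1.
    """
    return p_max * (k_i ** u_i - 1) / (k_i - 1)

print(price(0.50, 64, 100))  # ~11: a half-utilized resource is still cheap
print(price(0.95, 64, 100))  # ~81: a near-saturated resource approaches P_max
```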
Pricing
- The price is increased or decreased according to the resource usage of the client and the server
- An increased price indicates congestion of a resource

Scheduling
- By comparing the advertised server price with its local price, a client schedules its asynchronous operations (a minimal sketch follows)
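A minimal sketch of that comparison, assuming the two prices are already computed; the deferral/acceleration actions are modeled with plain lists, and the function name is illustrative.

```python
deferred, flushed = [], []

def schedule_async_write(write, server_price, client_price):
    if server_price > client_price:
        # Server is the more congested side: hold the write in client
        # memory (write deferral) and send it when the price drops.
        deferred.append(write)
    else:
        # Client resources are the scarcer ones: flush to the server
        # immediately (write acceleration) to free client memory.
        flushed.append(write)

schedule_async_write(b"block-1", server_price=80, client_price=30)  # deferred
schedule_async_write(b"block-2", server_price=10, client_price=30)  # flushed
```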
Server Price
- Based on server memory, disk, and network utilization

Client Price
- Based on client memory

Write Acceleration
- Clients flush writes immediately when the server load is low
- Saves client memory (more cache hits) and reduces write latency

Write Deferral
- Clients keep writes in local memory when the server load is high
- Saves server memory and I/O (disk and memory)
- Increases write latency and client memory usage
Resource Utilization Metrics
- CPU: utilization at a given time
- Network: average bandwidth over hundreds of milliseconds
- Server disk: sampling the length of the device's dispatch queue at regular small intervals (sketched below)
- Memory: projected cache hit rates calculated from the distribution of read requests
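As one concrete example of the disk metric, a sketch of sampling the dispatch queue on Linux, where /sys/block/<dev>/inflight reports in-flight read and write requests; the device name and sampling parameters are placeholders.

```python
import time

def avg_dispatch_queue(dev="sda", samples=100, interval=0.01):
    # Sample the number of in-flight requests at regular small intervals
    # and average them as a proxy for disk utilization.
    total = 0
    for _ in range(samples):
        with open(f"/sys/block/{dev}/inflight") as f:
            reads, writes = map(int, f.read().split())
        total += reads + writes
        time.sleep(interval)
    return total / samples
```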
Discussion Points
Limitations
- Not scalable: works only for a small number of clients
- The price can fluctuate
- Multiple clients on separate VMs can run on the same node
- Multiple servers can run on separate nodes holding replicated data