A Client-centric Grid Knowledgebase

George Kola, Tevfik Kosar and Miron Livny


University of Wisconsin-Madison

September 23rd, 2004

Cluster 2004
San Diego, CA
Grid Trivia

▪ How many of you have submitted a job to Grid resources and never heard back from it?
▪ How many of you got mad at the inconsistent behavior of some grid resources?
• Some jobs complete successfully while others fail.
• Similar jobs perform completely differently.

... We did!

Goal: Prevent Unexpected Behavior in a Grid
▪ Learn from experience and prevent failures from repeating in the future.
▪ Causes of unexpected behavior in a Grid:
• Black holes
• Resources with
– Faulty hardware
– Buggy or misconfigured software
• Extremely slow computational sites
• Memory leaks
• etc.


Black holes
▪ Definition: “A black hole is a region of spacetime from which nothing can escape, even light.”
▪ If you send a light beam into a black hole, you never hear back from it.
▪ You only know it is there after you have encountered it. Is it too late by then?
• No. You should learn from experience.


Black holes in the Grid


▪ Resources that accept jobs but never complete them
• You send a job to a resource but never hear back from it.

Black hole examples from real life:
▪ In the WCER educational video processing pipeline:
• A specific pool accepted and processed our jobs for a couple of hours, but evicted them before completion.
• A machine accepted a job, but due to a memory leak it kept throwing “shadow exceptions” and retrying the job forever.
• Some third-party transfers (GridFTP, DiskRouter) occasionally hung and never returned.
• A machine with a corrupted FPU caused errors: it completed MPEG-1 encoding successfully but failed MPEG-4 encoding.

The Grid is good, but not perfect

▪ Heterogeneous resources
▪ Multiple administrative domains
▪ Spans wide area networks
▪ Consists of commodity hardware and software

Prone to network, hardware, software, and middleware failures!

We cannot expect everything from the Grid or Grid middleware!

Take the Ethernet Approach
▪ A truly distributed (and very effective) access control protocol for a shared service
▪ Client responsible for access control
▪ Client responsible for error detection
▪ Client responsible for fairness

Keep track of job/resource performance and failure characteristics as observed by the client.
Use job/user log files collected at the client side to build a grid knowledgebase.

Grid Knowledgebase
▪ Parse user/job log files
▪ Load them into a database
▪ Aggregate the experience of different jobs
▪ Interpret it
▪ Plan action
▪ Generate feedback to the scheduler as well as to the user
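
The first three steps above (parse, load, aggregate) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-line log format, the `events` table, and the function names are invented for this sketch and are not the log format or schema used by the authors.

```python
import re
import sqlite3

# Hypothetical log format, invented for this sketch:
#   "<job_id> <event> <host> <unix_time>"
LINE = re.compile(r"^(\d+) (\w+) (\S+) (\d+)$")


def parse_log(lines):
    """Yield (job_id, event, host, time) tuples from well-formed lines."""
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            yield int(m.group(1)), m.group(2), m.group(3), int(m.group(4))


def load(conn, events):
    """Load parsed events into a single (assumed) events table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(job_id INTEGER, event TEXT, host TEXT, time INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)


def unfinished_by_host(conn):
    """Aggregate: per host, count jobs that started executing but never terminated."""
    rows = conn.execute(
        """SELECT e.host, COUNT(*) FROM events e
           WHERE e.event = 'execute'
             AND NOT EXISTS (SELECT 1 FROM events t
                             WHERE t.job_id = e.job_id
                               AND t.event = 'terminate')
           GROUP BY e.host ORDER BY e.host"""
    )
    return dict(rows.fetchall())


# Made-up sample data: job 1 finishes on hostA, jobs 2 and 3 never return from hostB.
sample_log = [
    "1 submit client 100",
    "1 execute hostA 110",
    "1 terminate hostA 500",
    "2 submit client 105",
    "2 execute hostB 120",
    "3 submit client 106",
    "3 execute hostB 130",
]
conn = sqlite3.connect(":memory:")
load(conn, parse_log(sample_log))
feedback = unfinished_by_host(conn)
print(feedback)
```

A per-host count like this is one possible input to the interpret and plan steps, for example as a hint that the scheduler should stop matching jobs to that host.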

[Figure: Grid job flow. Components shown: Planner, Job Descriptions, Job Queue, Match Maker, Job Scheduler, Grid Resources (clusters, storage servers, personal computers), Job Logs.]

[Figure: The same job flow extended with the Grid Knowledgebase. Additional components shown: Job Parser, Database, Data Miner, Adaptation Layer, Notification Layer.]

Database Schema

Field               Type
JobId               int
JobName             string
State               int
SubmitHost          string
SubmitTime          int
ExecuteHost         string[]
ExecuteTime         string[]
ImageSize           int[]
ImageSizeTime       int[]
EvictTime           int[]
Checkpointed        bool[]
EvictReason         string
TerminateTime       int[]
TotalLocalUsage     string
TotalRemoteUsage    string
TerminateMessage    string
ExceptionTime       int[]
ExceptionMessage    string[]

[Figure: job state diagram with the nodes User, Submit, Schedule, Execute, Suspend, Un-suspend, Evicted, Exception, Terminated Abnormally, and Terminated Normally. An "Exit code = 0?" check leads to Job Succeeded on yes and Job Failed on no.]
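
As a rough illustration of how one row of this schema might be held in memory, the sketch below mirrors the fields in a Python dataclass. The snake_case names, the default values, and the `attempts` helper are assumptions of this sketch; only the field list and types come from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobRecord:
    """One job as seen from the client; array fields hold one entry per event."""
    job_id: int
    job_name: str
    state: int
    submit_host: str
    submit_time: int
    execute_host: List[str] = field(default_factory=list)
    execute_time: List[str] = field(default_factory=list)
    image_size: List[int] = field(default_factory=list)
    image_size_time: List[int] = field(default_factory=list)
    evict_time: List[int] = field(default_factory=list)
    checkpointed: List[bool] = field(default_factory=list)
    evict_reason: str = ""
    terminate_time: List[int] = field(default_factory=list)
    total_local_usage: str = ""
    total_remote_usage: str = ""
    terminate_message: str = ""
    exception_time: List[int] = field(default_factory=list)
    exception_message: List[str] = field(default_factory=list)

    def attempts(self) -> int:
        """Hypothetical helper: number of hosts the job was started on."""
        return len(self.execute_host)


# Made-up example: a job that was evicted once and then restarted elsewhere.
record = JobRecord(job_id=1, job_name="encode", state=0,
                   submit_host="client", submit_time=100)
record.execute_host += ["hostA", "hostB"]
record.evict_time.append(250)
print(record.attempts(), len(record.evict_time))
```

Array-typed fields suggest that a single job can execute, be evicted, and raise exceptions several times before it terminates, which is why the sketch uses lists rather than scalars for them.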

Difference from existing approaches
▪ Client view
▪ Uses only job/user log files at the client side
• Many administrators do not want to share resource/scheduler log files.
▪ We do not need to know everything going on in the whole grid
• Scalable

What do we get?
▪ Collecting job execution time statistics
• Average job execution time
• Standard deviation
• Fit a distribution
▪ Detect and avoid black holes
• For a normal distribution:
– 99.7% of job execution times should lie between (avg-3*stdev) and (avg+3*stdev)
– About 95% of job execution times should lie between (avg-2*stdev) and (avg+2*stdev)
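
A minimal sketch of this rule in Python follows, assuming execution times are roughly normally distributed and using the sample standard deviation (the slides do not say which estimator was used). The function names and the sample history are made up for illustration.

```python
from statistics import mean, stdev


def execution_time_limits(times, k=3.0):
    """Return (lower, upper) = avg -/+ k*stdev, with the lower bound clamped at 0."""
    avg = mean(times)
    sd = stdev(times)  # sample standard deviation: an assumption of this sketch
    return max(0.0, avg - k * sd), avg + k * sd


def looks_like_black_hole(elapsed, limits):
    """A job still running past the upper limit is treated as suspect."""
    return elapsed > limits[1]


# Made-up history of execution times in minutes.
history = [6.0, 7.0, 8.0, 9.0, 10.0]
limits = execution_time_limits(history)
print(limits)
print(looks_like_black_hole(20.0, limits), looks_like_black_hole(9.0, limits))
```

Choosing k trades false alarms against detection delay: k=3 rarely flags a healthy job, while k=2 reacts sooner but would flag a few percent of normal runs.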


Detecting hanging transfers

[Figure: Transfer Time (T) vs Probability (t<T). Vertical axis: Probability (t<T) in %. Horizontal axis: Transfer Time (T) in minutes, with tick labels from 4.6 to 15.3.]
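
One way to turn such a cumulative curve into a timeout is to pick the transfer time below which a chosen percentage of past transfers finished. The sketch below does this with an empirical distribution; the sample times are invented and are not the data behind the figure, and the function names are hypothetical.

```python
import math


def probability_below(samples, T):
    """Empirical P(t < T): fraction of observed transfers that took less than T."""
    return sum(1 for t in samples if t < T) / len(samples)


def timeout_for(samples, percent):
    """Smallest observed time T such that at least `percent`% of samples are <= T."""
    ordered = sorted(samples)
    index = max(0, math.ceil(percent * len(ordered) / 100) - 1)
    return ordered[index]


# Made-up transfer times in minutes.
transfer_minutes = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0, 10.0, 15.0]
cutoff = timeout_for(transfer_minutes, 90)
print(cutoff, probability_below(transfer_minutes, cutoff))
```

A transfer that is still running after `cutoff` minutes could then be treated as hung and restarted, at the cost of occasionally killing a slow but healthy transfer.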

Setting Execution Time Limits
▪ Avg = 7.8 min
▪ Stdev = 3.17 min
▪ For a normal distribution:
• 99.7%: [0 – 17.31 min]
• About 95%: [1.46 min – 14.14 min]
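
The interval endpoints follow directly from the average and standard deviation. Writing \mu for Avg and \sigma for Stdev (notation introduced here, not on the slide), the arithmetic is as follows; the lower 3\sigma bound comes out negative, and since an execution time cannot be negative it is presumably clamped to 0:

```latex
\begin{aligned}
\mu - 3\sigma &= 7.8 - 3(3.17) = 7.8 - 9.51 = -1.71 \;\rightarrow\; 0 \\
\mu + 3\sigma &= 7.8 + 9.51 = 17.31 \\
\mu - 2\sigma &= 7.8 - 2(3.17) = 7.8 - 6.34 = 1.46 \\
\mu + 2\sigma &= 7.8 + 6.34 = 14.14
\end{aligned}
```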

What do we get? (2)
▪ Identifying misconfigured machines
• e.g. find the set of machines that fail jobs with I/O data size larger than 2 GB (i.e. OS limitations)
▪ Identifying factors affecting job run-time
▪ Bug hunting
• Job failures on certain inputs
• Memory leaks
– The scheduler logs the image size regularly
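
The 2 GB example can be sketched as a simple grouping query over past job outcomes. The record layout `(host, io_bytes, succeeded)`, the function name, and the sample data are assumptions made for this illustration; a host is flagged only if it succeeds on small jobs but never on large ones.

```python
from collections import defaultdict

TWO_GB = 2 * 1024**3


def hosts_failing_large_io(records, threshold=TWO_GB):
    """records: iterable of (host, io_bytes, succeeded) tuples.

    Returns hosts where every job above the threshold failed
    while at least one job at or below it succeeded.
    """
    large = defaultdict(list)
    small = defaultdict(list)
    for host, io_bytes, ok in records:
        (large if io_bytes > threshold else small)[host].append(ok)
    suspects = []
    for host, outcomes in large.items():
        if not any(outcomes) and any(small.get(host, [])):
            suspects.append(host)
    return sorted(suspects)


GB = 1024**3
# Made-up outcomes: hostA fails only large jobs, hostB is healthy, hostC fails everything.
records = [
    ("hostA", 1 * GB, True), ("hostA", 3 * GB, False), ("hostA", 4 * GB, False),
    ("hostB", 1 * GB, True), ("hostB", 3 * GB, True),
    ("hostC", 1 * GB, False), ("hostC", 3 * GB, False),
]
suspects = hosts_failing_large_io(records)
print(suspects)
```

Requiring success on small jobs separates a size-related limitation from a machine that is broken in general, which would be better caught by other checks.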

Catching Memory Leaks
[Figure: Job memory image size (MB) plotted against time.]
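
Because the image size is logged repeatedly over time, a steadily growing image can be spotted by fitting a line to the logged samples. The sketch below uses an ordinary least-squares slope; the threshold, the function names, and the sample series are assumptions for illustration, not the authors' detection method.

```python
def growth_rate(times, sizes):
    """Least-squares slope of image size against time (e.g. MB per minute)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_s = sum(sizes) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, sizes))
    var = sum((t - mean_t) ** 2 for t in times)
    return cov / var


def looks_like_leak(times, sizes, min_rate):
    """Flag a job whose image size grows faster than min_rate."""
    return growth_rate(times, sizes) > min_rate


# Made-up samples: time in minutes, image size in MB.
minutes = [0, 10, 20, 30, 40]
leaking = [100, 120, 140, 160, 180]
steady = [100, 101, 100, 101, 100]
print(growth_rate(minutes, leaking), looks_like_leak(minutes, steady, 0.5))
```

A slope alone cannot distinguish a leak from a job that legitimately needs more memory as it runs, so in practice the rate would be compared against other jobs of the same class.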

What do we get? (3)
▪ Application optimization
• How long does each step of an application/pipeline take to execute?
▪ Adaptation
• Find the resources that take the least time to execute jobs from a particular class
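
The adaptation step can be illustrated by ranking resources on their mean execution time for one job class. The history layout `(job_class, resource, minutes)`, the class and pool names, and the function name are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fastest_resources(history, job_class, top=1):
    """history: iterable of (job_class, resource, minutes) tuples.

    Returns up to `top` resources with the lowest mean time for job_class.
    """
    per_resource = defaultdict(list)
    for cls, resource, minutes in history:
        if cls == job_class:
            per_resource[resource].append(minutes)
    ranked = sorted(per_resource, key=lambda r: mean(per_resource[r]))
    return ranked[:top]


# Made-up history: poolB is faster for MPEG-4 jobs, poolA for MPEG-1 jobs.
history = [
    ("mpeg4", "poolA", 12.0), ("mpeg4", "poolA", 14.0),
    ("mpeg4", "poolB", 8.0), ("mpeg4", "poolB", 9.0),
    ("mpeg1", "poolA", 3.0), ("mpeg1", "poolB", 6.0),
]
best_mpeg4 = fastest_resources(history, "mpeg4")
best_mpeg1 = fastest_resources(history, "mpeg1")
print(best_mpeg4, best_mpeg1)
```

Ranking per job class matters because, as in this made-up data, the best resource for one kind of job need not be the best for another.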

Conclusions
▪ View of the Grid from the client side
▪ Job/user log files as the main source of information
▪ Aggregate the experience of different jobs and pass it on to future ones
▪ Helps in:
• Catching black holes
• Identifying faulty/misconfigured resources
• Bug tracking
• Statistics collection
▪ Future work:
• Merge the experience of different clients


Thank you…
For more information, contact:

Tevfik Kosar
http://www.cs.wisc.edu/~kosart
[email protected]
