Integrating HPC into the
ATLAS Distributed
Computing environment
Doug Benjamin
Duke University
HPC Boundary conditions
• There are many scientific HPC machines across the US and the
world.
o Need to design a system that is general enough to work on
many different machines
• Each machine is independent of the others
o The “grid” side of the equation must aggregate the
information
• There are several different machine architectures
o ATLAS jobs will not run unchanged on many of the machines
o Need to compile programs for each HPC machine
o Memory per node (each with multiple cores) varies from
machine to machine
• The computational nodes typically do not have connectivity to
the Internet
o Connectivity is through a login node/edge machine
o Pilot jobs typically cannot run directly on the computational
nodes
o The TCP/IP stack is missing on the computational nodes
Introduction
• My take on the comparison between HPC and HTC (grid):
o HTC – goes fast and steady
o HPC – goes really fast
o Similar but different
Additional HPC issues
• Each HPC machine has its own job management
system
• Each HPC machine has its own identity
management system
• Login/Interactive nodes have mechanisms for
fetching information and data files
• HPC computational nodes typically run MPI workloads
• Can get a large number of nodes
• The latency between job submission and
completion can be variable (many other users share the machine)
Work Flow
• Some ATLAS simulation jobs can be broken up into
3 components, sketched in code after this list
(Tom LeCompte's talk covered this in greater detail)
1. Preparatory phase - Make the job ready for HPC
o For example - generate computational grid for Alpgen
o Fetch Database files for Simulation
o Transfer input files to HPC system
2. Computational phase – can be done on HPC
o Generate events
o Simulate events
3. Post Computational phase (Cleanup)
o Collect output files (log files, data files) from HPC jobs
o Verify output
o Unweight (if needed) and merge files
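As a rough sketch of this split, the three phases map onto something like the skeleton below; every function name here is a hypothetical placeholder, not an actual ATLAS tool.

```python
# Hypothetical skeleton of the three-phase split; the names stand in for
# real ATLAS tools and are not an actual implementation.

def preparatory_phase(job):
    """Grid (HTC) side: make the job ready for the HPC machine."""
    # e.g. generate the computational grid for Alpgen, fetch database
    # files for simulation, and transfer the input files to the HPC system
    pass

def computational_phase(job):
    """HPC side: the CPU-heavy work (generate and simulate events)."""
    pass

def post_computational_phase(job):
    """Grid (HTC) side: collect and verify output, unweight and merge files."""
    pass

if __name__ == "__main__":
    job = {"task_id": 12345}          # hypothetical job description
    preparatory_phase(job)
    computational_phase(job)
    post_computational_phase(job)
```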
HTC->HPC->HTC
• The ATLAS job management system (PanDA) need not run
on the HPC system
o This represents a simplification
o NorduGrid has been running this way for a while
• PanDA requires pilot jobs
• AutoPyFactory (APF) is used to submit PanDA pilots
• Direct submission of pilots to a Condor queue works
well.
o Many cloud sites use this mechanism – straightforward to use
• The HPC portion should be coupled to, but independent of,
the HTC workflow.
o Use a messaging system to send messages between the
domains
o Use grid tools to move files between HTC and HPC (see the sketch below)
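For the file-movement step, a plain GridFTP client call from the HTC side is enough. Below is a minimal sketch using globus-url-copy; the endpoint host names and paths are illustrative assumptions, not the actual ATLAS endpoints.

```python
# Minimal sketch: push an input tarball from an HTC storage element to the
# HPC-side GridFTP endpoint. Host names and paths are illustrative only.
import subprocess

src = "gsiftp://htc-se.example.org/atlas/inputs/job123/input.tar.gz"
dst = "gsiftp://hpc-gridftp.example.org/scratch/atlas/job123/input.tar.gz"

# -p 4: use four parallel TCP streams; -vb: report transfer performance.
subprocess.check_call(["globus-url-copy", "-p", "4", "-vb", src, dst])
```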
Infrastructure
• APF (AutoPyFactory) pilot factory to submit pilots
• PanDA queue – currently testing an ANALY queue
• Local batch system
• Web server to provide steering XML files to the HPC
domain (see the sketch after this list)
• Message Broker system to exchange information
between Grid Domain and HPC domain
• GridFTP server to transfer files between the HTC
domain and the HPC domain.
o Globus Online might be a good solution here (what are
the costs?)
• ATLAS DDM site – SRM and GridFTP server(s).
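As an illustration of the web-server piece, the HPC-side code could pull its steering file over HTTP and parse it as below; the URL and the XML element and attribute names are purely hypothetical placeholders, not the actual steering schema.

```python
# Fetch and parse a steering XML file from the Grid-side web server.
# The URL and element/attribute names are hypothetical placeholders.
import urllib.request
import xml.etree.ElementTree as ET

url = "https://grid-frontend.example.org/steering/job123.xml"  # assumed URL
with urllib.request.urlopen(url) as resp:
    tree = ET.parse(resp)

for step in tree.getroot().findall("step"):       # assumed element name
    print(step.get("name"), step.get("executable"))
```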
HPC code stack
• Work done by Tom Uram - ANL
• Work on HPC side is performed by two components
o Service: Interacts with message broker to retrieve job descriptions, saves jobs in
local database, notifies message broker of job state changes
o Daemon: Stages input data from HTC GridFTP server, submits job to queue,
monitors progress of job, and stages output data to HTC GridFTP server
• The Service and Daemon are built in Python, using the
Django Object-Relational Mapper (ORM) to
communicate with the shared underlying database (a model sketch follows below)
o Django is a stable, open-source project with an active community
o Django supports several database backends
• Current implementation relies on GridFTP for data
transfer and the ALCF Cobalt scheduler
• Modular design enables future extension to alternative
data transfer mechanisms and schedulers
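For illustration, the shared job table could be a Django model along the following lines; the model and field names are assumptions for this sketch, not the actual ANL schema. In this picture the Service inserts rows and reports state changes to the broker, while the Daemon advances each row through stage-in, submission and stage-out.

```python
# Illustrative Django model for the shared job database.
# Model and field names are assumptions, not the actual schema.
from django.db import models

class Job(models.Model):
    STATES = [
        ("CREATED", "Created"),        # received from the message broker
        ("STAGED_IN", "Staged in"),    # inputs pulled via GridFTP
        ("QUEUED", "Queued"),          # submitted to the Cobalt scheduler
        ("RUNNING", "Running"),
        ("STAGED_OUT", "Staged out"),  # outputs pushed back via GridFTP
        ("FAILED", "Failed"),
    ]
    panda_id = models.BigIntegerField(unique=True)
    state = models.CharField(max_length=16, choices=STATES, default="CREATED")
    input_url = models.URLField()
    output_url = models.URLField()
    last_update = models.DateTimeField(auto_now=True)
```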
Message Broker system
• System must have large community support beyond
just HEP
• Solution must be open source (keeps costs
manageable)
• Message Broker system must have good
documentation
• Scalable
• Robust
• Secure
• Easy to use
• Must use a standard protocol (AMQP 0-9-1 for
example)
• Clients available in multiple languages (e.g. Java/Python)
RabbitMQ message broker
• ActiveMQ and RabbitMQ were evaluated.
• Google analytics show that both are equally popular
• Benchmark measurements show that the RabbitMQ
server outperforms ActiveMQ
• Found message routing and our workflow easier to
handle with RabbitMQ
• The Pika Python client is easy to use (see the sketch below).
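A minimal Pika example of publishing a persistent message to a durable queue is sketched below; the broker host, queue name and payload are illustrative assumptions.

```python
# Publish a persistent message to a durable RabbitMQ queue with Pika.
# Broker host, queue name and payload are illustrative assumptions.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("broker.example.org"))
channel = connection.channel()

# A durable queue survives a broker restart; delivery_mode=2 makes the
# message itself persistent.
channel.queue_declare(queue="hpc.alcf.jobs", durable=True)
channel.basic_publish(
    exchange="",
    routing_key="hpc.alcf.jobs",
    body=json.dumps({"panda_id": 12345, "action": "submit"}),
    properties=pika.BasicProperties(delivery_mode=2),
)
connection.close()
```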
Basic Message Broker design
• Each HPC machine has multiple permanent, durable queues.
o One queue per activity on HPC
o Grid jobs send messages to HPC machines through these queues
o Each HPC will consume messages from these queues
o A routing string is used to direct each message to the proper place
• Each Grid Job will have multiple durable queues
o One queue per activity (Step in process)
o Grid job creates the queues before sending any message to HPC queues
o On completion of the grid job, its queues are removed
o Each HPC cluster publishes messages to these queues through an
exchange
o A routing string is used to direct each message to the proper place
o Each grid job consumes messages only from its own queues.
• Grid domains and HPC domains have independent
polling loops
• Message producer and client code needs to be
tweaked for additional robustness (see the Pika sketch below)
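The queue layout above can be wired up with Pika roughly as follows; the exchange, queue, and routing-key names are illustrative assumptions, and basic_get is used so that each domain keeps its own independent polling loop.

```python
# Sketch of the per-activity queue layout: a durable direct exchange,
# one durable queue per activity, and routing strings to steer messages.
# Exchange, queue and routing-key names are illustrative assumptions.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("broker.example.org"))
channel = connection.channel()

# Durable direct exchange shared by the HTC and HPC domains.
channel.exchange_declare(exchange="atlas.hpc", exchange_type="direct", durable=True)

# One durable queue per activity on this HPC machine, bound by routing key.
for activity in ("stage_in", "run", "stage_out"):
    queue = f"alcf.{activity}"
    channel.queue_declare(queue=queue, durable=True)
    channel.queue_bind(exchange="atlas.hpc", queue=queue, routing_key=queue)

# HPC-side polling step: basic_get returns immediately, with or without a
# message, so the daemon controls its own cadence.
method, properties, body = channel.basic_get(queue="alcf.run", auto_ack=True)
if method is not None:
    print("got message:", body)
connection.close()
```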
Open issues for a production
system
• Need federated identity management
o The Grid identity system is not used in the HPC domain
o Need to strictly regulate who can run on HPC machines
• Security, security (need I say more?)
• What is the proper scale for the Front-End grid
cluster?
o How many nodes are needed?
o How much data needs to be merged?
• Panda system must be able to handle large
latencies.
o Could expect jobs to wait a week before running
o Could be flooded with output once the jobs run.
• The production task system should give the HTC-HPC system
flexibility to decide how to arrange the task.
o HPC scheduling decisions might require a different task geometry to get
the work through in an expedient manner
Conclusions
• Many ATLAS MC jobs can be divided into a Grid
(HTC) component and an HPC component
• Have demonstrated that, using existing ATLAS
tools, we can design and build a system to
send jobs from the Grid to HPC and back to the Grid
• Modular design of all components makes it easier
to add new HPC sites and clone the HTC side if
needed for scaling reasons.
• Lessons learned from the NorduGrid PanDA integration
will be helpful
• A lightweight yet powerful system is being
developed.