Prasanth Kothuri, Danilo Piparo, Enric Tejedor Saavedra, Diogo Castro (CERN IT and EP-SFT)
CMSSpark [3]
Spark is used to parse and extract useful aggregated information from various CMS data streams stored on HDFS; a minimal PySpark sketch follows the references below
[1] https://indico.cern.ch/event/587955/contributions/2937899/
[2] https://cms-big-data.github.io/
[3] https://github.com/vkuznet/CMSSpark
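A minimal sketch of the kind of aggregation CMSSpark performs. The HDFS path, field names and output location are hypothetical illustrations, not taken from the actual CMSSpark code:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cms-stream-aggregation").getOrCreate()

    # Hypothetical CMS data stream on HDFS; the real CMSSpark readers parse
    # specific CMS sources (DBS, PhEDEx, AAA, EOS, CRAB, ...).
    df = spark.read.json("hdfs:///project/monitoring/cms/some_stream/2018/*")

    # Aggregate per site: number of records and total bytes read.
    summary = (df.groupBy("site")
                 .agg(F.count("*").alias("n_records"),
                      F.sum("read_bytes").alias("total_read_bytes")))

    summary.write.mode("overwrite").csv("hdfs:///user/analyst/cms_summary")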
Integration of SWAN with Spark Clusters
SWAN – Service for Web based ANalysis [1]
A collaboration between EP-SFT, IT-ST and IT-DB
[1] https://doi.org/10.1016/j.future.2016.11.035
Integrating Services
[Diagram: SWAN integrates services for software, compute and storage, with isolation and local compute]
SWAN – Architecture
[Architecture diagram: users (User 1 ... User n) reach the web portal via SSO; a container scheduler spawns per-user sessions; the Spark driver talks to an AppMaster and Spark workers running Python tasks; storage backends: EOS (data), CVMFS (software), CERNBox (user files)]
SWAN Interface
Scalable Analytics: Spark-clusters with SWAN integration
SWAN_Spark features
Spark Connector – hides the complexity of Spark configuration
The user is presented with a Spark session (spark) and a Spark context (sc), as in the sketch below
Ability to bundle configurations specific to user communities
Ability to specify additional configuration
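A sketch of a notebook cell after attaching a cluster through the connector; the session and context already exist, so no builder code is needed (the option queried at the end is an arbitrary example of "additional configuration"):

    # `spark` (SparkSession) and `sc` (SparkContext) are injected by the
    # Spark connector once the user attaches a cluster.
    df = spark.range(10**6)
    df.selectExpr("count(*)", "sum(id)").show()

    # Inspect what the connector configured: the cluster master and any
    # additional option entered in the connector dialog.
    print(sc.master)
    print(spark.conf.get("spark.executor.memory", "default"))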
SWAN_Spark features
Spark Monitor – a Jupyter notebook extension
Live monitoring of Spark jobs spawned from the notebook (e.g. the toy job sketched below)
Access to the Spark web UI from the notebook
Several other features to debug and troubleshoot Spark applications
Developed in the context of the HSF Google Summer of Code program [1]
[1] http://hepsoftwarefoundation.org/gsoc/2017/proposal_ROOTspark.html
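Any job started from a cell is picked up by the monitor; a toy example, assuming a SWAN notebook with the extension enabled:

    # Running this cell spawns Spark jobs whose stages, tasks and timeline
    # appear live in the SparkMonitor display attached to the cell.
    rdd = sc.parallelize(range(10**7), numSlices=16)
    total = rdd.map(lambda x: x * x).sum()
    print(total)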
SWAN_Spark features
HDFS Browser – jupyter notebook extension
browsing the Hadoop Distributed File System from notebook
useful for selection of the datasets for analysis
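The extension itself is point-and-click; the rough equivalent from a notebook cell, assuming the hdfs CLI is available in the session environment (an assumption, and the path is hypothetical):

    import subprocess

    # List a candidate dataset directory, as the HDFS Browser does graphically.
    out = subprocess.run(["hdfs", "dfs", "-ls", "/project/monitoring/cms"],
                         capture_output=True, text=True)  # Python 3.7+
    print(out.stdout)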
[Screenshot: a single notebook combining text, code, live monitoring and visualizations]
XRootD connector for Hadoop and Spark
A library that binds the Hadoop file system API to the native XRootD client
Developed by the CERN IT department
Allows most components of the Hadoop stack (Spark, MapReduce, Hive, etc.) to read from and write to EOS and CASTOR directly
Works with Grid certificates and Kerberos for authentication
Used for HDFS backups and for analytics on data stored on EOS / CERNBox (read/write sketch below)
[Diagram: Spark and the Hadoop stack (Java) reach EOS through the Hadoop-XRootD connector and JNI bindings to the C++ XRootD client, alongside native access to HDFS (analytix)]
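With the connector on the classpath, Spark can address EOS through root:// URLs. A sketch assuming an existing SparkSession `spark`; the hosts, paths and the `run` column are hypothetical, and authentication (Kerberos or a Grid proxy) is assumed to be already set up:

    # Read a dataset directly from EOS via the Hadoop-XRootD connector;
    # no staging to HDFS is needed.
    df = spark.read.parquet("root://eospublic.cern.ch//eos/opendata/some/dataset")

    # Write results back to EOS the same way.
    df.groupBy("run").count().write.mode("overwrite") \
      .parquet("root://eosuser.cern.ch//eos/user/j/jdoe/spark_out")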
Challenges
Spark on YARN satisfies the needs of stable, predictable production workloads from NXCALS, WLCG and CC monitoring, IT security and other smaller communities
Future demand
CERN EP-SFT and the CMS Big Data project are investigating the use of Spark for physics analysis
Physics data is stored in an external storage system: EOS, with over 250 PB
Spark on Kubernetes
On-demand elastic resource provisioning of Spark for data processing
Spawn, resize or shut down a cluster of tens or hundreds of nodes in minutes
A set of tools to manage the Spark cluster and submit Spark jobs
Data is decoupled from compute (Spark): data is stored externally (Kafka, EOS/S3 storage, ...) and processing happens as in the cloud model
The Spark on Kubernetes architecture is much simpler and easier to maintain than YARN (a connection sketch follows below)
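A sketch of pointing a Spark session at a Kubernetes cluster (Spark 2.4 client mode assumed; the API server URL and image name are hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("k8s://https://my-k8s-api.cern.ch:6443")  # hypothetical API server
             .appName("elastic-spark")
             # Executors run as pods created from this container image.
             .config("spark.kubernetes.container.image", "myregistry/spark:2.4.0")
             # Resizing the workload is a matter of asking for more executor pods.
             .config("spark.executor.instances", "50")
             .getOrCreate())

    print(spark.range(1000).count())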
Elastic Resource Provisioning with Spark on Kubernetes
[Diagram: two Spark-on-Kubernetes clusters (compute only), each running a Spark driver and executors; jobs are submitted with the sparkctl tool from a client host (Linux, Mac or lxplus)]
Spark on Kubernetes architecture
Spark on Kubernetes@CERN – Current Status
Possible to run Spark on Kubernetes on OpenStack
The Spark version for the driver and executors is taken from the master branch (2.4.0)
The Kubernetes cluster is created in OpenStack projects owned by the user
The S3 service is used to store event logs and checkpoints for Spark streaming (a sketch follows at the end of this section)
Early Adopters
CMS Big Data Project for their Data Reduction Facility [1]
Future demand from users, especially the use of Spark for physics analysis by the experiments, requires a different deployment model.
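A sketch of the decoupled-storage setup mentioned above: event logs and streaming checkpoints go to S3 rather than to cluster-local storage, so the Kubernetes cluster itself stays stateless (the bucket names and the rate source are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.eventLog.enabled", "true")
             .config("spark.eventLog.dir", "s3a://spark-history/events")  # hypothetical bucket
             .getOrCreate())

    # A toy streaming query whose checkpoints also live in S3.
    (spark.readStream.format("rate").load()
          .writeStream.format("console")
          .option("checkpointLocation", "s3a://spark-checkpoints/demo")
          .start())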