Prasanth Kothuri, Danilo Piparo, Enric Tejedor Saavedra, Diogo Castro (CERN IT and EP-SFT)
CMSSpark [3]
Spark is used to parse and extract useful aggregated information from various CMS data streams stored on HDFS; a minimal PySpark sketch follows the references below
[1] https://indico.cern.ch/event/587955/contributions/2937899/
[2] https://cms-big-data.github.io/
[3] https://github.com/vkuznet/CMSSpark
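A minimal sketch of the kind of aggregation CMSSpark performs. The HDFS path, field names and output location are hypothetical illustrations, not taken from the actual CMSSpark code:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cms-stream-aggregation").getOrCreate()

    # Hypothetical CMS data stream on HDFS; the real CMSSpark readers parse
    # specific CMS sources (DBS, PhEDEx, AAA, EOS, CRAB, ...).
    df = spark.read.json("hdfs:///project/monitoring/cms/some_stream/2018/*")

    # Aggregate per site: number of records and total bytes read.
    summary = (df.groupBy("site")
                 .agg(F.count("*").alias("n_records"),
                      F.sum("read_bytes").alias("total_read_bytes")))

    summary.write.mode("overwrite").csv("hdfs:///user/analyst/cms_summary")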
Integration of SWAN with Spark Clusters
SWAN – Service for Web based ANalysis [1]
A collaboration between EP-SFT, IT-ST and IT-DB
[1] https://doi.org/10.1016/j.future.2016.11.035
Integrating Services
[Diagram: SWAN integrates services for software, compute and storage, with isolation and local compute]
SWAN – Architecture
[Architecture diagram: users (User 1 ... User n) reach the web portal via SSO; a container scheduler spawns per-user sessions; the Spark driver talks to an AppMaster and Spark workers running Python tasks; storage backends: EOS (data), CVMFS (software), CERNBox (user files)]
SWAN Interface
Scalable Analytics: Spark-clusters with SWAN integration
SWAN_Spark features
Spark Connector – hides the complexity of Spark configuration
The user is presented with a Spark session (spark) and a Spark context (sc), as in the sketch below
Ability to bundle configurations specific to user communities
Ability to specify additional configuration
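A sketch of a notebook cell after attaching a cluster through the connector; the session and context already exist, so no builder code is needed (the option queried at the end is an arbitrary example of "additional configuration"):

    # `spark` (SparkSession) and `sc` (SparkContext) are injected by the
    # Spark connector once the user attaches a cluster.
    df = spark.range(10**6)
    df.selectExpr("count(*)", "sum(id)").show()

    # Inspect what the connector configured: the cluster master and any
    # additional option entered in the connector dialog.
    print(sc.master)
    print(spark.conf.get("spark.executor.memory", "default"))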
SWAN_Spark features
Spark Monitor – a Jupyter notebook extension
Live monitoring of Spark jobs spawned from the notebook (e.g. the toy job sketched below)
Access to the Spark web UI from the notebook
Several other features to debug and troubleshoot Spark applications
Developed in the context of the HSF Google Summer of Code program [1]
[1] http://hepsoftwarefoundation.org/gsoc/2017/proposal_ROOTspark.html
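Any job started from a cell is picked up by the monitor; a toy example, assuming a SWAN notebook with the extension enabled:

    # Running this cell spawns Spark jobs whose stages, tasks and timeline
    # appear live in the SparkMonitor display attached to the cell.
    rdd = sc.parallelize(range(10**7), numSlices=16)
    total = rdd.map(lambda x: x * x).sum()
    print(total)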
SWAN_Spark features
HDFS Browser – jupyter notebook extension
browsing the Hadoop Distributed File System from notebook
useful for selection of the datasets for analysis
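The extension itself is point-and-click; the rough equivalent from a notebook cell, assuming the hdfs CLI is available in the session environment (an assumption, and the path is hypothetical):

    import subprocess

    # List a candidate dataset directory, as the HDFS Browser does graphically.
    out = subprocess.run(["hdfs", "dfs", "-ls", "/project/monitoring/cms"],
                         capture_output=True, text=True)  # Python 3.7+
    print(out.stdout)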
[Screenshot: a single notebook combining text, code, live monitoring and visualizations]
XRootD connector for Hadoop and Spark
A library that binds the Hadoop file system API to the native XRootD client
Developed by the CERN IT department
Allows most components of the Hadoop stack (Spark, MapReduce, Hive, etc.) to read from and write to EOS and CASTOR directly
Works with Grid certificates and Kerberos for authentication
Used for HDFS backups and for analytics on data stored on EOS / CERNBox (read/write sketch below)
[Diagram: Spark and the Hadoop stack (Java) reach EOS through the Hadoop-XRootD connector and JNI bindings to the C++ XRootD client, alongside native access to HDFS (analytix)]
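With the connector on the classpath, Spark can address EOS through root:// URLs. A sketch assuming an existing SparkSession `spark`; the hosts, paths and the `run` column are hypothetical, and authentication (Kerberos or a Grid proxy) is assumed to be already set up:

    # Read a dataset directly from EOS via the Hadoop-XRootD connector;
    # no staging to HDFS is needed.
    df = spark.read.parquet("root://eospublic.cern.ch//eos/opendata/some/dataset")

    # Write results back to EOS the same way.
    df.groupBy("run").count().write.mode("overwrite") \
      .parquet("root://eosuser.cern.ch//eos/user/j/jdoe/spark_out")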
Challenges
Spark on YARN satisfies the needs of stable, predictable production workloads from NXCALS, WLCG and CC monitoring, IT security and other smaller communities
Future demand
CERN EP-SFT and the CMS Big Data project are investigating the use of Spark for physics analysis
Physics data is stored in an external storage system: EOS, with over 250 PB
Spark on Kubernetes
On-demand elastic resource provisioning of Spark for data processing
Spawn, resize or shut down a cluster of tens or hundreds of nodes in minutes
A set of tools to manage the Spark cluster and submit Spark jobs
Data is decoupled from compute (Spark): data is stored externally (Kafka, EOS/S3 storage, ...) and processing happens as in the cloud model
The Spark on Kubernetes architecture is much simpler and easier to maintain than YARN (a connection sketch follows below)
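A sketch of pointing a Spark session at a Kubernetes cluster (Spark 2.4 client mode assumed; the API server URL and image name are hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("k8s://https://my-k8s-api.cern.ch:6443")  # hypothetical API server
             .appName("elastic-spark")
             # Executors run as pods created from this container image.
             .config("spark.kubernetes.container.image", "myregistry/spark:2.4.0")
             # Resizing the workload is a matter of asking for more executor pods.
             .config("spark.executor.instances", "50")
             .getOrCreate())

    print(spark.range(1000).count())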
Elastic Resource Provisioning with Spark on Kubernetes
[Diagram: two Spark-on-Kubernetes clusters (compute only), each running a Spark driver and executors; jobs are submitted with the sparkctl tool from a client host (Linux, Mac or lxplus)]
Spark on Kubernetes architecture
Spark on Kubernetes@CERN – Current Status
Possible to run Spark on Kubernetes on OpenStack
The Spark version for the driver and executors is taken from the master branch (2.4.0)
The Kubernetes cluster is created in OpenStack projects owned by the user
The S3 service is used to store event logs and checkpoints for Spark streaming (a sketch follows at the end of this section)
Early Adopters
CMS Big Data Project for their Data Reduction Facility [1]
Future demand from users, especially the use of Spark for physics analysis by the experiments, requires a different deployment model.
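A sketch of the decoupled-storage setup mentioned above: event logs and streaming checkpoints go to S3 rather than to cluster-local storage, so the Kubernetes cluster itself stays stateless (the bucket names and the rate source are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.eventLog.enabled", "true")
             .config("spark.eventLog.dir", "s3a://spark-history/events")  # hypothetical bucket
             .getOrCreate())

    # A toy streaming query whose checkpoints also live in S3.
    (spark.readStream.format("rate").load()
          .writeStream.format("console")
          .option("checkpointLocation", "s3a://spark-checkpoints/demo")
          .start())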