Apache Spark Cheatsheet (2014)


TABLE OF CONTENTS

Preface
Introduction to Apache Spark
    What is Apache Spark?
    Why Use Apache Spark?
    Key Features of Apache Spark
    Spark Components Overview
Getting Started with Spark
    Installation and Setup
    Using Spark on a Local Machine
    Initializing Spark
Resilient Distributed Datasets (RDDs)
    Creating RDDs
    Transformations on RDDs
    Actions on RDDs
    RDD Persistence
Structured APIs: DataFrames and Datasets
    Creating DataFrames
    Basic DataFrame Operations
    Aggregations and Grouping
    Working with Datasets
Spark SQL
    Registering and Querying Tables
    Running SQL Queries
    DataFrame to RDD Conversion
Stream Processing with Spark
    DStream Creation
    Transformations on DStreams
    Output Operations for DStreams
Machine Learning with MLlib
    MLlib Overview
    Data Preparation
    Building and Evaluating Models
Graph Processing with GraphX
    Creating Graphs
    Vertex and Edge RDDs
    Graph Algorithms


Cluster Computing and Deployment
    Cluster Manager Selection
    Deploying Spark on Clusters
Performance Tuning and Optimization
    Memory Management
    Parallelism and Partitions
    Caching Strategies
Interacting with External Data Sources
    Reading and Writing Data
    Supported File Formats
    Connecting to Databases
Monitoring and Debugging
    Spark UI
    Logging and Debugging
Integration with Other Tools
    Spark and Hadoop
    Spark and Apache Kafka
    Spark and Jupyter Notebooks
Commonly Used Libraries with Spark


PREFACE

This cheatsheet is designed to provide quick access to the most commonly used Spark components, methods, and practices. Whether you're diving into Spark's resilient distributed datasets (RDDs), exploring the DataFrame and SQL capabilities, or harnessing the advanced machine learning libraries through MLlib, this cheatsheet offers bite-sized code snippets and explanations to facilitate your learning.

INTRODUCTION TO APACHE SPARK

WHAT IS APACHE SPARK?

Apache Spark is an open-source, distributed computing system designed for big data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark's core abstraction is the Resilient Distributed Dataset (RDD), a fault-tolerant collection of elements that can be processed in parallel.

WHY USE APACHE SPARK?

Spark offers significant advantages over traditional MapReduce-based systems, including faster processing speed due to in-memory computation, a wide range of libraries for various data processing tasks, and support for multiple languages such as Java, Scala, Python, and R.

KEY FEATURES OF APACHE SPARK

• Speed: Spark's in-memory processing capability results in faster data processing.
• Ease of Use: Provides high-level APIs in languages like Scala, Python, and Java.
• Versatility: Supports batch processing, interactive queries, streaming, machine learning, and graph processing.
• Fault Tolerance: Recovers lost data using lineage information.
• Advanced Analytics: Offers libraries for machine learning (MLlib), graph processing (GraphX), and more.
• Integration: Seamlessly integrates with Hadoop, HDFS, and other data sources.

SPARK COMPONENTS OVERVIEW

• Spark Core: Foundation of Spark, providing basic functionality like task scheduling, memory management, and fault recovery.
• Spark SQL: Enables SQL querying and the DataFrame API for structured data processing.
• Spark Streaming: Enables processing of real-time data streams.
• MLlib: Library for machine learning tasks.
• GraphX: Library for graph computation.
• Cluster Managers: Supports various cluster managers like Apache Mesos, Hadoop YARN, and Kubernetes.

GETTING STARTED WITH SPARK

INSTALLATION AND SETUP

Apache Spark can be installed on various platforms. Here's a basic guide for setting it up on a local machine.

USING SPARK ON A LOCAL MACHINE

• Download the latest Spark version from the official website.
• Extract the downloaded archive.
• Set up environment variables, such as SPARK_HOME and PATH.
• Configure spark-defaults.conf for basic settings (see the sketch below).
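
A sample spark-defaults.conf (found under $SPARK_HOME/conf, created by copying spark-defaults.conf.template) might look like the following; the property names are standard Spark settings, but the values are only illustrative:

# $SPARK_HOME/conf/spark-defaults.conf -- illustrative values only
spark.master            local[*]
spark.driver.memory     2g
spark.executor.memory   4g
spark.serializer        org.apache.spark.serializer.KryoSerializer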
INITIALIZING SPARK

To use Spark in your application, initialize a SparkSession:

import org.apache.spark.sql.SparkSession;

public class SparkApp {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkApp")
                .master("local[*]") // Use all available cores
                .getOrCreate();

        // Your Spark application code here

        spark.stop(); // Stop the SparkSession
    }
}


.master("local[*]") // JavaRDD<Integer> unionRDD =


Use all available cores rdd1.union(rdd2);
.getOrCreate();

// Your Spark application ACTIONS ON RDDS


code here Actions return values to the driver program or
write data to an external storage system
spark.stop(); // Stop the
SparkSession
} long count = rdd.count();
} int firstElement = rdd.first();
List<Integer> collectedData =
rdd.collect();
RESILIENTDISTRIBUTED
RESILIENT DISTRIBUTEDDATASETS
DATASETS rdd.saveAsTextFile("output.txt");
(RDDS)
(RDDS)

RDD PERSISTENCE
CREATING RDDS
Caching RDDs in memory can speed up iterative
You can create RDDs from existing data or by
algorithms
parallelizing a collection

import rdd.persist(StorageLevel.MEMORY_ONLY
org.apache.spark.api.java.JavaRDD; ());
import org.apache.spark.SparkConf; rdd.unpersist(); // Remove from
import memory
org.apache.spark.SparkContext;
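
Reading from storage is the usual "existing data" route; a minimal sketch (the path is a placeholder):

JavaRDD<String> lines = sc.textFile("input.txt"); // One element per line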
TRANSFORMATIONS ON RDDS

Transformations create a new RDD from an existing one. They are lazily evaluated: no work happens until an action is called.

JavaRDD<Integer> squaredRDD = rdd.map(x -> x * x);
JavaRDD<Integer> filteredRDD = rdd.filter(x -> x % 2 == 0);
JavaRDD<Integer> unionRDD = rdd1.union(rdd2); // rdd1 and rdd2 are two existing RDDs of the same type

ACTIONS ON RDDS

Actions return values to the driver program or write data to an external storage system:

long count = rdd.count();
int firstElement = rdd.first();
List<Integer> collectedData = rdd.collect(); // Beware: collect() pulls the whole RDD to the driver
rdd.saveAsTextFile("output.txt"); // Writes a directory of part files


RDD PERSISTENCE

Caching RDDs in memory can speed up iterative algorithms:

import org.apache.spark.storage.StorageLevel;

rdd.persist(StorageLevel.MEMORY_ONLY());
rdd.unpersist(); // Remove from memory

STRUCTURED APIS: DATAFRAMES AND DATASETS

CREATING DATAFRAMES

DataFrames can be created from various data sources:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("DataFrameExample")
        .master("local[*]")
        .getOrCreate();

Dataset<Row> df = spark.read().json("data.json");

BASIC DATAFRAME OPERATIONS

Perform various operations on DataFrames:

df.show();
df.printSchema();
df.select("name").show();
df.filter(df.col("age").gt(21)).show();
df.groupBy("age").count().show();

AGGREGATIONS AND GROUPING

Perform aggregations on DataFrames:

import org.apache.spark.sql.functions;

df.groupBy("age").agg(functions.avg("salary"), functions.max("bonus")).show();

WORKING WITH DATASETS

Datasets offer strongly typed, object-oriented programming interfaces:

import org.apache.spark.sql.Encoders;

Dataset<Person> people = df.as(Encoders.bean(Person.class));
people.filter(person -> person.getAge() > 25).show();
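
Encoders.bean expects a JavaBean: a public class with a no-argument constructor and getters/setters. The Person class assumed above might look like this minimal sketch:

public class Person implements java.io.Serializable {
    private String name;
    private int age;

    public Person() {} // No-arg constructor required by Encoders.bean

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}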
SPARK SQL

REGISTERING AND QUERYING TABLES

Register DataFrames as temporary views for SQL querying:

df.createOrReplaceTempView("employees");

RUNNING SQL QUERIES

Execute SQL queries on registered tables:

Dataset<Row> results = spark.sql("SELECT name, age FROM employees WHERE age > 25");
results.show();

DATAFRAME TO RDD CONVERSION

Convert DataFrames to RDDs when needed:

JavaRDD<Row> rddFromDF = df.toJavaRDD(); // Equivalent to df.rdd().toJavaRDD()

STREAM PROCESSING WITH SPARK

DSTREAM CREATION

Create a DStream for stream processing:

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

JavaStreamingContext streamingContext =
        new JavaStreamingContext(sparkConf, Durations.seconds(1)); // sparkConf is an existing SparkConf for this app
JavaReceiverInputDStream<String> lines =
        streamingContext.socketTextStream("localhost", 9999);

TRANSFORMATIONS ON DSTREAMS

Perform transformations on DStreams:

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;

JavaDStream<String> words =
        lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
JavaPairDStream<String, Integer> wordCounts = words
        .mapToPair(s -> new Tuple2<>(s, 1))
        .reduceByKey((a, b) -> a + b);

b) -> a + b); BUILDING AND EVALUATING MODELS

Build and evaluate a machine learning model

OUTPUT OPERATIONS FOR DSTREAMS

Perform output operations on DStreams StringIndexer labelIndexer = new


StringIndexer()
.setInputCol("label")
wordCounts.print(); .setOutputCol("indexedLabel");
wordCounts.saveAsTextFiles("wordcoun LogisticRegression lr = new
t", "txt"); LogisticRegression()
MACHINE LEARNING WITH MLLIB

MLLIB OVERVIEW

MLlib is a powerful library for machine learning tasks. Its DataFrame-based API lives under org.apache.spark.ml:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;

DATA PREPARATION

Prepare data for machine learning:

Dataset<Row> rawData = spark.read()
        .option("header", "true") // Assumes the CSV header provides columns like feature1, feature2, label
        .csv("data.csv");
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"feature1", "feature2"})
        .setOutputCol("features");
Dataset<Row> assembledData = assembler.transform(rawData);

BUILDING AND EVALUATING MODELS

Build and evaluate a machine learning model:

StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol("indexedLabel");
LogisticRegression lr = new LogisticRegression()
        .setLabelCol("indexedLabel") // Train on the indexed label produced above
        .setMaxIter(10)
        .setRegParam(0.01);
Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{labelIndexer, assembler, lr});
PipelineModel model = pipeline.fit(trainingData);
Dataset<Row> predictions = model.transform(testData);
BinaryClassificationEvaluator evaluator = new BinaryClassificationEvaluator()
        .setLabelCol("indexedLabel")
        .setRawPredictionCol("rawPrediction");
double areaUnderROC = evaluator.evaluate(predictions); // Default metric is area under the ROC curve
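
The snippet assumes trainingData and testData already exist. A common way to produce them from the assembled data is a random split (the 80/20 ratio and seed are illustrative):

Dataset<Row>[] splits = assembledData.randomSplit(new double[]{0.8, 0.2}, 42L);
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];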
GRAPH PROCESSING WITH GRAPHX

CREATING GRAPHS

Create a graph in GraphX (note that GraphX is primarily a Scala API):

import org.apache.spark.graphx.Graph;
import org.apache.spark.graphx.util.GraphGenerators;

// sparkContext, numVertices, numEPart, mu, and sigma are assumed to be defined by the caller
Graph<Object, Object> graph = GraphGenerators.logNormalGraph(
        sparkContext, numVertices, numEPart, mu, sigma);


VERTEX AND EDGE RDDS

Access vertex and edge RDDs:

import org.apache.spark.graphx.EdgeRDD;
import org.apache.spark.graphx.VertexRDD;

VertexRDD<Object> vertices = graph.vertices();
EdgeRDD<Object> edges = graph.edges();
GRAPH ALGORITHMS

Apply graph algorithms on the graph:

import org.apache.spark.graphx.lib.PageRank;

Graph<Object, Object> pageRankGraph =
        PageRank.runUntilConvergence(graph, tolerance); // tolerance is a convergence threshold you choose, e.g. 0.001

CLUSTER COMPUTING AND DEPLOYMENT

CLUSTER MANAGER SELECTION

Choose a cluster manager for Spark deployment:

// Set Spark to run on Mesos
SparkConf conf = new SparkConf()
        .setMaster("mesos://mesos-master:5050")
        .setAppName("SparkApp");

// Or set Spark to run on YARN (pick one; an application needs a single SparkConf)
SparkConf conf = new SparkConf()
        .setMaster("yarn")
        .setAppName("SparkApp");

DEPLOYING SPARK ON CLUSTERS

Submit Spark applications to the cluster with the spark-submit script:

$ spark-submit --class com.example.SparkApp --master yarn --deploy-mode cluster myApp.jar

PERFORMANCE TUNING AND OPTIMIZATION

MEMORY MANAGEMENT

Optimize memory usage in Spark:

// Set memory configurations
// (spark.driver.memory must be set before the driver JVM starts,
// e.g. via spark-submit or spark-defaults.conf)
conf.set("spark.driver.memory", "2g");
conf.set("spark.executor.memory", "4g");

// Enable off-heap memory
conf.set("spark.memory.offHeap.enabled", "true");
conf.set("spark.memory.offHeap.size", "2g");

PARALLELISM AND PARTITIONS

Adjust parallelism and partitions for better performance:

// Set the number of executor cores
conf.set("spark.executor.cores", "4");

// Repartition RDDs for balanced workloads (triggers a full shuffle);
// use coalesce(n) to reduce partitions without a full shuffle
JavaRDD<Integer> repartitionedRDD = rdd.repartition(10);

CACHING STRATEGIES

Cache RDDs and DataFrames for repeated computations:


rdd.persist(StorageLevel.MEMORY_AND_DISK());
df.cache(); // For DataFrames, cache() uses MEMORY_AND_DISK by default

INTERACTING WITH EXTERNAL DATA SOURCES

READING AND WRITING DATA

Read and write data from/to external sources:

Dataset<Row> csvData = spark.read().csv("data.csv");
csvData.write().parquet("data.parquet");

SUPPORTED FILE FORMATS

Spark supports various file formats:

Dataset<Row> parquetData = spark.read().parquet("data.parquet");

CONNECTING TO DATABASES

Connect to databases using JDBC:

Dataset<Row> jdbcData = spark.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://host:port/database")
        .option("dbtable", "table")
        .option("user", "username")
        .option("password", "password")
        .load();
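
Writing back over JDBC works symmetrically through the DataFrameWriter; a sketch reusing the same placeholder connection options (output_table is hypothetical):

jdbcData.write()
        .format("jdbc")
        .option("url", "jdbc:mysql://host:port/database")
        .option("dbtable", "output_table")
        .option("user", "username")
        .option("password", "password")
        .mode("append") // Or "overwrite"
        .save();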
MONITORING AND DEBUGGING

SPARK UI

Monitor application progress using the Spark UI:

// Access the Spark UI from the driver program's URL
http://driver-node:4040

LOGGING AND DEBUGGING

Use logging for debugging:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

Logger.getLogger("org").setLevel(Level.ERROR); // Silence Spark's own INFO chatter
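
The same Log4j API can carry your application's own messages; a minimal sketch using the SparkApp class from earlier:

Logger logger = Logger.getLogger(SparkApp.class);
logger.info("Spark job started");
logger.error("Something went wrong");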


INTEGRATION WITH OTHER TOOLS

SPARK AND HADOOP

Spark can work seamlessly with Hadoop:

// Use HDFS file paths
JavaRDD<String> lines =
        sparkContext.textFile("hdfs://namenode:8020/input.txt");

SPARK AND APACHE KAFKA

Integrate Spark with Kafka for real-time data processing:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
        KafkaUtils.createDirectStream(
                streamingContext,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(topics, kafkaParams)
        );
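
The topics and kafkaParams arguments are assumed to be defined elsewhere; a typical setup for the kafka010 connector (broker address, group id, and topic name are placeholders) looks like:

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.serialization.StringDeserializer;

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "kafka-broker:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "spark-consumer-group");
kafkaParams.put("auto.offset.reset", "latest");

Collection<String> topics = Arrays.asList("events");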

SPARK AND JUPYTER NOTEBOOKS

Use Jupyter Notebooks for interactive data exploration with Spark:

# Use PySpark in a Jupyter Notebook
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkApp").getOrCreate()

COMMONLY USED LIBRARIES WITH SPARK

Library                      Description
Spark NLP                    Natural Language Processing library for Spark.
Spark Cassandra Connector    Interact with Apache Cassandra.
Spark BigDL                  Distributed deep learning library for Spark.
Spark GATK                   Genome Analysis Toolkit library for Spark.
Spark TensorFrames           Library for TensorFlow integration with Spark.

JCG delivers over 1 million pages each month to more than 700K software developers, architects and decision makers. JCG offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code and more.

CHEATSHEET FEEDBACK WELCOME
[email protected]

SPONSORSHIP OPPORTUNITIES
[email protected]

Copyright © 2014 Exelixis Media P.C. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
