Apache Spark Cheatsheet (2014)
TABLE OF CONTENTS

Preface
Introduction to Apache Spark
    What is Apache Spark?
    Why Use Apache Spark?
    Key Features of Apache Spark
    Spark Components Overview
Getting Started with Spark
    Installation and Setup
    Using Spark on a Local Machine
    Initializing Spark
Resilient Distributed Datasets (RDDs)
    Creating RDDs
    Transformations on RDDs
    Actions on RDDs
    RDD Persistence
Structured APIs: DataFrames and Datasets
    Creating DataFrames
    Basic DataFrame Operations
    Aggregations and Grouping
    Working with Datasets
Spark SQL
    Registering and Querying Tables
    Running SQL Queries
    DataFrame to RDD Conversion
Stream Processing with Spark
    DStream Creation
    Transformations on DStreams
    Output Operations for DStreams
Machine Learning with MLlib
    MLlib Overview
    Data Preparation
    Building and Evaluating Models
Graph Processing with GraphX
    Creating Graphs
    Vertex and Edge RDDs
    Graph Algorithms
PREFACE

This cheatsheet is designed to provide quick access to the most commonly used Spark components, methods, and practices. Whether you're diving into Spark's resilient distributed datasets (RDDs), exploring the DataFrame and SQL capabilities, or harnessing the advanced machine learning libraries through MLlib, this cheatsheet offers bite-sized code snippets and explanations to facilitate your learning.

SPARK COMPONENTS OVERVIEW

• Spark Core: Foundation of Spark, providing basic functionality like task scheduling, memory management, and fault recovery.
• Spark SQL: Enables SQL querying and a DataFrame API for structured data processing.
• Spark Streaming: Enables processing of real-time data streams.
• MLlib: Library for machine learning tasks.
RESILIENT DISTRIBUTED DATASETS (RDDS)

CREATING RDDS

You can create RDDs from existing data or by parallelizing a collection:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("RDDExample")
    .setMaster("local[*]");
// Use JavaSparkContext (rather than the Scala SparkContext) so that
// parallelize() returns a JavaRDD directly
JavaSparkContext sc = new JavaSparkContext(conf);

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data);
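RDDs can also be created from existing data in external storage; a minimal sketch, assuming an input.txt file exists on the local filesystem:

// Each element of the resulting RDD is one line of the file
JavaRDD<String> lines = sc.textFile("input.txt");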
TRANSFORMATIONS ON RDDS

Transformations create a new RDD from an existing one:

JavaRDD<Integer> squaredRDD = rdd.map(x -> x * x);
JavaRDD<Integer> filteredRDD = rdd.filter(x -> x % 2 == 0);

RDD PERSISTENCE

Caching RDDs in memory can speed up iterative algorithms:

import org.apache.spark.storage.StorageLevel;

rdd.persist(StorageLevel.MEMORY_ONLY());
rdd.unpersist(); // Remove from memory
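Transformations are lazily evaluated; nothing is computed until an action is invoked. A minimal sketch of two common actions on the RDDs above:

// Actions trigger computation and return results to the driver
long evenCount = filteredRDD.count();
List<Integer> squares = squaredRDD.collect();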
STRUCTURED APIS: DATAFRAMES AND DATASETS

CREATING DATAFRAMES

DataFrames can be created from various data sources:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("DataFrameExample")
    .master("local[*]")
    .getOrCreate();

Dataset<Row> df = spark.read().json("data.json");
df.show();
df.printSchema();
df.select("name").show();
df.filter(df.col("age").gt(21)).show();
df.groupBy("age").count().show();
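The same reader handles other sources; a short sketch, assuming a data.csv file with a header row (the option values shown are illustrative):

// inferSchema asks Spark to guess column types instead of defaulting to strings
Dataset<Row> csvDf = spark.read()
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data.csv");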
WORKING WITH DATASETS

Datasets offer strongly-typed, object-oriented programming interfaces:

import org.apache.spark.sql.Encoders;

// Person is a JavaBean class with name and age fields
Dataset<Person> people = df.as(Encoders.bean(Person.class));
people.filter(person -> person.getAge() > 25).show();
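Datasets can also be built directly from Java objects; a minimal sketch, assuming the Person bean has a matching constructor:

// Encoders.bean() supplies the schema for the typed Dataset
Dataset<Person> fromObjects = spark.createDataset(
    Arrays.asList(new Person("Alice", 30), new Person("Bob", 22)),
    Encoders.bean(Person.class));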
SPARK SQL

REGISTERING AND QUERYING TABLES
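A DataFrame can be exposed to SQL by registering it as a temporary view; a minimal sketch using the df DataFrame from above (the view name "people" is illustrative):

// Register the DataFrame so it can be referenced from SQL
df.createOrReplaceTempView("people");

// Run a SQL query; the result is again a DataFrame
Dataset<Row> adults = spark.sql(
    "SELECT name, age FROM people WHERE age > 21");
adults.show();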
DATAFRAME TO RDD CONVERSION

Convert DataFrames to RDDs when needed:

JavaRDD<Row> rddFromDF = df.rdd().toJavaRDD();

STREAM PROCESSING WITH SPARK

DSTREAM CREATION

Create an input DStream, here reading text from a TCP socket:

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// sparkConf is a SparkConf configured as in the earlier examples
JavaStreamingContext streamingContext = new JavaStreamingContext(
    sparkConf, Durations.seconds(1));
JavaReceiverInputDStream<String> lines =
    streamingContext.socketTextStream("localhost", 9999);
MACHINE LEARNING WITH MLLIB

DATA PREPARATION

Prepare data for machine learning:

import org.apache.spark.ml.feature.VectorAssembler;

Dataset<Row> rawData = spark.read().csv("data.csv");
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{"feature1", "feature2"})
    .setOutputCol("features");
Dataset<Row> assembledData = assembler.transform(rawData);
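With a features column in place, a model can be fit directly; a minimal clustering sketch using the assembledData from above (k = 2 is illustrative):

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;

KMeans kmeans = new KMeans()
    .setK(2)
    .setFeaturesCol("features");
KMeansModel model = kmeans.fit(assembledData);
model.transform(assembledData).show(); // Adds a prediction column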
GRAPH PROCESSING WITH GRAPHX

CREATING GRAPHS

Create a graph in GraphX:

import org.apache.spark.graphx.Graph;
import org.apache.spark.graphx.VertexRDD;
import org.apache.spark.graphx.util.GraphGenerators;

// sparkContext, numVertices, numEPart, mu, and sigma are assumed to be defined;
// logNormalGraph generates a random graph with a log-normal degree distribution
Graph<Object, Object> graph = GraphGenerators.logNormalGraph(
    sparkContext, numVertices, numEPart, mu, sigma);
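The vertex and edge collections of a graph are exposed as RDDs, and common algorithms ship with GraphX; a short sketch on the generated graph (the pageRank tolerance and reset probability are illustrative):

import org.apache.spark.graphx.EdgeRDD;

VertexRDD<Object> vertices = graph.vertices(); // One entry per vertex
EdgeRDD<Object> edges = graph.edges();         // One entry per edge
System.out.println(vertices.count() + " vertices, " + edges.count() + " edges");

// PageRank until convergence within the given tolerance
Graph<Object, Object> ranks = graph.ops().pageRank(0.001, 0.15);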
DEPLOYMENT AND PERFORMANCE TUNING

Choose a cluster manager for Spark deployment. Adjust parallelism and partitions for better performance.

CACHING STRATEGIES
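Caching applies to DataFrames as well as RDDs; a minimal sketch (the storage level shown is one common choice):

import org.apache.spark.storage.StorageLevel;

df.cache();                                  // Default storage level for DataFrames
rdd.persist(StorageLevel.MEMORY_AND_DISK()); // Spill to disk when memory is tight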
INTERACTING WITH EXTERNAL DATA SOURCES

Spark supports various file formats and can work seamlessly with Hadoop.
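A short sketch of reading and writing a few common formats (the paths are illustrative; HDFS paths work the same way as local ones):

Dataset<Row> parquetDf = spark.read().parquet("data.parquet");
parquetDf.write()
    .mode("overwrite")            // Replace existing output
    .json("out/json");            // Also: .parquet(...), .csv(...), .orc(...)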
CONNECTING TO DATABASES

Connect to databases using JDBC:
Dataset<Row> jdbcData = spark.read()
    .format("jdbc")
    .option("url", "jdbc:mysql://host:port/database")
    .option("dbtable", "table")
    .option("user", "username")
    .option("password", "password")
    .load();
SPARK AND APACHE KAFKA

Integrate Spark with Kafka for real-time data processing:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;

// topics is a Collection<String> and kafkaParams a Map<String, Object>
// holding the consumer configuration (bootstrap servers, group id, ...)
JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
    KafkaUtils.createDirectStream(
        streamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.Subscribe(topics, kafkaParams)
    );

MONITORING AND DEBUGGING

SPARK UI

Monitor application progress using the Spark UI.

LOGGING AND DEBUGGING

Use logging for debugging.
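A minimal logging sketch (the WARN level is illustrative); for local runs the Spark UI is served at http://localhost:4040 by default:

// Raise the log threshold to cut down on console noise
spark.sparkContext().setLogLevel("WARN");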
COMMONLY USED LIBRARIES WITH SPARK

Library     Description
Copyright © 2014 Exelixis Media P.C. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.