Apache Spark Cheatsheet (2014)


TABLE OF CONTENTS

Preface
Introduction to Apache Spark
    What is Apache Spark?
    Why Use Apache Spark?
    Key Features of Apache Spark
    Spark Components Overview
Getting Started with Spark
    Installation and Setup
    Using Spark on a Local Machine
    Initializing Spark
Resilient Distributed Datasets (RDDs)
    Creating RDDs
    Transformations on RDDs
    Actions on RDDs
    RDD Persistence
Structured APIs: DataFrames and Datasets
    Creating DataFrames
    Basic DataFrame Operations
    Aggregations and Grouping
    Working with Datasets
Spark SQL
    Registering and Querying Tables
    Running SQL Queries
    DataFrame to RDD Conversion
Stream Processing with Spark
    DStream Creation
    Transformations on DStreams
    Output Operations for DStreams
Machine Learning with MLlib
    MLlib Overview
    Data Preparation
    Building and Evaluating Models
Graph Processing with GraphX
    Creating Graphs
    Vertex and Edge RDDs
    Graph Algorithms


Cluster Computing and Deployment
    Cluster Manager Selection
    Deploying Spark on Clusters
Performance Tuning and Optimization
    Memory Management
    Parallelism and Partitions
    Caching Strategies
Interacting with External Data Sources
    Reading and Writing Data
    Supported File Formats
    Connecting to Databases
Monitoring and Debugging
    Spark UI
    Logging and Debugging
Integration with Other Tools
    Spark and Hadoop
    Spark and Apache Kafka
    Spark and Jupyter Notebooks
Commonly Used Libraries with Spark


PREFACE

This cheatsheet is designed to provide quick access to the most commonly used Spark components, methods, and practices. Whether you're diving into Spark's resilient distributed datasets (RDDs), exploring the DataFrame and SQL capabilities, or harnessing the advanced machine learning libraries through MLlib, this cheatsheet offers bite-sized code snippets and explanations to facilitate your learning.

INTRODUCTION TO APACHE SPARK

WHAT IS APACHE SPARK?

Apache Spark is an open-source, distributed computing system designed for big data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark's core abstraction is the Resilient Distributed Dataset (RDD), a fault-tolerant collection of elements that can be processed in parallel.

WHY USE APACHE SPARK?

Spark offers significant advantages over traditional MapReduce-based systems, including faster processing speed due to in-memory computation, a wide range of libraries for various data processing tasks, and support for multiple languages such as Java, Scala, Python, and R.

KEY FEATURES OF APACHE SPARK

• Speed: Spark's in-memory processing capability results in faster data processing.
• Ease of Use: Provides high-level APIs in languages like Scala, Python, and Java.
• Versatility: Supports batch processing, interactive queries, streaming, machine learning, and graph processing.
• Fault Tolerance: Recovers lost data using lineage information.
• Advanced Analytics: Offers libraries for machine learning (MLlib), graph processing (GraphX), and more.
• Integration: Seamlessly integrates with Hadoop, HDFS, and other data sources.

SPARK COMPONENTS OVERVIEW

• Spark Core: Foundation of Spark, providing basic functionality like task scheduling, memory management, and fault recovery.
• Spark SQL: Enables SQL querying and the DataFrame API for structured data processing.
• Spark Streaming: Enables processing of real-time data streams.
• MLlib: Library for machine learning tasks.
• GraphX: Library for graph computation.
• Cluster Managers: Supports various cluster managers like Apache Mesos, Hadoop YARN, and Kubernetes.

GETTING STARTED WITH SPARK

INSTALLATION AND SETUP

Apache Spark can be installed on various platforms. Here's a basic guide for setting it up on a local machine.

USING SPARK ON A LOCAL MACHINE

• Download the latest Spark version from the official website.
• Extract the downloaded archive.
• Set up environment variables, such as SPARK_HOME and PATH.
• Configure spark-defaults.conf for basic settings (see the sketch below).
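
A sample spark-defaults.conf (found under $SPARK_HOME/conf, created by copying spark-defaults.conf.template) might look like the following; the property names are standard Spark settings, but the values are only illustrative:

# $SPARK_HOME/conf/spark-defaults.conf -- illustrative values only
spark.master            local[*]
spark.driver.memory     2g
spark.executor.memory   4g
spark.serializer        org.apache.spark.serializer.KryoSerializer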
INITIALIZING SPARK

To use Spark in your application, initialize a SparkSession:

import org.apache.spark.sql.SparkSession;

public class SparkApp {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkApp")
                .master("local[*]") // Use all available cores
                .getOrCreate();

        // Your Spark application code here

        spark.stop(); // Stop the SparkSession
    }
}


.master("local[*]") // JavaRDD<Integer> unionRDD =


Use all available cores rdd1.union(rdd2);
.getOrCreate();

// Your Spark application ACTIONS ON RDDS


code here Actions return values to the driver program or
write data to an external storage system
spark.stop(); // Stop the
SparkSession
} long count = rdd.count();
} int firstElement = rdd.first();
List<Integer> collectedData =
rdd.collect();
RESILIENTDISTRIBUTED
RESILIENT DISTRIBUTEDDATASETS
DATASETS rdd.saveAsTextFile("output.txt");
(RDDS)
(RDDS)

RDD PERSISTENCE
CREATING RDDS
Caching RDDs in memory can speed up iterative
You can create RDDs from existing data or by
algorithms
parallelizing a collection

import rdd.persist(StorageLevel.MEMORY_ONLY
org.apache.spark.api.java.JavaRDD; ());
import org.apache.spark.SparkConf; rdd.unpersist(); // Remove from
import memory
org.apache.spark.SparkContext;
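
Reading from storage is the usual "existing data" route; a minimal sketch (the path is a placeholder):

JavaRDD<String> lines = sc.textFile("input.txt"); // One element per line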
TRANSFORMATIONS ON RDDS

Transformations create a new RDD from an existing one. They are lazily evaluated: no work happens until an action is called.

JavaRDD<Integer> squaredRDD = rdd.map(x -> x * x);
JavaRDD<Integer> filteredRDD = rdd.filter(x -> x % 2 == 0);
JavaRDD<Integer> unionRDD = rdd1.union(rdd2); // rdd1 and rdd2 are two existing RDDs of the same type

ACTIONS ON RDDS

Actions return values to the driver program or write data to an external storage system:

long count = rdd.count();
int firstElement = rdd.first();
List<Integer> collectedData = rdd.collect(); // Beware: collect() pulls the whole RDD to the driver
rdd.saveAsTextFile("output.txt"); // Writes a directory of part files


RDD PERSISTENCE

Caching RDDs in memory can speed up iterative algorithms:

import org.apache.spark.storage.StorageLevel;

rdd.persist(StorageLevel.MEMORY_ONLY());
rdd.unpersist(); // Remove from memory

STRUCTURED APIS: DATAFRAMES AND DATASETS

CREATING DATAFRAMES

DataFrames can be created from various data sources:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("DataFrameExample")
        .master("local[*]")
        .getOrCreate();

Dataset<Row> df = spark.read().json("data.json");

BASIC DATAFRAME OPERATIONS

Perform various operations on DataFrames:

df.show();
df.printSchema();
df.select("name").show();
df.filter(df.col("age").gt(21)).show();
df.groupBy("age").count().show();

AGGREGATIONS AND GROUPING

Perform aggregations on DataFrames:

import org.apache.spark.sql.functions;

df.groupBy("age").agg(functions.avg("salary"), functions.max("bonus")).show();

WORKING WITH DATASETS

Datasets offer strongly typed, object-oriented programming interfaces:

import org.apache.spark.sql.Encoders;

Dataset<Person> people = df.as(Encoders.bean(Person.class));
people.filter(person -> person.getAge() > 25).show();
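
Encoders.bean expects a JavaBean: a public class with a no-argument constructor and getters/setters. The Person class assumed above might look like this minimal sketch:

public class Person implements java.io.Serializable {
    private String name;
    private int age;

    public Person() {} // No-arg constructor required by Encoders.bean

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}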
SPARK SQL

REGISTERING AND QUERYING TABLES

Register DataFrames as temporary views for SQL querying:

df.createOrReplaceTempView("employees");

RUNNING SQL QUERIES

Execute SQL queries on registered tables:

Dataset<Row> results = spark.sql("SELECT name, age FROM employees WHERE age > 25");
results.show();

DATAFRAME TO RDD CONVERSION

Convert DataFrames to RDDs when needed:

JavaRDD<Row> rddFromDF = df.toJavaRDD(); // Equivalent to df.rdd().toJavaRDD()

STREAM PROCESSING WITH SPARK

DSTREAM CREATION

Create a DStream for stream processing:

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

JavaStreamingContext streamingContext =
        new JavaStreamingContext(sparkConf, Durations.seconds(1)); // sparkConf is an existing SparkConf for this app
JavaReceiverInputDStream<String> lines =
        streamingContext.socketTextStream("localhost", 9999);

TRANSFORMATIONS ON DSTREAMS

Perform transformations on DStreams:

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;

JavaDStream<String> words =
        lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
JavaPairDStream<String, Integer> wordCounts = words
        .mapToPair(s -> new Tuple2<>(s, 1))
        .reduceByKey((a, b) -> a + b);

b) -> a + b); BUILDING AND EVALUATING MODELS

Build and evaluate a machine learning model

OUTPUT OPERATIONS FOR DSTREAMS

Perform output operations on DStreams StringIndexer labelIndexer = new


StringIndexer()
.setInputCol("label")
wordCounts.print(); .setOutputCol("indexedLabel");
wordCounts.saveAsTextFiles("wordcoun LogisticRegression lr = new
t", "txt"); LogisticRegression()
MACHINE LEARNING WITH MLLIB

MLLIB OVERVIEW

MLlib is a powerful library for machine learning tasks. Its DataFrame-based API lives under org.apache.spark.ml:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;

DATA PREPARATION

Prepare data for machine learning:

Dataset<Row> rawData = spark.read()
        .option("header", "true") // Assumes the CSV header provides columns like feature1, feature2, label
        .csv("data.csv");
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"feature1", "feature2"})
        .setOutputCol("features");
Dataset<Row> assembledData = assembler.transform(rawData);

BUILDING AND EVALUATING MODELS

Build and evaluate a machine learning model:

StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol("indexedLabel");
LogisticRegression lr = new LogisticRegression()
        .setLabelCol("indexedLabel") // Train on the indexed label produced above
        .setMaxIter(10)
        .setRegParam(0.01);
Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{labelIndexer, assembler, lr});
PipelineModel model = pipeline.fit(trainingData);
Dataset<Row> predictions = model.transform(testData);
BinaryClassificationEvaluator evaluator = new BinaryClassificationEvaluator()
        .setLabelCol("indexedLabel")
        .setRawPredictionCol("rawPrediction");
double areaUnderROC = evaluator.evaluate(predictions); // Default metric is area under the ROC curve
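
The snippet assumes trainingData and testData already exist. A common way to produce them from the assembled data is a random split (the 80/20 ratio and seed are illustrative):

Dataset<Row>[] splits = assembledData.randomSplit(new double[]{0.8, 0.2}, 42L);
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];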
GRAPH PROCESSING WITH GRAPHX

CREATING GRAPHS

Create a graph in GraphX (note that GraphX is primarily a Scala API):

import org.apache.spark.graphx.Graph;
import org.apache.spark.graphx.util.GraphGenerators;

// sparkContext, numVertices, numEPart, mu, and sigma are assumed to be defined by the caller
Graph<Object, Object> graph = GraphGenerators.logNormalGraph(
        sparkContext, numVertices, numEPart, mu, sigma);


VERTEX AND EDGE RDDS

Access vertex and edge RDDs:

import org.apache.spark.graphx.EdgeRDD;
import org.apache.spark.graphx.VertexRDD;

VertexRDD<Object> vertices = graph.vertices();
EdgeRDD<Object> edges = graph.edges();
GRAPH ALGORITHMS

Apply graph algorithms on the graph:

import org.apache.spark.graphx.lib.PageRank;

Graph<Object, Object> pageRankGraph =
        PageRank.runUntilConvergence(graph, tolerance); // tolerance is a convergence threshold you choose, e.g. 0.001

CLUSTER COMPUTING AND DEPLOYMENT

CLUSTER MANAGER SELECTION

Choose a cluster manager for Spark deployment:

// Set Spark to run on Mesos
SparkConf conf = new SparkConf()
        .setMaster("mesos://mesos-master:5050")
        .setAppName("SparkApp");

// Or set Spark to run on YARN (pick one; an application needs a single SparkConf)
SparkConf conf = new SparkConf()
        .setMaster("yarn")
        .setAppName("SparkApp");

DEPLOYING SPARK ON CLUSTERS

Submit Spark applications to the cluster with the spark-submit script:

$ spark-submit --class com.example.SparkApp --master yarn --deploy-mode cluster myApp.jar

PERFORMANCE TUNING AND OPTIMIZATION

MEMORY MANAGEMENT

Optimize memory usage in Spark:

// Set memory configurations
// (spark.driver.memory must be set before the driver JVM starts,
// e.g. via spark-submit or spark-defaults.conf)
conf.set("spark.driver.memory", "2g");
conf.set("spark.executor.memory", "4g");

// Enable off-heap memory
conf.set("spark.memory.offHeap.enabled", "true");
conf.set("spark.memory.offHeap.size", "2g");

PARALLELISM AND PARTITIONS

Adjust parallelism and partitions for better performance:

// Set the number of executor cores
conf.set("spark.executor.cores", "4");

// Repartition RDDs for balanced workloads (triggers a full shuffle);
// use coalesce(n) to reduce partitions without a full shuffle
JavaRDD<Integer> repartitionedRDD = rdd.repartition(10);

CACHING STRATEGIES

Cache RDDs and DataFrames for repeated computations:


rdd.persist(StorageLevel.MEMORY_AND_DISK());
df.cache(); // For DataFrames, cache() uses MEMORY_AND_DISK by default

INTERACTING WITH EXTERNAL DATA SOURCES

READING AND WRITING DATA

Read and write data from/to external sources:

Dataset<Row> csvData = spark.read().csv("data.csv");
csvData.write().parquet("data.parquet");

SUPPORTED FILE FORMATS

Spark supports various file formats:

Dataset<Row> parquetData = spark.read().parquet("data.parquet");

CONNECTING TO DATABASES

Connect to databases using JDBC:

Dataset<Row> jdbcData = spark.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://host:port/database")
        .option("dbtable", "table")
        .option("user", "username")
        .option("password", "password")
        .load();
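
Writing back over JDBC works symmetrically through the DataFrameWriter; a sketch reusing the same placeholder connection options (output_table is hypothetical):

jdbcData.write()
        .format("jdbc")
        .option("url", "jdbc:mysql://host:port/database")
        .option("dbtable", "output_table")
        .option("user", "username")
        .option("password", "password")
        .mode("append") // Or "overwrite"
        .save();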
MONITORING AND DEBUGGING

SPARK UI

Monitor application progress using the Spark UI:

// Access the Spark UI from the driver program's URL
http://driver-node:4040

LOGGING AND DEBUGGING

Use logging for debugging:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

Logger.getLogger("org").setLevel(Level.ERROR); // Silence Spark's own INFO chatter
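
The same Log4j API can carry your application's own messages; a minimal sketch using the SparkApp class from earlier:

Logger logger = Logger.getLogger(SparkApp.class);
logger.info("Spark job started");
logger.error("Something went wrong");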


INTEGRATION WITH OTHER TOOLS

SPARK AND HADOOP

Spark can work seamlessly with Hadoop:

// Use HDFS file paths
JavaRDD<String> lines =
        sparkContext.textFile("hdfs://namenode:8020/input.txt");

SPARK AND APACHE KAFKA

Integrate Spark with Kafka for real-time data processing:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
        KafkaUtils.createDirectStream(
                streamingContext,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(topics, kafkaParams)
        );
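
The topics and kafkaParams arguments are assumed to be defined elsewhere; a typical setup for the kafka010 connector (broker address, group id, and topic name are placeholders) looks like:

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.serialization.StringDeserializer;

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "kafka-broker:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "spark-consumer-group");
kafkaParams.put("auto.offset.reset", "latest");

Collection<String> topics = Arrays.asList("events");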

SPARK AND JUPYTER NOTEBOOKS

Use Jupyter Notebooks for interactive data exploration with Spark:

# Use PySpark in a Jupyter Notebook
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkApp").getOrCreate()

COMMONLY USED LIBRARIES WITH SPARK

Library                      Description
Spark NLP                    Natural Language Processing library for Spark.
Spark Cassandra Connector    Interact with Apache Cassandra.
Spark BigDL                  Distributed deep learning library for Spark.
Spark GATK                   Genome Analysis Toolkit library for Spark.
Spark TensorFrames           Library for TensorFlow integration with Spark.

JCG delivers over 1 million pages each month to more than 700K software developers, architects and decision makers. JCG offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code and more.

CHEATSHEET FEEDBACK WELCOME
[email protected]

SPONSORSHIP OPPORTUNITIES
[email protected]

Copyright © 2014 Exelixis Media P.C. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
