# [ Apache Spark with Scala ] {CheatSheet}

1. Spark Session and Context

● Creating Spark Session: val spark = SparkSession.builder.appName("SparkApp").getOrCreate()
● Accessing Spark Context: val sc = spark.sparkContext
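
A minimal sketch of a complete entry point built from the two calls above; `master("local[*]")` is an assumption for local testing and would be dropped when submitting to a cluster:

```scala
import org.apache.spark.sql.SparkSession

object SparkApp {
  def main(args: Array[String]): Unit = {
    // Build (or reuse) a session; local[*] uses all local cores.
    val spark = SparkSession.builder
      .appName("SparkApp")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext   // low-level context for RDD work
    println(s"Running Spark ${spark.version} with ${sc.defaultParallelism} default partitions")

    spark.stop()                  // release resources when done
  }
}
```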

2. Data Loading and Writing

● Reading a CSV File: val df = spark.read.format("csv").option("header", "true").load("path/to/csv")
● Writing DataFrame to Parquet: df.write.parquet("path/to/output")
● Reading JSON File: val df = spark.read.json("path/to/json")
● Writing DataFrame to JSON: df.write.json("path/to/output")
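
A hedged end-to-end sketch of the read/write calls above; the paths are placeholders and the `spark` session from section 1 is assumed:

```scala
// Placeholder paths; inferSchema is optional but convenient for exploration.
val csvDf = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/input.csv")

csvDf.write.mode("overwrite").parquet("data/out_parquet")

val jsonDf = spark.read.json("data/input.json")
jsonDf.write.mode("overwrite").json("data/out_json")
```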

3. DataFrame Operations

● Selecting Columns: df.select("column1", "column2")
● Filtering Rows: df.filter($"column" > value)
● Adding a New Column: df.withColumn("newColumn", $"existingColumn" + 1)
● Renaming a Column: df.withColumnRenamed("oldName", "newName")
● Dropping a Column: df.drop("column")
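
The operations above chain naturally; a sketch assuming a DataFrame `df` with a string `column1` and a numeric `column2`:

```scala
import spark.implicits._   // enables the $"col" syntax

val transformed = df
  .select("column1", "column2")
  .filter($"column2" > 10)                      // keep rows above a threshold
  .withColumn("column2Plus1", $"column2" + 1)   // derive a new column
  .withColumnRenamed("column1", "name")         // rename in place
  .drop("column2")                              // drop what is no longer needed

transformed.show(5)
```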

4. Aggregation Functions

● Group By and Aggregate: df.groupBy("column").agg(sum("otherColumn"))
● Calculating Average: df.groupBy("column").avg()
● Calculating Maximum: df.groupBy("column").max()
● Calculating Minimum: df.groupBy("column").min()
● Counting Values: df.groupBy("column").count()
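
Several aggregates can be computed in a single `agg` call; a sketch assuming `df` has a grouping column `column` and a numeric `otherColumn`:

```scala
import org.apache.spark.sql.functions._

val stats = df.groupBy("column").agg(
  sum("otherColumn").as("total"),
  avg("otherColumn").as("mean"),
  max("otherColumn").as("highest"),
  min("otherColumn").as("lowest"),
  count(lit(1)).as("rows")
)
stats.show()
```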

5. Join Operations

● Inner Join: df1.join(df2, df1("id") === df2("id"))
● Left Outer Join: df1.join(df2, df1("id") === df2("id"), "left_outer")
● Right Outer Join: df1.join(df2, df1("id") === df2("id"), "right_outer")
● Full Outer Join: df1.join(df2, df1("id") === df2("id"), "full_outer")
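
A self-contained sketch of the join types with two tiny hypothetical frames; joining on Seq("id") keeps a single id column in the output:

```scala
import spark.implicits._

val customers = Seq((1, "alice"), (2, "bob"), (3, "carol")).toDF("id", "name")
val orders    = Seq((1, 250.0), (1, 80.0), (3, 40.0)).toDF("id", "amount")

val inner = customers.join(orders, customers("id") === orders("id"))  // matching ids only
val left  = customers.join(orders, Seq("id"), "left_outer")           // keep every customer
val full  = customers.join(orders, Seq("id"), "full_outer")           // keep everything

left.show()
```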

6. RDD Operations

● Creating an RDD: val rdd = sc.parallelize(Seq(1, 2, 3))
● Transforming with map: val rdd2 = rdd.map(x => x * x)
● Filtering Data: val filteredRdd = rdd.filter(x => x > 1)
● FlatMap Operation: val flatRdd = rdd.flatMap(x => Seq(x, x*2))
● Reducing Elements: val sum = rdd.reduce((x, y) => x + y)
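
The RDD transformations above are lazy; nothing runs until an action such as `reduce` or `collect` is called. A small sketch:

```scala
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

val squares   = numbers.map(x => x * x)             // 1, 4, 9, 16, 25
val filtered  = numbers.filter(_ > 2)               // 3, 4, 5
val flattened = numbers.flatMap(x => Seq(x, x * 2)) // 1, 2, 2, 4, 3, 6, ...
val total     = numbers.reduce(_ + _)               // 15 (action: triggers execution)

println(s"total = $total, squares = ${squares.collect().mkString(", ")}")
```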

7. Working with Key-Value Pairs

● Creating Pair RDD: val pairRdd = rdd.map(x => (x, x*2))
● Reducing by Key: val reduced = pairRdd.reduceByKey((x, y) => x + y)
● Grouping by Key: val grouped = pairRdd.groupByKey()
● Sorting by Key: val sorted = pairRdd.sortByKey()
● Map Values: val mappedValues = pairRdd.mapValues(x => x + 1)
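
A classic word-count-style sketch combining these pair-RDD operations; the input is a made-up list of words:

```scala
val words = sc.parallelize(Seq("spark", "scala", "spark", "rdd", "scala", "spark"))
val pairs = words.map(w => (w, 1))

val counts  = pairs.reduceByKey(_ + _)    // (spark,3), (scala,2), (rdd,1)
val sorted  = counts.sortByKey()          // alphabetical by key
val doubled = counts.mapValues(_ * 2)     // transform values, keep keys

sorted.collect().foreach { case (word, n) => println(s"$word -> $n") }
```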

8. Data Partitioning

● Repartitioning RDD: val repartitionedRdd = rdd.repartition(4)
● Coalescing RDD: val coalescedRdd = rdd.coalesce(2)

9. SQL Queries on DataFrames

● Creating Temp View: df.createOrReplaceTempView("tableView")
● Running SQL Query: val result = spark.sql("SELECT * FROM tableView WHERE column > value")
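
A sketch of the view-then-query pattern; the column names and threshold are placeholders:

```scala
df.createOrReplaceTempView("tableView")

val result = spark.sql(
  """SELECT column, COUNT(*) AS cnt
    |FROM tableView
    |WHERE otherColumn > 100
    |GROUP BY column
    |ORDER BY cnt DESC""".stripMargin)

result.show()
```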

10. UDFs and UDAFs

● Defining UDF: val myUDF = udf((x: Int) => x * 2)
● Using UDF in DataFrame: df.withColumn("newCol", myUDF($"column"))
● Registering UDF for SQL: spark.udf.register("myUDF", myUDF)
● Using UDAF: val myUDAF = new MyUDAF();
df.groupBy("column").agg(myUDAF($"otherColumn"))

11. Window Functions

● Using Window Function: val windowSpec = Window.partitionBy("column").orderBy("otherColumn"); df.withColumn("rank", rank().over(windowSpec))

12. Handling Missing and Null Values

● Filling Null Values: df.na.fill(0)
● Dropping Rows with Null: df.na.drop()
● Replacing Values: df.na.replace("column", Map("oldValue" -> "newValue"))
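
The `na` functions can also be targeted at specific columns; a sketch with hypothetical `amount`, `city`, and `id` columns:

```scala
// Fill numeric nulls with 0 and string nulls with "unknown", per column.
val filled = df.na.fill(Map("amount" -> 0, "city" -> "unknown"))

// Drop rows where any of the listed columns is null.
val cleaned = df.na.drop(Seq("id", "amount"))

// Replace sentinel values in a single column.
val replaced = df.na.replace("city", Map("N/A" -> "unknown"))
```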

13. Handling JSON and Complex Data Types

● Extracting JSON Fields: df.withColumn("extractedField", get_json_object($"jsonColumn", "$.fieldName"))
● Working with Structs: df.select($"structColumn.fieldName")

14. Reading and Writing Data from Various Sources

● Reading from Parquet: val df = spark.read.parquet("path/to/parquet")
● Writing to CSV: df.write.format("csv").save("path/to/output")
● Reading from JDBC: val jdbcDF = spark.read.format("jdbc").option("url",
jdbcUrl).option("dbtable", "tableName").load()
● Writing to JDBC: df.write.format("jdbc").option("url",
jdbcUrl).option("dbtable", "tableName").save()

15. Machine Learning with MLlib

● Vector Assembler: val assembler = new VectorAssembler().setInputCols(Array("col1", "col2")).setOutputCol("features")
● Standard Scaler: val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures").fit(df)
● Linear Regression Model: val lr = new LinearRegression(); val model =
lr.fit(df)
● KMeans Clustering: val kmeans = new KMeans().setK(2); val model =
kmeans.fit(df)
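
These stages are usually wired into a Pipeline so the fitted model carries its own feature preparation; a sketch assuming a hypothetical `trainingDf` with numeric `col1`, `col2` and a `label` column:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2"))
  .setOutputCol("features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

val lr = new LinearRegression()
  .setFeaturesCol("scaledFeatures")
  .setLabelCol("label")

// fit() runs the stages in order and returns a reusable PipelineModel.
val model = new Pipeline().setStages(Array(assembler, scaler, lr)).fit(trainingDf)
model.transform(trainingDf).select("label", "prediction").show(5)
```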

16. Streaming Data

● Structured Streaming from Socket: val stream = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
● Writing Streaming Data: stream.writeStream.format("console").start()
● Kafka Source for Streaming: val kafkaStream =
spark.readStream.format("kafka").option("kafka.bootstrap.servers",
"localhost:9092").option("subscribe", "topic").load()

● Writing to Kafka in Streaming:
stream.writeStream.format("kafka").option("kafka.bootstrap.servers",
"localhost:9092").option("topic", "outputTopic").start()

17. Performance Tuning

● Broadcast Variables: val broadcastVar = sc.broadcast(Array(1, 2, 3))
● Accumulators: val accumulator = sc.longAccumulator("MyAccumulator")
● Caching Data: df.cache()
● Checkpointing RDD: rdd.checkpoint()

18. Advanced DataFrame Transformations

● Pivoting Data:
df.groupBy("column").pivot("pivotColumn").agg(sum("value"))
● Explode Array Column: df.withColumn("exploded", explode($"arrayColumn"))
● Rollup: df.rollup("col1", "col2").agg(sum("value"))
● Cube: df.cube("col1", "col2").agg(sum("value"))

19. Handling Large Datasets

● Broadcast Join Hint: df1.join(broadcast(df2), Seq("id"), "inner")
● Avoiding a Full Shuffle with Coalesce: df.coalesce(1)
● Repartitioning for Parallelism: df.repartition(10)

20. Dealing with Text Data

● Regular Expression with rlike: df.filter($"column".rlike("regex"))
● Splitting Strings: df.withColumn("splitCol", split($"stringCol", "delimiter"))
● Concatenating Strings: df.withColumn("concatenated", concat_ws("-", $"col1", $"col2"))

21. Working with Dates and Times

● Current Date and Timestamp: df.withColumn("currentDate", current_date()).withColumn("currentTimestamp", current_timestamp())
● Date Formatting: df.withColumn("formattedDate", date_format($"dateCol", "yyyy-MM-dd"))
● Date Arithmetic: df.withColumn("datePlusDays", expr("dateCol + interval 5 days"))

22. Advanced SQL Queries

● Registering DataFrame as a Temp View for SQL: df.createOrReplaceTempView("tempView"); spark.sql("SELECT * FROM tempView WHERE column > value")
● Complex SQL Query: spark.sql("SELECT col1, col2, SUM(col3) FROM tempView GROUP BY col1, col2")

23. Error Handling and Debugging

● Catching Exceptions in DataFrame Operations: Try(df.select("invalidColumn")) match { case Success(result) => result; case Failure(e) => e.printStackTrace() }
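
A slightly fuller sketch with the required imports; `safeSelect` is a hypothetical helper, and analysis errors such as a missing column surface as soon as the plan is built, so `Try` catches them here:

```scala
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: returns None instead of throwing on bad column names.
def safeSelect(df: DataFrame, columns: String*): Option[DataFrame] =
  Try(df.select(columns.map(col): _*)) match {
    case Success(selected) => Some(selected)
    case Failure(e) =>
      println(s"Select failed: ${e.getMessage}")
      None
  }

safeSelect(df, "invalidColumn").foreach(_.show())
```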

24. Interoperability with RDDs and DataFrames

● Converting RDD to DataFrame: val df = rdd.toDF("column1", "column2")
● Converting DataFrame to RDD: val rdd = df.rdd
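
A sketch of the round trip; `toDF` needs `spark.implicits._` in scope, and `df.rdd` yields an `RDD[Row]`:

```scala
import spark.implicits._
import org.apache.spark.sql.Row

val rdd = spark.sparkContext.parallelize(Seq((1, "alice"), (2, "bob")))
val df  = rdd.toDF("id", "name")                   // RDD[(Int, String)] -> DataFrame

val rows: org.apache.spark.rdd.RDD[Row] = df.rdd   // DataFrame -> RDD[Row]
rows.map(r => r.getAs[String]("name")).collect().foreach(println)
```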

25. External Data Sources and Formats

● Reading Data from Avro Files: val df = spark.read.format("avro").load("path/to/avro")
● Writing Data to Avro: df.write.format("avro").save("path/to/output")

26. Spark MLlib: Feature Transformers

● StringIndexer for Categorical Features: val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
● OneHotEncoder for Categorical Encoding: val encoder = new
OneHotEncoder().setInputCol("index").setOutputCol("vector")

27. Spark MLlib: Classification and Regression

● Decision Tree Classifier: val dt = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("features")
● Linear Regression: val lr = new
LinearRegression().setMaxIter(10).setRegParam(0.3)

28. Spark MLlib: Clustering

● KMeans Clustering: val kmeans = new
KMeans().setK(3).setFeaturesCol("features")
● Gaussian Mixture Model: val gmm = new
GaussianMixture().setK(3).setFeaturesCol("features")

29. Spark MLlib: Model Evaluation

● Binary Classification Evaluator: val evaluator = new BinaryClassificationEvaluator()
● Regression Evaluator (e.g., RMSE): val regEvaluator = new
RegressionEvaluator().setMetricName("rmse")

30. Spark GraphX: Graph Processing

● Creating a Graph from RDDs: val graph = Graph(verticesRDD, edgesRDD)
● Applying PageRank Algorithm: val ranks = graph.pageRank(0.0001).vertices

31. Spark GraphX: Graph Algorithms

● Connected Components: val cc = graph.connectedComponents().vertices
● Triangle Counting: val triangles = graph.triangleCount().vertices
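
A self-contained sketch building a tiny hypothetical graph and running the algorithms from the last two sections:

```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph    = Graph(vertices, edges)

val ranks     = graph.pageRank(0.0001).vertices        // (vertexId, rank)
val cc        = graph.connectedComponents().vertices   // (vertexId, smallest id in component)
val triangles = graph.triangleCount().vertices         // (vertexId, triangles through it)

ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
  println(f"$name%-6s rank = $rank%.3f")
}
```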

32. Working with Accumulators

● Creating a Long Accumulator: val accumulator = sc.longAccumulator("MyAccumulator")
● Using Accumulator in RDD Operations: rdd.foreach(x => accumulator.add(x))

33. Configurations and Tuning

● Setting Dynamic Allocation: spark.conf.set("spark.dynamicAllocation.enabled", "true")
● Configuring Serialization: spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
● Note: core settings such as these only take effect when supplied before the application starts (e.g. SparkSession.builder.config(...) or spark-submit --conf); at runtime, spark.conf.set is limited to modifiable SQL configurations.

34. Spark Streaming

● Creating DStream from Socket: val stream = ssc.socketTextStream("localhost", 9999)
● Stateful Transformation in Streaming: val stateDStream =
stream.updateStateByKey(updateFunction)

35. Structured Streaming

● Reading Stream from Kafka: val stream = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "host:port").option("subscribe", "topic").load()
● Writing Stream to Console: val query =
stream.writeStream.outputMode("complete").format("console").start()
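
A sketch of a Kafka-to-console reader; the Kafka integration requires the spark-sql-kafka-0-10 package on the classpath, and the broker address and topic name are placeholders:

```scala
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic")
  .load()

// Kafka delivers binary key/value columns; cast them before processing.
val messages = kafkaStream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

val query = messages.writeStream
  .format("console")
  .outputMode("append")
  .option("truncate", "false")
  .start()

query.awaitTermination()
```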

By: Waleed Mousa
