Apache Spark with Scala - Resilient Distributed Dataset
Last Updated : 02 Sep, 2021
In the modern world, we deal with huge datasets every day, and data is growing even faster than processing speeds. Computations on such large data are therefore usually performed on distributed systems. A distributed system consists of a cluster of nodes (networked computers) that run processes in parallel and communicate with each other when needed.
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. On top of this engine, Spark offers higher-level tools such as Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing. In this article, we will be learning Apache Spark (version 2.x) using Scala.
Some basic concepts:
- RDD (Resilient Distributed Dataset) - An immutable, distributed collection of objects. The dataset is divided into logical partitions, which can be processed in parallel across the nodes of the cluster (see the short sketch after this list).
- SparkSession - The entry point to programming Spark with the Dataset and DataFrame API.
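To make the immutability point concrete, here is a minimal, self-contained sketch (the session name and appName below are illustrative, not from the article): a transformation such as map returns a new RDD and leaves the original one untouched.
Scala
// Importing SparkSession
import org.apache.spark.sql.SparkSession

// Hypothetical local session used only for this sketch
val spark = SparkSession.builder()
  .appName("RDD Immutability Sketch")
  .master("local").getOrCreate()

// Distributing a small collection as an RDD
val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3))

// map is a transformation: it returns a new RDD
// and leaves 'numbers' unchanged
val doubled = numbers.map(_ * 2)

println(numbers.collect().mkString(", "))   // 1, 2, 3 (original unchanged)
println(doubled.collect().mkString(", "))   // 2, 4, 6

spark.stop()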
We will be using the Scala IDE only for demonstration purposes. A dedicated Spark environment is required to actually run the code below.
Let's create our first RDD in Spark.
Scala
// Importing SparkSession
import org.apache.spark.sql.SparkSession
// Creating SparkSession object
val sparkSession = SparkSession.builder()
.appName("My First Spark Application")
.master("local").getOrCreate()
// Getting the SparkContext from the SparkSession
val sparkContext = sparkSession.sparkContext
// Creating an RDD
val intArray = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
// The parallelize method distributes the data into partitions.
// It optionally takes an integer argument that specifies
// the number of partitions. Here we are using 3 partitions.
val intRDD = sparkContext.parallelize(intArray, 3)
// Printing number of partitions
println(s"Number of partitions in intRDD : ${intRDD.partitions.size}")
// Printing first element of RDD
println(s"First element in intRDD : ${intRDD.first}")
// Creating a string from the RDD
// take(n) fetches the first n elements of the RDD
// and returns them as an Array.
// The Array is then converted to a string using
// Scala's mkString function.
val strFromRDD = intRDD.take(intRDD.count.toInt).mkString(", ")
println(s"String from intRDD : ${strFromRDD}")
// Printing contents of RDD
// collect function is used to retrieve all the data in an RDD.
println("Printing intRDD: ")
intRDD.collect().foreach(println)
Output :
Number of partitions in intRDD : 3
First element in intRDD : 1
String from intRDD : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Printing intRDD:
1
2
3
4
5
6
7
8
9
10
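As a short follow-up, here is a minimal sketch that reuses the sparkSession and intRDD created above. It uses glom to inspect how parallelize split the data across the 3 partitions (the exact split may vary), and then applies a couple of common transformations (filter, map) and actions (collect, reduce). The variable names below are illustrative.
Scala
// glom groups the elements of each partition into an Array,
// so printing them shows how the data was partitioned
intRDD.glom().collect()
      .foreach(part => println(part.mkString("[", ", ", "]")))

// Transformations (filter, map) build new RDDs lazily;
// actions (collect, reduce) trigger the actual computation
val evenSquares = intRDD.filter(_ % 2 == 0).map(x => x * x)
println(s"Even squares : ${evenSquares.collect().mkString(", ")}")
println(s"Sum of intRDD : ${intRDD.reduce(_ + _)}")

// Stopping the SparkSession once we are done
sparkSession.stop()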