Spark Slides
- Spark is an open-source framework designed for large-scale data processing. It is generally faster and more efficient than disk-based frameworks such as Hadoop MapReduce.
- Spark can process data in-memory, making it much faster than disk-based processing. It includes a variety of high-level APIs and libraries that make it easy to develop and deploy data processing applications, including Spark SQL, MLlib, and GraphX.
- Spark can be run on a cluster of computers, allowing it to scale to handle very large datasets (see the short sketch below).
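To make this concrete, a minimal word-count sketch (it assumes a SparkContext named sc is already available, e.g. in spark-shell, and the two input strings are illustrative):
// Distribute two lines of text and count the words in parallel, in memory.
val lines  = sc.parallelize(Seq("spark is fast", "spark is scalable"))
val counts = lines
  .flatMap(_.split(" "))    // split each line into words
  .map(word => (word, 1))   // pair each word with a count of 1
  .reduceByKey(_ + _)       // sum the counts per word across partitions
counts.collect().foreach(println)   // e.g. (spark,2), (is,2), (fast,1), (scalable,1)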
Hadoop vs. Spark
1. Processing Model: Hadoop is based on the MapReduce programming model, which involves dividing
large datasets into smaller, more manageable chunks and processing them in parallel on a distributed
computing infrastructure. Spark, on the other hand, is based on the Resilient Distributed Datasets
(RDD) programming model, which allows for in-memory processing of data.
2. Performance: Spark is generally faster than Hadoop due to its ability to perform in-memory
processing. In addition, Spark has a more efficient execution engine and can run batch processing,
stream processing, machine learning, and graph processing workloads more efficiently.
3. Ease of Use: Spark is generally considered to be easier to use than Hadoop, as it includes a variety of high-level
APIs and libraries that make it easier to develop and deploy data processing applications.
4. Real-time Processing: Spark is designed to handle real-time processing tasks, whereas Hadoop is primarily used
for batch processing.
5. Scalability: Both Hadoop and Spark are designed to scale horizontally, meaning that they can handle large amounts of data by distributing processing across a cluster of computers. However, Spark typically makes better use of the same cluster for iterative and interactive workloads, because it can keep data in memory and has a more efficient execution engine.
What is a Spark Cluster ?
● A Spark cluster is a group of connected computers (nodes) that work together to process large datasets.
● The processing is coordinated by a Spark driver program that distributes tasks across the nodes.
● The driver program runs on one node, while the other nodes act as worker nodes that process the tasks.
● The data is divided into smaller partitions, which the worker nodes process in parallel to speed up the computation (see the sketch after this list).
● Spark clusters can be deployed on-premises, in a cloud environment, or on a hybrid infrastructure.
● By distributing the processing tasks across multiple nodes, Spark clusters can process large datasets
much faster than traditional single-node systems.
● Spark provides fault tolerance, which means that if a node fails during processing, the failed task can
be automatically redistributed to another node to complete the processing.
● The size of a Spark cluster can range from a few nodes to thousands, depending on the size of the
dataset and performance requirements of the application.
● Spark clusters can be used for a wide range of data processing tasks, such as data transformation,
machine learning, and graph processing.
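A small sketch of how data is split into partitions that are processed in parallel (assuming a SparkContext named sc, as in the examples later in these slides):
// Distribute the numbers 1..1000 across 4 partitions; each partition is
// processed in parallel by an executor running on a worker node.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)
println(rdd.getNumPartitions)   // 4
// mapPartitions runs the function once per partition, so the four sums
// below are computed in parallel across the cluster.
val partitionSums = rdd.mapPartitions(iter => Iterator(iter.sum))
println(partitionSums.collect().mkString(", "))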
What is an RDD in Spark ?
In simple terms, RDD (Resilient Distributed Dataset) in Spark can be thought of as a collection of data that is spread across multiple
computers, which can be processed in parallel. RDDs are fault-tolerant and immutable, meaning they can recover from failures and their
data cannot be changed once created.
Think of RDDs as a big, distributed list of elements that can be operated on in parallel. Each element can be a number, a string, or even
a more complex data structure like an array or a tuple. You can transform RDDs by applying operations like filter, map, and reduce,
which will be executed in parallel across the distributed nodes.
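For example, a short sketch of this idea (again assuming a SparkContext named sc):
// A distributed "list" of numbers spread across the cluster.
val numbers = sc.parallelize(1 to 10)
// Transformations describe new RDDs; the work is executed in parallel.
val evens   = numbers.filter(_ % 2 == 0)   // keep the even numbers
val squares = evens.map(n => n * n)        // square each remaining element
// reduce is an action: it combines the elements and returns a single value.
println(squares.reduce(_ + _))             // 220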
Types of Operations in Spark
Transformation Operations:
These operations transform the input data into some other form of data. Because RDDs are immutable, transformations do not change the original data; instead, they create new RDDs. Transformations are also lazy: they describe the computation but are not executed until an action is called. Here are some examples of transformation operations:
map() - applies a function to each element in the RDD and returns a new RDD with the transformed elements. For example:
new_rdd = rdd.map(lambda x: x + 1)
filter() - applies a predicate function to each element in the RDD and returns a new RDD with the elements that satisfy the predicate. For
example:
new_rdd = rdd.filter(lambda x: x % 2 == 0)
groupByKey() - groups the values for each key in the RDD and returns a new RDD of pairs, where each key is paired with the collection of its values. For example:
rdd = sc.parallelize([(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd')])
new_rdd = rdd.groupByKey()
Action Operations:
These operations trigger the actual computation on the RDD and return a result to the driver program or write data out. Unlike transformations, which are lazy and simply produce new RDDs, actions force Spark to execute the recorded transformations. Here are some examples of action operations:
collect() - returns all the elements of the RDD to the driver program as an array. For example:
rdd = sc.parallelize([1, 2, 3])
result = rdd.collect()  # [1, 2, 3]
count() - returns the number of elements in the RDD. For example:
count = rdd.count()
reduce() - aggregates the elements of the RDD using a function and returns a single value. For example:
result = rdd.reduce(lambda x, y: x + y)
How Does an RDD Function in Spark ?
RDD (Resilient Distributed Datasets) is a fundamental concept in Apache Spark for processing and
analyzing large-scale datasets in a distributed environment. To visualize RDDs in Spark, we can use a graph
representation where each node represents a partition of the RDD and each block represents an element in
the RDD.
Let's take an example of a simple RDD that contains a list of integers:
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data, 2)
In this example, we create an RDD rdd with two partitions using the sc.parallelize method. Each
partition of the RDD will contain a subset of the data. We can visualize this RDD as follows:
              RDD (rdd)
             /         \
        Node 1          Node 2
        /    \         /   |   \
  Block 1  Block 2  Block 3  Block 4  Block 5
In this visualization, the RDD is represented as a node with two child nodes, each representing a
partition of the RDD. Each partition is further divided into blocks, which represent the individual
elements in the RDD.
In this case, the RDD has two partitions, so the first partition (Node 1) contains the first two blocks
(Block 1 and Block 2) and the second partition (Node 2) contains the remaining three blocks (Block 3,
Block 4, and Block 5).
By visualizing the RDD in this way, we can better understand how Spark partitions and distributes the
data across a cluster of machines for parallel processing. It also helps us to optimize our Spark
applications by identifying potential bottlenecks or areas for improvement.
Let's take another example where we have an RDD of words, and we want to visualize how the data is distributed across
partitions and blocks:
val data = Array("apple", "banana", "orange", "grape", "pineapple")
val rdd = sc.parallelize(data, 3)
In this example, we create an RDD rdd with three partitions using the sc.parallelize method. We can visualize this
RDD as follows:
                 RDD (rdd)
            /        |        \
       Node 1      Node 2      Node 3
       /    \      /    \         |
  Block 1  Block 2  Block 3  Block 4  Block 5
  "apple"  "banana" "orange" "grape"  "pineapple"
In this visualization, the RDD is represented as a node with three child nodes, each representing a partition of the RDD.
Each partition is further divided into blocks, which represent the individual elements in the RDD (in this case, words).
The RDD has three partitions, so the first partition (Node 1) contains the first two blocks (Block 1 and Block 2), the
second partition (Node 2) contains the next two blocks (Block 3 and Block 4), and the third partition (Node 3) contains
the final block (Block 5).
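To check the actual layout rather than drawing it by hand, a small sketch using glom(), which collects the elements of each partition into an array (using the rdd from the example above; the exact split Spark chooses may differ slightly from the drawing):
// glom() turns each partition into an array of its elements, so collecting
// it shows exactly which words ended up in which partition.
rdd.glom().collect().zipWithIndex.foreach { case (elems, i) =>
  println(s"Partition $i: ${elems.mkString(", ")}")
}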
Immutability in RDD
Let's take an example to explain the immutability of RDDs in Apache Spark.
Consider the following example where we have an RDD of integers:
val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))
Since RDDs are immutable, we cannot modify the contents of the numbersRDD once it is created.
For example, if we try to change the value of the first element in the RDD, like this:
numbersRDD(0) = 6 // This will result in an error
We will get a compilation error, because RDDs are read-only and their data cannot be changed.
Instead, if we want to modify the data in an RDD, we need to create a new RDD with the modified
data. For example, if we want to add 1 to each element in the numbersRDD, we can create a new
RDD with the modified data, like this:
val incrementedRDD = numbersRDD.map(x => x + 1)
This will create a new RDD called incrementedRDD with the modified data, where each element is
incremented by 1. The original numbersRDD remains unchanged.
Resilience in RDD
RDDs are resilient to node failures, which means they can recover from failures and continue processing data. This is achieved through the use of lineage, which is a record of how the RDD was created from other RDDs.
Consider the following example, where we have an RDD of integers:
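A minimal sketch of such an RDD, assuming two partitions as described below:
val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)   // stored in two partitions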
Let's say we have a cluster of three machines (M1, M2, and M3), and we store the numbersRDD across two partitions, with one partition
on M1 and the other partition on M2. If one of the machines (say, M2) fails, Spark can use the lineage information to recover the lost
partition and continue processing the data.
Here's how it works: When we create the numbersRDD, Spark keeps track of how it was created, and stores this information in the
lineage. The lineage includes the transformations and dependencies that were used to create the RDD, as well as the location of the
parent RDDs. In this example, the lineage for numbersRDD would include the parallelize transformation and the location of the two
partitions.
If M2 fails, the partition on M2 would be lost, and Spark would not be able to access the data in that partition. However, because the lineage records how the RDD was created (here, the parallelize step), Spark can recreate the lost partition by re-executing that step on another available node. This ensures that the data is still available for processing, even if one of the nodes fails.
What is Lineage in an RDD ?
Lineage is a record of how an RDD (Resilient Distributed Dataset) was created from other RDDs through transformations. It is a fundamental concept in Spark's fault tolerance mechanism and is used to recover data lost due to node failures.
The lineage of an RDD is a directed acyclic graph (DAG) that shows the dependencies between RDDs
and the transformations applied to them. Each node in the graph represents an RDD, and each edge
represents a transformation that was applied to create the new RDD.
For example, consider the following code that creates an RDD and applies two transformations to it:
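A minimal sketch of such a chain (the two operations referenced below are parallelize and filter; the names rdd1 and rdd2 match the explanation that follows, and the values are illustrative):
val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5))   // base RDD built from a local collection
val rdd2 = rdd1.filter(_ % 2 == 0)              // transformation: keep the even numbers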
If a node in the cluster fails and some data is lost, Spark can use the lineage information to reconstruct the lost data by re-executing the
transformations that were applied to create the lost RDDs. In the above example, if rdd2 was lost due to a node failure, Spark can
recreate it by re-applying the filter transformation on rdd1. If rdd1 was lost, Spark can recreate it by re-executing the parallelize
transformation.
Lineage allows Spark to achieve fault tolerance without requiring expensive replication of data across the cluster. It also enables Spark
to optimize the execution plan by reordering transformations based on their dependencies and minimizing data shuffling between nodes.
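As a quick way to inspect this lineage yourself, a sketch using toDebugString, which prints the chain of RDDs and transformations Spark has recorded (using rdd2 from the example above):
// Prints the lineage (the RDD DAG) Spark would replay to rebuild lost partitions.
println(rdd2.toDebugString)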