
Spark Unit-4

• Spark – Installing Spark
• An Example
• Spark Applications, Jobs, Stages, and Tasks
• A Scala Standalone Application
• A Java Example
• A Python Example
• Resilient Distributed Datasets – Creation
• Transformations and Actions
• Spark on YARN
• Analytics and visualization using Lumify, DataWrapper
History
• Apache Spark was first introduced in 2009 in the UC Berkeley R&D Lab, now known as AMPLab.
• In 2010 it became open source under a BSD license.
• In 2013 Spark was donated to the Apache Software Foundation.
• In 2014 it became a top-level Apache project.
Introduction
• Apache Spark is a lightning-fast cluster computing technology designed for fast computation.
• It was built on top of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
• Spark is a general-purpose, lightning-fast cluster computing platform.
• It is an open-source, wide-range data processing engine.
• Spark can perform both batch processing and stream processing.
• Batch processing refers to processing previously collected jobs in a single batch.
• Stream processing means dealing with Spark streaming data.
• It is designed so that it integrates with all the Big Data tools.
• Spark can run on Hadoop clusters.
• Apache Spark takes Hadoop MapReduce to the next level, adding iterative queries and stream processing.
• Spark is independent of Hadoop since it has its own cluster management system.
• Basically, it uses Hadoop for storage purposes only.

• Spark’s key feature is its in-memory cluster computation capability, which increases the processing speed of an application.
Architecture
Apache Spark Components
a. Spark Core
• Spark Core is the central point of Spark.
• Basically, it provides an execution platform for all the Spark applications.
• To support a wide array of applications, Spark provides a generalized platform.
b. Spark SQL
• On top of Spark, Spark SQL enables users to run SQL/HQL queries.
• We can process structured as well as semi-structured data by using Spark SQL.
• We can run unmodified queries up to 100 times faster on existing deployments.
c. Spark Streaming
• Basically, Spark Streaming enables powerful interactive and analytical applications over live streaming data.
• The live streams are converted into micro-batches that are executed on top of Spark Core.
d. Spark MLlib
• The machine learning library delivers both efficiency and high-quality algorithms.
• It is a popular choice for data scientists, since its in-memory data processing drastically improves the performance of iterative algorithms.
e. Spark GraphX
• Spark GraphX is the graph computation engine built on top of Apache Spark that enables processing of graph data at scale.
f. SparkR
• Basically, SparkR is used to work with Apache Spark from R.
• It is an R package that provides a light-weight frontend.
• It allows data scientists to analyze large datasets and to run jobs interactively on them from the R shell.
• The main idea behind SparkR was to explore different techniques to integrate the usability of R with the scalability of Spark.
g. Resilient Distributed Dataset – RDD

• The key abstraction of Spark is the RDD.
• It is the fundamental unit of data in Spark.
• It is a distributed collection of elements across cluster nodes.
• It supports parallel operations.
• Spark RDDs are immutable in nature.
• A new RDD can be generated by transforming an existing Spark RDD.
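A minimal sketch of these ideas in the Spark shell (assuming a SparkContext is available as sc, as in the later examples; the numbers are illustrative):
scala> val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))   // distributed collection of elements
scala> val doubled = numbers.map(_ * 2)                   // transformation: numbers is unchanged, a new RDD is produced
scala> doubled.collect()                                  // action: returns Array(2, 4, 6, 8, 10)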
Why Spark?
How to Run an Apache Spark Application on a Cluster
• The first block you see is the driver program. Once you do a spark-submit, a driver program is launched and it requests resources from the cluster manager.
• The main program of the user's processing application is initiated by the driver program.
• The execution logic is processed and, in parallel, the Spark context is also created. Using the Spark context, the different transformations and actions are processed.
• All the transformations go into the Spark context in the form of a DAG that creates the RDD lineage.
• Once an action is called, a job is created. A job is a collection of different task stages.
• The conversion of the RDD lineage into tasks is done by the DAG scheduler.
• Once the action is called, these are split into different stages of tasks and submitted to the task scheduler.
• These are then launched on the different executors in the worker nodes through the cluster manager. The entire resource allocation and the tracking of jobs and tasks are performed by the cluster manager.
Spark Execution Model
• Spark application execution involves runtime concepts such as
• driver,
• executor,
• task,
• job,
• stage.
1. Using spark-submit, the user submits an application.
2. In spark-submit, we invoke the main() method that the user specifies.
3. It also launches the driver program.
4. The driver program asks the cluster manager for the resources needed to launch executors.
5. The cluster manager launches executors on behalf of the driver program.
6. The driver process runs the user application. Based on the actions and transformations on RDDs, the driver sends work to the executors in the form of tasks.
7. The executors process the tasks and send the results back to the driver through the cluster manager.
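As an illustrative sketch of this flow, an application jar is handed to the cluster with spark-submit; the class name, master URL and jar path below are placeholders to adapt to your own setup, not values from the slides:
$ spark-submit --master local[2] --class com.example.MyApp path/to/my-app.jar      # run locally with 2 threads
$ spark-submit --master yarn --deploy-mode client --class com.example.MyApp path/to/my-app.jar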
Terminology
INSTALLATION OF SPARK
• Spark is Hadoop’s sub-project. Therefore, it is better to install Spark on a Linux-based system.
• The following steps show how to install Apache Spark.

Step 1: Verifying Java Installation
Java installation is one of the mandatory things in installing Spark. Try the following command to verify the Java version.
$ java -version
If Java is already installed on your system, you get to see the following response −
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
In case you do not have Java installed on your system, install Java before proceeding to the next step.
Step 2: Verifying Scala Installation
You need the Scala language to implement Spark. So let us verify the Scala installation using the following command.
$ scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don’t have Scala installed on your system, then proceed to the next step for Scala installation.
Step 3: Downloading Scala
Download the latest version of Scala by visiting the Download Scala link. For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.

Step 4: Installing Scala
Follow the steps given below for installing Scala.
Extract the Scala tar file
Type the following command to extract the Scala tar file.
$ tar xvf scala-2.11.6.tgz
Move the Scala software files
Use the following commands to move the Scala software files to the respective directory (/usr/local/scala).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Set PATH for Scala
Use the following command to set the PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin
Verifying the Scala Installation
After installation, it is better to verify it. Use the following command to verify the Scala installation.
$ scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
Step 5: Downloading Apache Spark
Download the latest version of Spark by visiting the Download Spark link. For this tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find the Spark tar file in the download folder.

Step 6: Installing Spark
Follow the steps given below for installing Spark.
Extracting the Spark tar file
Use the following command to extract the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Moving the Spark software files
Use the following commands to move the Spark software files to the respective directory (/usr/local/spark).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Setting up the environment for Spark
Add the following line to the ~/.bashrc file. This adds the location where the Spark software files are located to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command to source the ~/.bashrc file.
$ source ~/.bashrc
Step 7: Verifying the Spark Installation
Write the following command to open the Spark shell.
$ spark-shell
If Spark is installed successfully, you will see output similar to the following.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to Spark version 1.4.0
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Spark Application Model
• Apache Spark is widely considered to be the successor to MapReduce for general-purpose data processing on Apache Hadoop clusters.
• Like MapReduce applications, each Spark application is a self-contained computation that runs user-supplied code to compute a result.
• As with MapReduce jobs, Spark applications can use the resources of multiple hosts.
• Spark has many advantages over MapReduce.
Interactive Analysis with the Spark Shell
• Spark’s interactive shell provides a simple way to learn the API, as
well as a powerful tool to analyze datasets interactively.
• Start the shell by running ./bin/spark-shell in the Spark directory.
• Spark’s primary abstraction is a distributed collection of items called a
Resilient Distributed Dataset (RDD).
• RDDs can be created from Hadoop InputFormats (such as HDFS files)
or by transforming other RDDs.
• Let’s make a new RDD from the text of the README file in the Spark source
directory:
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
Running Your First Spark Application

• The simplest way to run a Spark application is by using the Scala or Python shells.
• To start one of the shell applications, run one of the following commands:
• Scala:
$ SPARK_HOME/bin/spark-shell
scala>
• Python:
$ SPARK_HOME/bin/pyspark
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information
...
Welcome to Spark version ...
Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
• To run the classic Hadoop word count application, copy an input file to HDFS:
$ hdfs dfs -put input
• Within a shell, run the word count application using the following code examples, substituting your own values for namenode_host, path/to/input, and path/to/output:
Scala
scala> val myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word,
1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("hdfs://namenode:8020/path/to/output")
• Python
>>> myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word:
(word, 1)).reduceByKey(lambda v1,v2: v1 + v2)
>>> counts.saveAsTextFile("hdfs://namenode:8020/path/to/output")
Actions & Transformations
Apache Spark RDDs support two types of operations:
• Transformations, which return pointers to new RDDs.
• Actions, which return values.
• RDD Transformation
• A Spark transformation is a function that produces a new RDD from existing RDDs.
• It takes an RDD as input and produces one or more RDDs as output.
• Each time we apply a transformation, it creates a new RDD.
• Thus the input RDDs cannot be changed, since RDDs are immutable in nature.
• Applying transformations builds an RDD lineage, with all the parent RDDs of the final RDD(s).
• RDD lineage is also known as the RDD operator graph or RDD dependency graph.
• It is a logical execution plan, i.e. a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.
• Transformations are lazy in nature, i.e. they get executed only when we call an action.
• They are not executed immediately.
• The two most basic transformations are map() and filter().
• After a transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter(), count(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()) or the same size (e.g. map()).
We will use the filter transformation to return a new RDD with a subset of the items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
There are two types of transformations:
• Narrow transformation
• Wide transformation
• Narrow transformation – In a narrow transformation, all the elements that are required to compute the records in a single partition live in a single partition of the parent RDD.
• Only a limited subset of partitions is used to calculate the result.
• Narrow transformations are the result of map() and filter().
Spark Essentials: Transformations
• map(func): return a new distributed dataset formed by passing each element of the source through a function func
• filter(func): return a new dataset formed by selecting those elements of the source on which func returns true
• flatMap(func): similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
• sample(withReplacement, fraction, seed): sample a fraction fraction of the data, with or without replacement, using a given random number generator seed
• union(otherDataset): return a new dataset that contains the union of the elements in the source dataset and the argument
• distinct([numTasks]): return a new dataset that contains the distinct elements of the source dataset
• Wide transformation – In a wide transformation, the elements that are required to compute the records in a single partition may live in many partitions of the parent RDD.
• Wide transformations are the result of groupByKey() and reduceByKey().
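A short sketch in the Spark shell contrasting the two types (the word data here is illustrative, not from the slides):
scala> val words = sc.parallelize(Seq("spark", "hadoop", "spark", "flink"))
scala> val pairs = words.map(word => (word, 1))      // narrow: each output partition depends on a single parent partition
scala> val counts = pairs.reduceByKey(_ + _)         // wide: values for one key may come from many parent partitions (shuffle)
scala> counts.collect()                              // e.g. Array((spark,2), (hadoop,1), (flink,1))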
map(func)
• The map() function iterates over every element in the RDD and produces a new RDD.
• Using the map() transformation we pass in any function, and that function is applied to every element of the RDD.
• In map, we have the flexibility that the input and the return type of the RDD may differ from each other.
• For example, we can have an input RDD of type String; after applying the map() function the returned RDD can be Boolean.
• For example, for the RDD {1, 2, 3, 4, 5}, applying "rdd.map(x => x + 2)" gives the result (3, 4, 5, 6, 7).
Map() example:
import org.apache.spark.sql.SparkSession

object mapTest {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder.appName("mapExample").master("local").getOrCreate()
    val data = spark.read.textFile("spark_test.txt").rdd
    val mapFile = data.map(line => (line, line.length))
    mapFile.foreach(println)
  }
}
flatMap()
• With the help of the flatMap() function, for each input element we can have many elements in the output RDD.
• The simplest use of flatMap() is to split each input string into words.
• map() and flatMap() are similar in that they take a line from the input RDD and apply a function to it.
• The key difference between map() and flatMap() is that map() returns only one element, while flatMap() can return a list of elements.
• flatMap() example:
val data = spark.read.textFile("spark_test.txt").rdd
val flatmapFile = data.flatMap(lines => lines.split(" "))
flatmapFile.foreach(println)
• In the above code, the flatMap() function splits each line wherever a space occurs.
map() vs. flatMap()
filter(func)
• Spark's RDD filter() function returns a new RDD containing only the elements that satisfy a predicate.
• It is a narrow operation because it does not shuffle data from one partition to many partitions.
• For example, suppose an RDD contains the first five natural numbers (1, 2, 3, 4 and 5) and the predicate checks for even numbers. The resulting RDD after the filter will contain only the even numbers, i.e. 2 and 4.
• filter() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value == "spark")
println(mapFile.count())
• In the above code, flatMap() maps each line into words, filter() keeps only the words equal to "spark", and the count() action then counts them.
union(dataset)
• With the union() function, we get the elements of both RDDs in the new RDD.
• The key rule of this function is that the two RDDs should be of the same type.
• For example, if the elements of RDD1 are (Spark, Spark, Hadoop, Flink) and those of RDD2 are (Big data, Spark, Flink), the resultant rdd1.union(rdd2) will have the elements (Spark, Spark, Spark, Hadoop, Flink, Flink, Big data).
• union() example:
val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014)))
val rdd2 = spark.sparkContext.parallelize(Seq((5,"dec",2014),(17,"sep",2015)))
val rdd3 = spark.sparkContext.parallelize(Seq((6,"dec",2011),(16,"may",2015)))
val rddUnion = rdd1.union(rdd2).union(rdd3)
rddUnion.foreach(println)
• In the above code, the union() operation returns a new dataset that contains the union of the elements in the source dataset (rdd1) and the arguments (rdd2 & rdd3).
Intersection
• With the intersection() function, we get only the common elements of both RDDs in the new RDD.
• The key rule of this function is that the two RDDs should be of the same type.
• For example, if the elements of RDD1 are (Spark, Spark, Hadoop, Flink) and those of RDD2 are (Big data, Spark, Flink), the resultant rdd1.intersection(rdd2) will have the element (Spark).
• intersection() example:
val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014)))
val rdd2 = spark.sparkContext.parallelize(Seq((5,"dec",2014),(1,"jan",2016)))
val common = rdd1.intersection(rdd2)
common.foreach(println)
• The intersection() operation returns a new RDD containing the intersection of the elements in rdd1 & rdd2.
distinct()
• It returns a new dataset that contains the distinct elements of the source dataset.
• It is helpful for removing duplicate data.
• For example, if an RDD has elements (Spark, Spark, Hadoop, Flink), then rdd.distinct() will give the elements (Spark, Hadoop, Flink).
• distinct() example:
val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014),(3,"nov",2014)))
val result = rdd1.distinct()
println(result.collect().mkString(", "))
• In the above example, the distinct function removes the duplicate record, i.e. (3,"nov",2014).
• groupByKey()
• When we use groupByKey() on a dataset of (K, V) pairs, the data is shuffled according to the key value K into another RDD.
• In this transformation, lots of unnecessary data get transferred over the network.
• reduceByKey(func, [numTasks])
• When we use reduceByKey() on a dataset of (K, V) pairs, the pairs on the same machine with the same key are combined before the data is shuffled.
• sortByKey()
• When we apply the sortByKey() function on a dataset of (K, V) pairs, the data is sorted according to the key K into another RDD.
• join()
• Join is a database term. It combines fields from two tables using common values. The join() operation in Spark is defined on pair-wise RDDs.
• Pair-wise RDDs are RDDs in which each element is a tuple, where the first element is the key and the second element is the value.
• The join() operation combines two datasets on the basis of the key.
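A hedged sketch of these pair-RDD operations in the Spark shell (the subject and teacher data are illustrative only):
scala> val marks = sc.parallelize(Seq(("maths", 52), ("english", 75), ("maths", 85)))
scala> marks.reduceByKey(_ + _).collect()            // Array((maths,137), (english,75)): values combined per key
scala> marks.sortByKey().collect()                   // pairs sorted according to the key K
scala> val teachers = sc.parallelize(Seq(("maths", "Sharma"), ("english", "Patil")))
scala> marks.join(teachers).collect()                // e.g. Array((maths,(52,Sharma)), (maths,(85,Sharma)), (english,(75,Patil)))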
coalesce()
• To avoid full shuffling of data, we use the coalesce() function.
• In coalesce() we use the existing partitions so that less data is shuffled.
• Using this we can cut down the number of partitions.
• Suppose we have four nodes and we want only two nodes. Then the data of the extra nodes will be kept on the nodes that we keep.
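For example, a minimal sketch (the data and partition counts are illustrative):
scala> val rdd = sc.parallelize(1 to 12, 4)          // RDD with 4 partitions
scala> rdd.partitions.length                         // 4
scala> val fewer = rdd.coalesce(2)                   // reuses existing partitions, avoids a full shuffle
scala> fewer.partitions.length                       // 2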
RDD Action
• Transformations create RDDs from each other, but when we want to work with the actual dataset, an action is performed.
• The final result of an RDD computation is produced by an action.
• It uses the DAG to execute the tasks.
• When an action is triggered, a new RDD is not formed, unlike a transformation.
• First it loads the data into the original RDD, then performs all intermediate transformations, and finally returns the result to the driver program.
• The values of an action are stored to the driver or to the external storage system.
• An action is one of the ways of sending data from the executors to the driver.
• Executors are agents that are responsible for executing tasks.
• The driver is a JVM process that coordinates the workers and the execution of tasks.
Example of Actions
scala> textFile.count()
// Number of items in this RDD
res0: Long = 74
scala> textFile.first()
// First item in this RDD
res1: String = # Apache Spark
• count()
• The count() action returns the number of elements in the RDD.
• For example, for an RDD with values {1, 2, 2, 3, 4, 5, 5, 6}, "rdd.count()" will give the result 8.
• collect()
• The collect() action is the most common and simplest operation that returns the entire content of the RDD to the driver program.
• One application of collect() is unit testing, where the entire RDD is expected to fit in memory; it makes it easy to compare the result of the RDD with the expected result.
• The collect() action has the constraint that all the data should fit in one machine, as it is copied to the driver.
• take(n)
• The take(n) action returns n elements from the RDD.
• It tries to cut down the number of partitions it accesses, so it represents a biased collection; we cannot presume the order of the elements.
• For example, for the RDD {1, 2, 2, 3, 4, 5, 5, 6}, "take(4)" may give the result {2, 2, 3, 4}.
• top()
• If an ordering is present in our RDD, then we can extract the top elements from our RDD using top(). The top() action uses the default ordering of the data.
countByValue()
• The countByValue() action returns how many times each element occurs in the RDD.
• For example, for an RDD with values {1, 2, 2, 3, 4, 5, 5, 6}, "rdd.countByValue()" will give the result {(1,1), (2,2), (3,1), (4,1), (5,2), (6,1)}.
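A combined sketch of these actions on a small RDD in the Spark shell (the values match the examples above):
scala> val rdd = sc.parallelize(Seq(1, 2, 2, 3, 4, 5, 5, 6))
scala> rdd.count()                                   // 8
scala> rdd.collect()                                 // entire RDD content returned to the driver
scala> rdd.take(4)                                   // any 4 elements; order is not guaranteed
scala> rdd.top(3)                                    // Array(6, 5, 5), using the default ordering
scala> rdd.countByValue()                            // Map(1 -> 1, 2 -> 2, 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 1)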
reduce()
• The reduce() function takes two elements at a time as input from the RDD and produces an output of the same type as the input elements. A simple form of such a function is addition: we can add the elements of an RDD or count the number of words. It accepts commutative and associative operations as arguments.
rdd = sc.parallelize([1, 2, 3, 4, 5])
sum_result = rdd.reduce(lambda x, y: x + y)
print(sum_result)  # Output: 15 (1 + 2 + 3 + 4 + 5)

rdd = sc.parallelize([10, 7, 15, 3, 9])
max_value = rdd.reduce(lambda x, y: x if x > y else y)
print(max_value)  # Output: 15
How to Create RDDs in Apache Spark?
• Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark. RDDs are immutable and fault tolerant in nature.
• These are distributed collections of objects.
• The datasets are divided into logical partitions, which are computed on different nodes over the cluster.
• Thus, an RDD is just a way of representing a dataset distributed across multiple machines, which can be operated on in parallel.
• RDDs are called resilient because they can always be re-computed.
• There are three ways to create an RDD in Spark:
• Parallelizing an already existing collection in the driver program.
• Referencing a dataset in an external storage system (e.g. HDFS, HBase, a shared file system).
• Creating an RDD from already existing RDDs.
Parallelized collection (parallelizing)
• RDDs are generally created from a parallelized collection, i.e. by taking an existing collection in the program and passing it to SparkContext's parallelize() method.
• This method is used in the initial stages of learning Spark, since it quickly creates our own RDDs in the Spark shell and lets us perform operations on them.
• This method is rarely used outside testing and prototyping because it requires the entire dataset on one machine.
• Consider the following example of sortByKey(). Here, the data to be sorted is taken through a parallelized collection:

val data = spark.sparkContext.parallelize(Seq(("maths",52),("english",75),("science",82),("computer",65),("maths",85)))
val sorted = data.sortByKey()
sorted.foreach(println)

• The key point to note in parallelized collection is the number of partition the dataset is cut into.
• Spark will run one task for each partition of cluster.
• We require two to four partitions for each CPU in cluster.
• Spark sets number of partition based on our cluster.
• But we can also manually set the number of partitions.
• This is achieved by passing number of partition as second parameter to parallelize .
e.g. sc.parallelize(data, 10), here we have manually given number of partition as 10.
• Consider one more example, here we have used parallelized collection
and manually given the number of partitions:

[php]val rdd1 =
spark.sparkContext.parallelize(Array(“jan”,”feb”,”mar”,”april”,”may”,”j
un”),3)
val result = rdd1.coalesce(2)
result.foreach(println)[/php]
External Datasets (Referencing a dataset)
• In Spark, the distributed dataset can be formed from any data source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, etc.
• In this approach, the data is loaded from an external dataset.
• To create a text file RDD, we can use SparkContext's textFile method.
• It takes a URL of the file and reads it as a collection of lines.
• The URL can be a local path on the machine or an hdfs://, s3n://, etc. path.
• The point to note is that the path on the local file system and on the worker nodes should be the same; the file should be present at the same destination both on the local file system and on the worker nodes.
• We can copy the file to the worker nodes or use a network-mounted shared file system.
• The DataFrameReader interface is used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc.).
• Use SparkSession.read to access an instance of DataFrameReader.
• DataFrameReader supports many file formats.
File Formats
a. csv(String path)
• It loads a CSV file and returns the result as a Dataset<Row>.
Example:
import org.apache.spark.sql.SparkSession

object DataFormat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("AvgAnsTime").master("local").getOrCreate()
    val dataRDD = spark.read.csv("path/of/csv/file").rdd
  }
}
Note – Here the .rdd method is used to convert a Dataset<Row> to an RDD<Row>.
b. json(String path)
• It loads a JSON file (one object per line) and returns the result as a Dataset<Row>.
val dataRDD = spark.read.json("path/of/json/file").rdd
c. textFile(String path)
• It loads text files and returns a Dataset of String.
val dataRDD = spark.read.textFile("path/of/text/file").rdd
Creating RDD from an existing RDD
• A transformation mutates one RDD into another RDD, thus transformation is the way to create an RDD from an already existing RDD.
• A transformation acts as a function that takes in an RDD and produces one.
• The input RDD does not get changed, because RDDs are immutable in nature, but it produces one or more RDDs by applying operations.
• Some of the operations applied on RDDs are: filter, count, distinct, map, flatMap.
• Example:
val words = spark.sparkContext.parallelize(Seq("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
val wordPair = words.map(w => (w.charAt(0), w))
wordPair.foreach(println)
• In the above code, the RDD "wordPair" is created from the existing RDD "words" using the map() transformation, pairing each word with its starting character.
Scala application
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "$YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val sc = new SparkContext("local", "Simple App", "YOUR_SPARK_HOME",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
• This program just counts the number of lines containing ‘a’ and the number of lines containing ‘b’ in the Spark README.
• Note that you’ll need to replace $YOUR_SPARK_HOME with the
location where Spark is installed.
• We initialize a SparkContext as part of the program.
• We pass the SparkContext constructor four arguments,
• the type of scheduler we want to use (in this case, a local scheduler),
• a name for the application,
• the directory where Spark is installed, and
• a name for the jar file containing the application’s code.
• The final two arguments are needed in a distributed setting, where
Spark is running across several nodes, so we include them for
completeness.
• Spark will automatically ship the jar files you list to slave nodes.
• This file depends on the Spark API, so we’ll also include an sbt
configuration file, simple.sbt which explains that Spark is a
dependency.
• This file also adds a repository that Spark depends on:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.2"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
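Assuming the standard sbt layout (SimpleApp.scala under src/main/scala and simple.sbt in the project root), the application can then be packaged and run; treat the exact commands as a sketch for this Spark and Scala version:
$ sbt package      # produces target/scala-2.10/simple-project_2.10-1.0.jar
$ sbt run          # runs SimpleApp locally and prints the two line counts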
A Standalone App in Java
• We’ll create a very simple Spark application, SimpleApp.java:

/*** SimpleApp.java ***/
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "$YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
      "$YOUR_SPARK_HOME", new String[]{"target/simple-project-1.0.jar"});
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}
This program just counts the number of lines containing ‘a’ and the
number containing ‘b’ in a text file. Note that you’ll need to replace
$YOUR_SPARK_HOME with the location where Spark is installed.
As with the Scala example, we initialize a SparkContext, though we use
the special JavaSparkContext class to get a Java-friendly one.
We also create RDDs (represented by JavaRDD) and run
transformations on them.
Finally, we pass functions to Spark by creating classes that extend
spark.api.java.function.Function.
A Standalone App in Python
As an example, we’ll create a simple Spark application, SimpleApp.py:

"""SimpleApp.py"""
from pyspark import SparkContext
logFile = "$YOUR_SPARK_HOME/README.md" # Should be some file on
your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print "Lines with a: %i, lines with b: %i" % (numAs, numBs)


This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file.
Note that you’ll need to replace $YOUR_SPARK_HOME with the location where Spark is installed.
As with the Scala and Java examples, we use a SparkContext to create RDDs.
We can pass Python functions to Spark, which are automatically serialized along with any variables that
they reference.
For applications that use custom classes or third-party libraries, we can add those code dependencies to
SparkContext to ensure that they will be available on remote machines; this is described in more detail in
the Python programming guide.
SimpleApp is simple enough that we do not need to specify any code dependencies.

We can run this application using the bin/pyspark script:

$ cd $SPARK_HOME
$ ./bin/pyspark SimpleApp.py
...
Lines with a: 46, Lines with b: 23
