Unit 4 Spark Updated
$ java -version
If Java is already installed on your system, you will see the installed version in the response.
• To run the classic Hadoop word count application, copy an input file to HDFS:
• $ hdfs dfs -put input
• Within a shell, run the word count application using the following code examples,
substituting your values for namenode_host, path/to/input, and path/to/output:
Scala
scala> val myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
• Python
>>> myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
>>> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
Actions & Transformations
Apache Spark RDDs support two types of operations:
• Transformations, which return pointers to new RDDs.
• Actions, which return values to the driver program.
• RDD Transformation
• A Spark transformation is a function that produces a new RDD from existing RDDs.
• It takes an RDD as input and produces one or more RDDs as output.
• Each time we apply a transformation, a new RDD is created.
• The input RDDs cannot be changed, since RDDs are immutable in nature.
• Applying transformations builds an RDD lineage, with all the parent RDDs of the final RDD(s).
• RDD lineage is also known as the RDD operator graph or RDD dependency graph.
• It is a logical execution plan, i.e., a Directed Acyclic Graph (DAG) of all the parent
RDDs of an RDD.
• Transformations are lazy in nature, i.e., they execute only when we call an action.
• They are not executed immediately.
• The two most basic types of transformations are map() and filter().
After a transformation, the resultant RDD is always different from its parent RDD.
• It can be smaller (e.g. filter(), distinct(), sample()), bigger
(e.g. flatMap(), union(), cartesian()) or the same size (e.g. map()).
We will use the filter transformation to return a new RDD with a subset of the items in
the file.
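A minimal sketch of filter() in the Scala shell, reusing the myfile RDD from the word-count example above; the search word "Spark" is only an illustrative assumption:
scala> val sparkLines = myfile.filter(line => line.contains("Spark"))  // keep only the lines that mention "Spark"
scala> sparkLines.count()                                              // action: number of matching lines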
union()
• With the union() function, we get a new RDD that contains all the elements of both source RDDs.
• For example, the elements of RDD1 are (Spark, Spark, Hadoop, Flink) and
those of RDD2 are (Big data, Spark, Flink),
so the resultant rdd1.union(rdd2) will have the elements (Spark, Spark, Spark, Hadoop, Flink,
Flink, Big data).
• Union() example:
[php]val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014)))
val rdd2 = spark.sparkContext.parallelize(Seq((5,"dec",2014),(17,"sep",2015)))
val rdd3 = spark.sparkContext.parallelize(Seq((6,"dec",2011),(16,"may",2015)))
val rddUnion = rdd1.union(rdd2).union(rdd3)
rddUnion.foreach(println)[/php]
In the above code, the union() operation returns a new dataset that contains the union of the
elements in the source dataset (rdd1) and the arguments (rdd2 & rdd3).
Intersection
• With the intersection() function, we get only the common elements of both
RDDs in the new RDD.
• The key rule of this function is that the two RDDs should be of the same type.
• Consider an example: the elements of RDD1 are (Spark, Spark, Hadoop, Flink)
and those of RDD2 are (Big data, Spark, Flink), so the
resultant rdd1.intersection(rdd2) will have the elements (Spark, Flink).
• Intersection() example:
[php]val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014)))
val rdd2 = spark.sparkContext.parallelize(Seq((5,"dec",2014),(1,"jan",2016)))
val common = rdd1.intersection(rdd2)
common.foreach(println)[/php]
• The intersection() operation returns a new RDD.
• It contains the intersection of the elements in rdd1 & rdd2.
distinct()
• It returns a new dataset that contains the distinct elements of the source
dataset.
• It is helpful to remove duplicate data.
For example, if RDD has elements (Spark, Spark, Hadoop,
Flink), then rdd.distinct() will give elements (Spark, Hadoop, Flink).
• [php]val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),(16,"feb",2014),(3,"nov",2014)))
val result = rdd1.distinct()
println(result.collect().mkString(", "))[/php]
• In the above example, the distinct() function removes the duplicate record,
i.e. (3,"nov",2014).
• groupByKey()
• When we use groupByKey() on a dataset of (K, V) pairs, the data is shuffled
according to the key value K into another RDD.
• In this transformation, lots of unnecessary data may get transferred over the
network.
• reduceByKey(func, [numTasks])
• When we use reduceByKey on a dataset of (K, V) pairs, the pairs on the same
machine with the same key are combined before the data is shuffled.
• sortByKey()
• When we apply the sortByKey() function on a dataset of (K, V) pairs,
the data is sorted according to the key K into another RDD.
• join()
• Join is a database term: it combines the fields from two tables using
common values. The join() operation in Spark is defined on pair-wise RDDs.
• Pair-wise RDDs are RDDs in which each element is a tuple,
where the first element is the key and the second element is the value.
• The join() operation combines two data sets on the basis of the key (see the sketch below).
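A minimal sketch of these pair-RDD operations, reusing the (subject, marks) pairs from the sortByKey() example below; the teachers RDD is a made-up illustration:
[php]val marks = spark.sparkContext.parallelize(Seq(("maths",52),("english",75),("science",82),("maths",85)))
val grouped = marks.groupByKey()        // (K, Iterable[V]); shuffles all the values
val totals = marks.reduceByKey(_ + _)   // combines values per key before the shuffle
val sorted = totals.sortByKey()         // sorts the pairs by key
val teachers = spark.sparkContext.parallelize(Seq(("maths","Asha"),("science","Ravi")))
val joined = totals.join(teachers)      // (K, (total, teacher)) for keys present in both RDDs
joined.foreach(println)[/php]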
collect()
• The collect() action returns the entire contents of the RDD to the driver program,
so it has the constraint that all the data should fit on that machine.
• take(n)
The action take(n) returns n elements from the RDD.
It tries to reduce the number of partitions it accesses, so it may represent a biased
collection.
We cannot presume the order of the elements.
For example, for the RDD {1, 2, 2, 3, 4, 5, 5, 6}, take(4) might
give the result {2, 2, 3, 4}.
• top()
If an ordering is defined on the elements of our RDD, then we can extract the top elements
from the RDD using top(). The top() action uses the default ordering of the data.
countByValue()
• The countByValue() action returns how many times each element occurs in the RDD.
• For example, for the RDD {1, 2, 2, 3, 4, 5, 5, 6},
rdd.countByValue() will give the result {(1,1), (2,2), (3,1), (4,1), (5,2),
(6,1)} (see the sketch below).
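A minimal sketch of collect(), take(), top() and countByValue(), using the same element values as the examples above:
[php]val nums = spark.sparkContext.parallelize(Seq(1, 2, 2, 3, 4, 5, 5, 6))
println(nums.collect().mkString(", "))  // brings the whole RDD to the driver
println(nums.take(4).mkString(", "))    // n elements; order not guaranteed
println(nums.top(3).mkString(", "))     // 6, 5, 5 with the default ordering
println(nums.countByValue())            // counts: 1 -> 1, 2 -> 2, 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 1[/php]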
reduce()
• The reduce() action takes two elements at a time as input from the RDD
and produces an output of the same type as the input
elements. The simplest form of such a function is addition: we can
add the elements of an RDD or count the number of words. reduce() accepts a
commutative and associative operation as an argument.
rdd = sc.parallelize([1, 2, 3, 4, 5])
sum_result = rdd.reduce(lambda x, y: x + y)
print(sum_result) # Output: 15 (1 + 2 + 3 + 4 + 5)
• sortByKey() example:
[php]val data = spark.sparkContext.parallelize(Seq(("maths",52),("english",75),("science",82),
("computer",65),("maths",85)))
val sorted = data.sortByKey()
sorted.foreach(println)[/php]
• A key point to note with parallelized collections is the number of partitions the dataset is cut into.
• Spark will run one task for each partition of the cluster.
• Typically, we want two to four partitions for each CPU in the cluster.
• Spark sets the number of partitions automatically based on the cluster.
• But we can also set the number of partitions manually.
• This is achieved by passing the number of partitions as the second parameter to parallelize,
e.g. sc.parallelize(data, 10); here we have manually set the number of partitions to 10 (see the sketch below).
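A minimal sketch in the Scala shell; getNumPartitions simply reports how many partitions the RDD was cut into:
scala> val data = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data, 10)  // manually ask for 10 partitions
scala> distData.getNumPartitions                // Int = 10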
• Consider one more example: here we use a parallelized collection, manually set the
number of partitions, and then reduce it with coalesce():
[php]val rdd1 = spark.sparkContext.parallelize(Array("jan","feb","mar","april","may","jun"), 3)
val result = rdd1.coalesce(2)
result.foreach(println)[/php]
External Datasets (Referencing a dataset)
• In Spark, a distributed dataset can be formed from any data source supported by
Hadoop, including the local file system, HDFS, Cassandra, HBase, etc.
• In this case, the data is loaded from an external dataset.
• To create a text file RDD, we can use SparkContext's textFile method (see the sketch after this list).
• It takes the URL of the file and reads it as a collection of lines.
• The URL can be a local path on the machine, or an hdfs://, s3n://, etc. URI.
• The point to note is that the path on the local file system and on the worker nodes
should be the same.
• The file should be present at the same destination both on the local file system and
on the worker nodes.
• We can copy the file to the worker nodes or use a network-mounted shared file
system.
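A minimal sketch in the Scala shell, assuming a file named data.txt that exists at the same path on every worker node:
scala> val distFile = sc.textFile("data.txt")           // one element per line
scala> distFile.map(line => line.length).reduce(_ + _)  // total number of characters in the file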
• The DataFrameReader interface is used to load a Dataset from external storage systems
(e.g. file systems, key-value stores, etc.).
• Use SparkSession.read to access an instance of DataFrameReader.
• DataFrameReader supports many file formats-
File Formats
• csv (String path)
Example:
[php]import org.apache.spark.sql.SparkSession

object DataFormat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("AvgAnsTime").master("local").getOrCreate()
    val dataRDD = spark.read.csv("path/of/csv/file").rdd
  }
}[/php]
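Besides csv, DataFrameReader exposes readers for other common formats; a brief sketch using the spark session created above (the paths are placeholders):
[php]val jsonDF = spark.read.json("path/of/json/file")        // JSON, one record per line by default
val parquetDF = spark.read.parquet("path/of/parquet/file") // columnar Parquet files
val textDF = spark.read.text("path/of/text/file")          // a single string column named "value"[/php]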
"""SimpleApp.py"""
from pyspark import SparkContext
logFile = "$YOUR_SPARK_HOME/README.md" # Should be some file on
your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
$ cd $SPARK_HOME
$ ./bin/pyspark SimpleApp.py
...
Lines with a: 46, Lines with b: 23