HDP Developer: Apache Pig and Hive

Hortonworks. We do Hadoop.

Revision 4
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Introducing Apache Spark

Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Topics Covered
• The origin of Apache Spark
• Rapid rate of growth of the Spark ecosystem
• Spark use cases
• Major differences between Spark and MapReduce

Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


What is Apache Spark?
• Apache open source project, originally developed at the
AMPLab at UC Berkeley
– 2009: Research project; part of BDAS (Berkeley Data Analytics Stack)
– Jun 2013: Accepted into the Apache Incubator
– Feb 2014: Became a top-level Apache project
– Dec 2014: Included in HDP 2.2
• A general data processing engine, focused on in-memory
distributed computing use cases
• APIs in Scala, Python, and Java
– An API for R was introduced recently
Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
The Spark ecosystem

Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Why Spark?
• Elegant developer APIs: DataFrames/SQL, machine
learning, graph algorithms, and streaming
– Scala, Python, Java, and R
– A single environment for importing, transforming, and exporting
data
• In-memory computation model
– Effective for iterative computations
• High-level API
– Lets users focus on the business logic rather than on internals

Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Why Spark cont.
• Supports a wide variety of workloads
– MLlib for data scientists
– Spark SQL for data analysts
– Spark Streaming for micro-batch use cases
– Spark Core, SQL, Streaming, MLlib, and GraphX for data
processing applications
• Fully integrated with Hadoop, and open source
• Faster than MapReduce

Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Who uses Spark!?
• NASA JPL
– Deep Space Network
• eBay
– Analysts are clustering sellers together
• Conviva
– Video stream health statistics
• Yahoo
– News story personalization

Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark vs MapReduce
[Side-by-side code comparison: pyspark vs. Java MapReduce]

• Higher-level API

• In-memory data storage
– Up to 100x performance improvement

Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark vs MapReduce Cont
• Why is Spark faster?
– Caching data in memory can avoid extra reads from disk (see the
sketch below)
– Task scheduling overhead drops from roughly 15-20 s to 15-20 ms
– Resources are dedicated for the entire life of the application
– Multiple maps and reduces can be chained together without having to
write intermediate data to HDFS
– Every reduce doesn't require a map
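
As a concrete illustration of the caching point, here is a minimal pyspark sketch (the input path and variable names are hypothetical, not from the course labs):

logs = sc.textFile("hdfs:///tmp/sample-logs")        # hypothetical input path
errors = logs.filter(lambda line: "ERROR" in line)
errors.cache()        # ask Spark to keep this RDD in memory once computed
errors.count()        # first action: reads from HDFS, filters, caches the result
errors.count()        # second action: served from memory, no extra disk read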

Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark Growth is Massive
• One of the largest open source projects
– The last release had over 1,000 commits from 230 contributing
developers
• On average, a new minor (1.x) release ships every 3 months
• Currently at Spark 1.5.2 (Nov 2015)
– Mar 2015 – Spark SQL DataFrames released (v1.3)
– Dec 2014 – Spark Streaming for Python released (v1.2)

Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark and HDP
• HDP 2.3.2 – Spark 1.4.1
• HDP 2.2.8 – Spark 1.3.1
• HDP 2.2.4 – Spark 1.2.1

Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Lesson Review
1. What are some of the reasons Spark is faster than MR?
2. What distribution of HDP has Spark 1.4.1?
3. What are the four libraries that build on Spark Core?
4. Name another benefit of using Spark vs MR.

Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Programming with Apache Spark

Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Topics Covered
• Starting the Spark shell
• Understanding what an RDD is
• Loading data from HDFS and performing a word count
• The difference between transformations and actions
• Lazy evaluation
• Lab: Getting Started with Apache Spark

Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


How to start using Apache Spark?
• The Spark shell provides an interactive way to learn
Spark, explore data, and debug applications
• Available for Python and Scala
– pyspark
– spark-shell
• Both are REPLs (read-evaluate-print loops)

Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


The SparkContext
• Main entry point for Spark applications
• All Spark applications require one
• The SparkContext has a few responsibilities
– Represents the connection to a cluster
– Used to create RDDs, accumulators, and broadcast variables on
the cluster
• The REPLs automatically create one for you
– In Spark 1.3 and later, the shell creates a SQL context too

Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Working with the SparkContext
Attributes:
• sc.appName: Spark application name
• sc.master: Spark master (local, yarn-client, etc.)
• sc.version: version of Spark being used
Functions (see the example below):
• sc.parallelize(): create an RDD from local data
• sc.textFile(): create an RDD from a text file in HDFS
• sc.stop(): stop the SparkContext
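
A short pyspark session illustrating the attributes and functions above (the printed values and the file path are illustrative only):

print(sc.appName)        # e.g. "PySparkShell" when using the REPL
print(sc.master)         # e.g. "yarn-client" or "local[*]"
print(sc.version)        # e.g. "1.4.1"

rdd_local = sc.parallelize([1, 2, 3, 4, 5])    # RDD from a local collection
rdd_hdfs = sc.textFile("mydata/data.txt")      # RDD from a text file in HDFS
# sc.stop()   # only call this when finished; the shell will not recreate sc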

Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


The Resilient Distributed Dataset
• An immutable collection of objects (or records) that can
be operated on in parallel
– Resilient: can be recreated from parent RDDs; an RDD keeps
its lineage information
– Distributed: partitions of data are distributed across nodes in
the cluster
– Dataset: a set of data that can be accessed
• Each RDD is composed of one or more partitions
– The user can control the number of partitions (see the example below)
– More partitions => more parallelism
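
A small sketch of inspecting and controlling partitioning from pyspark (the default partition count depends on your configuration):

data = range(100)
rdd_default = sc.parallelize(data)       # Spark picks the number of partitions
rdd_explicit = sc.parallelize(data, 8)   # explicitly request 8 partitions
rdd_default.getNumPartitions()           # depends on spark.default.parallelism
rdd_explicit.getNumPartitions()          # 8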
Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Create an RDD
• Load data from a file (HDFS, S3, local, etc.)
– From a single file
rdd1 = sc.textFile("file:/path/to/file.txt")
rdd2 = sc.textFile("hdfs://namenode:8020/mydata/data.txt")
– Also accepts a comma-separated list of files, or a wildcard list
of files
rdd3 = sc.textFile("mydata/*.txt")
rdd4 = sc.textFile("data1.txt,data2.txt")

Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Create an RDD
• With the parallelize() function in the driver – useful for learning
Spark and for distributing local collections of data

rdd5 = sc.parallelize([1, 2, 3, 4, 5])

rdd6 = sc.parallelize(["cat", "dog", "mouse"])

mydata = "lets try this"

rdd7 = sc.parallelize([mydata])

Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Working with RDDs and Lazy Evaluation
• RDDs have two types of operations
– Transformations: the RDD is transformed into a new RDD
– Actions: an action is performed on the RDD and a result is
returned to the driver, or data is saved somewhere
• Transformations are lazy: they do not compute until an
action is performed
Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
What does “Lazy Execution” mean?
file = sc.textFile("hdfs://some-text-file")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
# The DAG of transformations is built by Spark on the driver side

counts.saveAsTextFile("hdfs://wordcount-out")
# The action triggers execution of the whole DAG

Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark Uses Functional Programming
• Programs are built from functions instead of objects
• Mutation is forbidden – all variables are final
• Functional purity – passing the same A into a function
always gives back the same B
• Functions have input and output only – no state or side
effects
• Functions can be passed as input to other functions
• Anonymous functions – unnamed functions passed inline
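
For example, in pyspark either a named function or an anonymous lambda can be passed to a transformation; this minimal sketch is illustrative and not from the course labs:

def add_one(x):
    # a pure function: the output depends only on the input, no side effects
    return x + 1

rdd = sc.parallelize([1, 2, 3])
rdd.map(add_one).collect()            # pass a named function: [2, 3, 4]
rdd.map(lambda x: x + 1).collect()    # pass an anonymous (lambda) function: [2, 3, 4]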

Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Actions – count()
• The count() action returns the number of elements in the
RDD

data = [5, 12, -4, 7, 20]


rdd = sc.parallelize(data)
rdd.count()

The output is: 5

Page 25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Actions – reduce()
• The reduce() action has many use cases in Spark
– It aggregates the elements of an RDD using a supplied function
– That function must be commutative and associative
• a+b = b+a and a+(b+c) = (a+b)+c

Dataset: [5, 12, -4, 7, 20]

rdd.reduce(lambda a, b: a + b)
40

rdd.reduce(lambda a, b: a if (a > b) else b)
20

Page 26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Other Useful Spark Actions
• first(): return the first element in the RDD
• take(n): return the first n elements of the RDD
• collect(): return all the elements in the RDD to the driver
– Make sure you only call this on small datasets or risk crashing
your driver!
• saveAsTextFile(path): write the RDD to a file

Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark Actions: Examples
Dataset: [5, 12, -4, 7, 20]

rdd.first(): 5

rdd.take(3): [5, 12, -4]

rdd.saveAsTextFile("myfile")

Page 29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark Transformations
• Spark transformations create new RDDs from existing
ones
• Transformations are lazy: processing doesn't occur
until an action is called on the RDD, or on a subsequent RDD
– Transformations build a recipe, or lineage, for processing
– Actions trigger data to flow through the transformations and
produce the result
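
One way to see the recipe that transformations build up is RDD.toDebugString(), which prints the lineage without triggering any computation (the exact output format varies by Spark version):

rdd = sc.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)
print(evens.toDebugString())    # shows the chain of parent RDDs; nothing computed yet
evens.collect()                 # the action runs the whole lineage: [4, 8]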

Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Transformations: map()
• map() applies a function to each element of the RDD
(one input element produces one output element)

rdd = sc.parallelize([1, 2, 3, 4, 5])

rdd.map(lambda x: x*2 + 1).collect()

[3, 5, 7, 9, 11]

Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Transformations: flatMap()
• flatMap() applies a function that returns a collection for each
element of the RDD, then flattens the results (one input element
can produce many output elements)

rdd = sc.parallelize([1, 2, 3, 4, 5])

rdd.map(lambda x: [x, x*2]).collect()

[[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]

rdd.flatMap(lambda x: [x, x*2]).collect()

[1, 2, 2, 4, 3, 6, 4, 8, 5, 10]

Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Transformation: filter()
• filter() keeps only the elements that satisfy a predicate

rdd = sc.parallelize([1, 2, 3, 4, 5])

rdd.filter(lambda x: x % 2 == 0).collect()

[2, 4]

rdd.filter(lambda x: x < 3).collect()

[1, 2]

Page 33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Key Value Pair Intro (Pair RDDs)
• A key/value RDD is an RDD whose elements each consist of a
pair of values – a key and a value

• Pair RDDs are very useful for many applications

– They allow operations to be grouped by key (a join() example
follows below)
– Examples
• join()
• groupByKey()
• reduceByKey()
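
As a small illustration of a key-based operation not otherwise shown in this lesson, a join() sketch (the datasets here are made up, and the result order may vary):

prices = sc.parallelize([("apple", 1.50), ("banana", 0.75)])
stock = sc.parallelize([("apple", 10), ("banana", 0), ("cherry", 25)])
prices.join(stock).collect()
# [('apple', (1.5, 10)), ('banana', (0.75, 0))]   <- only keys present in both RDDs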

Page 34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Creating Pair RDDs
• Pair RDDs are often created from regular RDDs by using
the map() or flatMap() transformation:

wordlist = 'this is my list and it is a nice list'

rdd1 = sc.parallelize([wordlist])
kv_rdd = rdd1.flatMap(lambda x: x.split(' ')) \
             .map(lambda x: (x, 1))
kv_rdd.collect()
[('this', 1), ('is', 1), ('my', 1), ('list', 1), ('and', 1), … ('list', 1)]

Page 35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Pair RDD Transformation: reduceByKey()
• reduceByKey() performs a reduce function on all
elements of a key/value pair RDD that share a key
– The function still must be commutative and associative
• a+b = b+a and a+(b+c) = (a+b)+c

kv_rdd.reduceByKey(lambda a, b: a + b).collect()

[('this', 1), ('my', 1), ('and', 1), ('list', 2), ('a', 1), ('it', 1),
('is', 2), ('nice', 1)]

Page 36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Keys & Values Can Contain Rich Tuples
>>> notSimplePair = sc.parallelize(['I do not like green eggs and ham I do not like them Sam I am']) \
...     .flatMap(lambda sent: sent.split(' ')) \
...     .map(lambda word: ((word, 'bogus'), ('notCount', 1)))
>>> notSimplePair.sortByKey(ascending=False).take(3)
[(('them', 'bogus'), ('notCount', 1)), (('not', 'bogus'), ('notCount', 1)),
 (('not', 'bogus'), ('notCount', 1))]
>>>
>>> notSimplePair.reduceByKey(lambda oneVal, anotherVal: ('noise', oneVal[1] + anotherVal[1])) \
...     .sortByKey(ascending=False).collect()
[(('them', 'bogus'), ('notCount', 1)), (('not', 'bogus'), ('noise', 2)),
 (('like', 'bogus'), ('noise', 2)), (('ham', 'bogus'), ('notCount', 1)),
 (('green', 'bogus'), ('notCount', 1)), (('eggs', 'bogus'), ('notCount', 1)),
 (('do', 'bogus'), ('noise', 2)), (('and', 'bogus'), ('notCount', 1)),
 (('am', 'bogus'), ('notCount', 1)), (('Sam', 'bogus'), ('notCount', 1)),
 (('I', 'bogus'), ('noise', 3))]
Page 37 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Tips for Navigating Within pyspark
• Take advantage of command history with the up-arrow key, and
add operations one at a time, leveraging take() (see the example
below)

• Use dir() to get a list of current variables
– As with Pig's aliases command, there will be additional
system-oriented variable names present

• Use sc.setLogLevel('WARN') to limit extra "noise"
– This loses some visibility into helpful INFO messages at times
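
For instance, a pipeline can be built up one step at a time, checking intermediate results with take() before adding the next operation (the path and data are illustrative):

lines = sc.textFile("mydata/data.txt")                 # illustrative path
lines.take(3)                                          # peek at the raw lines first
words = lines.flatMap(lambda line: line.split(" "))
words.take(3)                                          # confirm the split looks right
pairs = words.map(lambda w: (w, 1))
pairs.take(3)                                          # then move on to reduceByKey(), etc.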
Page 38 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Lesson Review
1. What are the three ways we can create an RDD?
2. What are the two types of operations we can perform on an RDD?
1. Give an example of each
3. What is functional programming?
4. What is Lazy Execution?
5. What does the R stand for in RDD? What does that mean?

Page 39 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Conclusion and Key Points
• There are two types of operations
– Transformations, which return a new RDD
– Actions, which return a result
• Spark uses functional programming to process data
• Spark is lazy: it only does work when it has to
• RDDs are "in your mind"
– They're just a set of directions for transforming data; the data is
never stored in the RDD

Page 40 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Lab: Getting Started with Apache Spark

Page 41 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Page 42 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
