HDP Developer: Apache Pig and Hive

Hortonworks. We do Hadoop.

Revision 4
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Introducing Apache Spark

Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Topics Covered
• The origin of Apache Spark
• Rapid rate of growth of the Spark ecosystem
• Spark use cases
• Major differences between Spark and MapReduce

Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


What is Apache Spark?
• Apache open source project, originally developed at the
AMPLab at UC Berkeley
– 2009: Research project; part of BDAS (Berkeley Data Analytics Stack)
– Jun 2013: Accepted into the Apache Incubator
– Feb 2014: Became a top-level Apache project
– Dec 2014: Included in HDP 2.2
• A general data processing engine, focused on in-memory
distributed computing use cases
• APIs in Scala, Python, and Java
– An API for R was introduced recently
Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
The Spark ecosystem

Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Why Spark?
• Elegant developer APIs: DataFrames/SQL, machine
learning, graph algorithms, and streaming
– Scala, Python, Java, and R
– A single environment for importing, transforming, and exporting
data
• In-memory computation model
– Effective for iterative computations
• High-level API
– Lets users focus on the business logic rather than on internals

Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Why Spark cont.
• Supports a wide variety of workloads
– MLlib for data scientists
– Spark SQL for data analysts
– Spark Streaming for micro-batch use cases
– Spark Core, SQL, Streaming, MLlib, and GraphX for data
processing applications
• Fully integrated with Hadoop, and open source
• Faster than MapReduce

Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Who uses Spark!?
• NASA JPL
– Deep Space Network
• eBay
– Analysts are clustering sellers together
• Conviva
– Video stream health statistics
• Yahoo
– News story personalization

Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark vs MapReduce
[Side-by-side code comparison: pyspark vs. Java MapReduce]

• Higher-level API

• In-memory data storage
– Up to 100x performance improvement

Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark vs MapReduce Cont
• Why is Spark faster?
– Caching data in memory can avoid extra reads from disk (see the
sketch below)
– Task scheduling overhead drops from roughly 15-20 s to 15-20 ms
– Resources are dedicated for the entire life of the application
– Multiple maps and reduces can be chained together without having to
write intermediate data to HDFS
– Every reduce doesn't require a map
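
As a concrete illustration of the caching point, here is a minimal pyspark sketch (the input path and variable names are hypothetical, not from the course labs):

logs = sc.textFile("hdfs:///tmp/sample-logs")        # hypothetical input path
errors = logs.filter(lambda line: "ERROR" in line)
errors.cache()        # ask Spark to keep this RDD in memory once computed
errors.count()        # first action: reads from HDFS, filters, caches the result
errors.count()        # second action: served from memory, no extra disk read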

Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark Growth is Massive
• One of the largest open source projects
– The last release had over 1,000 commits from 230 contributing
developers
• On average, a new minor (1.x) release ships every 3 months
• Currently at Spark 1.5.2 (Nov 2015)
– Mar 2015 – Spark SQL DataFrames released (v1.3)
– Dec 2014 – Spark Streaming for Python released (v1.2)

Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark and HDP
• HDP 2.3.2 – Spark 1.4.1
• HDP 2.2.8 – Spark 1.3.1
• HDP 2.2.4 – Spark 1.2.1

Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Lesson Review
1. What are some of the reasons Spark is faster than MR?
2. What distribution of HDP has Spark 1.4.1?
3. What are the four libraries that build on Spark Core?
4. Name another benefit of using Spark vs MR.

Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Programming with Apache Spark

Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Topics Covered
• Starting the Spark shell
• Understanding what an RDD is
• Loading data from HDFS and performing a word count
• The difference between transformations and actions
• Lazy evaluation
• Lab: Getting Started with Apache Spark

Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


How to start using Apache Spark?
• The Spark shell provides an interactive way to learn
Spark, explore data, and debug applications
• Available for Python and Scala
– pyspark
– spark-shell
• Both are REPLs (read-evaluate-print loops)

Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


The SparkContext
• Main entry point for Spark applications
• All Spark applications require one
• The SparkContext has a few responsibilities
– Represents the connection to a cluster
– Used to create RDDs, accumulators, and broadcast variables on
the cluster
• The REPLs automatically create one for you
– In Spark 1.3 and later, the shell creates a SQL context too

Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Working with the SparkContext
Attributes:
• sc.appName: Spark application name
• sc.master: Spark master (local, yarn-client, etc.)
• sc.version: version of Spark being used
Functions (see the example below):
• sc.parallelize(): create an RDD from local data
• sc.textFile(): create an RDD from a text file in HDFS
• sc.stop(): stop the SparkContext
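
A short pyspark session illustrating the attributes and functions above (the printed values and the file path are illustrative only):

print(sc.appName)        # e.g. "PySparkShell" when using the REPL
print(sc.master)         # e.g. "yarn-client" or "local[*]"
print(sc.version)        # e.g. "1.4.1"

rdd_local = sc.parallelize([1, 2, 3, 4, 5])    # RDD from a local collection
rdd_hdfs = sc.textFile("mydata/data.txt")      # RDD from a text file in HDFS
# sc.stop()   # only call this when finished; the shell will not recreate sc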

Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


The Resilient Distributed Dataset
• An immutable collection of objects (or records) that can
be operated on in parallel
– Resilient: can be recreated from parent RDDs; an RDD keeps
its lineage information
– Distributed: partitions of data are distributed across nodes in
the cluster
– Dataset: a set of data that can be accessed
• Each RDD is composed of one or more partitions
– The user can control the number of partitions (see the example below)
– More partitions => more parallelism
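
A small sketch of inspecting and controlling partitioning from pyspark (the default partition count depends on your configuration):

data = range(100)
rdd_default = sc.parallelize(data)       # Spark picks the number of partitions
rdd_explicit = sc.parallelize(data, 8)   # explicitly request 8 partitions
rdd_default.getNumPartitions()           # depends on spark.default.parallelism
rdd_explicit.getNumPartitions()          # 8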
Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Create an RDD
• Load data from a file (HDFS, S3, local, etc.)
– From a single file
rdd1 = sc.textFile("file:/path/to/file.txt")
rdd2 = sc.textFile("hdfs://namenode:8020/mydata/data.txt")
– Also accepts a comma-separated list of files, or a wildcard list
of files
rdd3 = sc.textFile("mydata/*.txt")
rdd4 = sc.textFile("data1.txt,data2.txt")

Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Create an RDD
• With the parallelize() function in the driver – useful for learning
Spark and for distributing local collections of data

rdd5 = sc.parallelize([1, 2, 3, 4, 5])

rdd6 = sc.parallelize(["cat", "dog", "mouse"])

mydata = "lets try this"

rdd7 = sc.parallelize([mydata])

Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Working with RDDs and Lazy Evaluation
• RDDs have two types of operations
– Transformations: the RDD is transformed into a new RDD
– Actions: an action is performed on the RDD and a result is
returned to the driver, or data is saved somewhere
• Transformations are lazy: they do not compute until an
action is performed
Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
What does “Lazy Execution” mean?
file = sc.textFile("hdfs://some-text-file")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
# The DAG of transformations is built by Spark on the driver side

counts.saveAsTextFile("hdfs://wordcount-out")
# The action triggers execution of the whole DAG

Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark Uses Functional Programming
• Programs are built from functions instead of objects
• Mutation is forbidden – all variables are final
• Functional purity – passing the same A into a function
always gives back the same B
• Functions have input and output only – no state or side
effects
• Functions can be passed as input to other functions
• Anonymous functions – unnamed functions passed inline
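
For example, in pyspark either a named function or an anonymous lambda can be passed to a transformation; this minimal sketch is illustrative and not from the course labs:

def add_one(x):
    # a pure function: the output depends only on the input, no side effects
    return x + 1

rdd = sc.parallelize([1, 2, 3])
rdd.map(add_one).collect()            # pass a named function: [2, 3, 4]
rdd.map(lambda x: x + 1).collect()    # pass an anonymous (lambda) function: [2, 3, 4]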

Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Actions – count()
• The count() action returns the number of elements in the
RDD

data = [5, 12, -4, 7, 20]


rdd = sc.parallelize(data)
rdd.count()

The output is: 5

Page 25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Actions – reduce()
• The reduce() action has many use cases in Spark
– It aggregates the elements of an RDD using a supplied function
– That function must be commutative and associative
• a+b = b+a and a+(b+c) = (a+b)+c

Dataset: [5, 12, -4, 7, 20]

rdd.reduce(lambda a, b: a + b)
40

rdd.reduce(lambda a, b: a if (a > b) else b)
20

Page 26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Other Useful Spark Actions
• first(): return the first element in the RDD
• take(n): return the first n elements of the RDD
• collect(): return all the elements in the RDD to the driver
– Make sure you only call this on small datasets or risk crashing
your driver!
• saveAsTextFile(path): write the RDD to a file

Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark Actions: Examples
Dataset: [5, 12, -4, 7, 20]

rdd.first(): 5

rdd.take(3): [5, 12, -4]

rdd.saveAsTextFile("myfile")

Page 29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Spark Transformations
• Spark transformations create new RDDs from existing
ones
• Transformations are lazy: processing doesn't occur
until an action is called on the RDD, or on a subsequent RDD
– Transformations build a recipe, or lineage, for processing
– Actions trigger data to flow through the transformations and
produce the result
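
One way to see the recipe that transformations build up is RDD.toDebugString(), which prints the lineage without triggering any computation (the exact output format varies by Spark version):

rdd = sc.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)
print(evens.toDebugString())    # shows the chain of parent RDDs; nothing computed yet
evens.collect()                 # the action runs the whole lineage: [4, 8]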

Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Transformations: map()
• map() applies a function to each element of the RDD
(one input element produces one output element)

rdd = sc.parallelize([1, 2, 3, 4, 5])

rdd.map(lambda x: x*2 + 1).collect()

[3, 5, 7, 9, 11]

Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Transformations: flatMap()
• flatMap() applies a function that returns a collection for each
element of the RDD, then flattens the results (one input element
can produce many output elements)

rdd = sc.parallelize([1, 2, 3, 4, 5])

rdd.map(lambda x: [x, x*2]).collect()

[[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]

rdd.flatMap(lambda x: [x, x*2]).collect()

[1, 2, 2, 4, 3, 6, 4, 8, 5, 10]

Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Transformation: filter()
• filter() keeps only the elements that satisfy a predicate

rdd = sc.parallelize([1, 2, 3, 4, 5])

rdd.filter(lambda x: x % 2 == 0).collect()

[2, 4]

rdd.filter(lambda x: x < 3).collect()

[1, 2]

Page 33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Key Value Pair Intro (Pair RDDs)
• A key/value RDD is an RDD whose elements each consist of a
pair of values – a key and a value

• Pair RDDs are very useful for many applications

– They allow operations to be grouped by key (a join() example
follows below)
– Examples
• join()
• groupByKey()
• reduceByKey()
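
As a small illustration of a key-based operation not otherwise shown in this lesson, a join() sketch (the datasets here are made up, and the result order may vary):

prices = sc.parallelize([("apple", 1.50), ("banana", 0.75)])
stock = sc.parallelize([("apple", 10), ("banana", 0), ("cherry", 25)])
prices.join(stock).collect()
# [('apple', (1.5, 10)), ('banana', (0.75, 0))]   <- only keys present in both RDDs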

Page 34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Creating Pair RDDs
• Pair RDDs are often created from regular RDDs by using
the map() or flatMap() transformation:

wordlist = 'this is my list and it is a nice list'

rdd1 = sc.parallelize([wordlist])
kv_rdd = rdd1.flatMap(lambda x: x.split(' ')) \
             .map(lambda x: (x, 1))
kv_rdd.collect()
[('this', 1), ('is', 1), ('my', 1), ('list', 1), ('and', 1), … ('list', 1)]

Page 35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Pair RDD Transformation: reduceByKey()
• reduceByKey() performs a reduce function on all
elements of a key/value pair RDD that share a key
– The function still must be commutative and associative
• a+b = b+a and a+(b+c) = (a+b)+c

kv_rdd.reduceByKey(lambda a, b: a + b).collect()

[('this', 1), ('my', 1), ('and', 1), ('list', 2), ('a', 1), ('it', 1),
('is', 2), ('nice', 1)]

Page 36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Keys & Values Can Contain Rich Tuples
>>> notSimplePair = sc.parallelize(['I do not like green eggs and ham I do not like them Sam I am']) \
...     .flatMap(lambda sent: sent.split(' ')) \
...     .map(lambda word: ((word, 'bogus'), ('notCount', 1)))
>>> notSimplePair.sortByKey(ascending=False).take(3)
[(('them', 'bogus'), ('notCount', 1)), (('not', 'bogus'), ('notCount', 1)),
 (('not', 'bogus'), ('notCount', 1))]
>>>
>>> notSimplePair.reduceByKey(lambda oneVal, anotherVal: ('noise', oneVal[1] + anotherVal[1])) \
...     .sortByKey(ascending=False).collect()
[(('them', 'bogus'), ('notCount', 1)), (('not', 'bogus'), ('noise', 2)),
 (('like', 'bogus'), ('noise', 2)), (('ham', 'bogus'), ('notCount', 1)),
 (('green', 'bogus'), ('notCount', 1)), (('eggs', 'bogus'), ('notCount', 1)),
 (('do', 'bogus'), ('noise', 2)), (('and', 'bogus'), ('notCount', 1)),
 (('am', 'bogus'), ('notCount', 1)), (('Sam', 'bogus'), ('notCount', 1)),
 (('I', 'bogus'), ('noise', 3))]
Page 37 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Tips for Navigating Within pyspark
• Take advantage of command history with the up-arrow key, and
add operations one at a time, leveraging take() (see the example
below)

• Use dir() to get a list of current variables
– As with Pig's aliases command, there will be additional
system-oriented variable names present

• Use sc.setLogLevel('WARN') to limit extra "noise"
– This loses some visibility into helpful INFO messages at times
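
For instance, a pipeline can be built up one step at a time, checking intermediate results with take() before adding the next operation (the path and data are illustrative):

lines = sc.textFile("mydata/data.txt")                 # illustrative path
lines.take(3)                                          # peek at the raw lines first
words = lines.flatMap(lambda line: line.split(" "))
words.take(3)                                          # confirm the split looks right
pairs = words.map(lambda w: (w, 1))
pairs.take(3)                                          # then move on to reduceByKey(), etc.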
Page 38 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Lesson Review
1. What are the three ways we can create an RDD?
2. What are the two types of operations we can perform on an RDD?
1. Give an example of each
3. What is functional programming?
4. What is Lazy Execution?
5. What does the R stand for in RDD? What does that mean?

Page 39 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Conclusion and Key Points
• There are two types of operations
– Transformations, which return a new RDD
– Actions, which return a result
• Spark uses functional programming to process data
• Spark is lazy: it only does work when it has to
• RDDs are "in your mind"
– They're just a set of directions for transforming data; the data is
never stored in the RDD

Page 40 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Lab: Getting Started with Apache Spark

Page 41 © Hortonworks Inc. 2011 – 2014. All Rights Reserved


Page 42 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
