PySpark Meetup Talk
RDDs: immutable, partitioned collections of objects
Transformations: map, filter, groupBy, join, ...
Actions: count, collect, save, ...
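A quick illustration of the difference, as a minimal PySpark sketch (assuming an existing SparkContext sc): transformations are lazy and only describe a computation, while actions actually run it and return a result.

nums = sc.parallelize([1, 2, 3, 4])
evens = nums.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet
evens.count()                               # action: runs the job, returns 2
evens.collect()                             # action: returns [2, 4]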
Example: Log Mining (Scala)
messages.filter(_.contains("foo")).count()
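The same query in PySpark looks very similar (a rough sketch, assuming messages is an RDD of log lines; the "foo" substring is just illustrative):

messages.filter(lambda line: "foo" in line).count()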
What is PySpark?
PySpark at a Glance
sc = SparkContext(...)
lines = sc.textFile(sys.argv[2], 1)
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y)
[Architecture diagram: the PySpark and Java APIs sit on top of the Spark core, which runs in Local Mode or on Mesos, Standalone, and YARN clusters.]
Process data in Python and persist / transfer it in Java.
Re-uses Spark's scheduling, broadcast, checkpointing, networking, fault-recovery, and HDFS access.
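As a brief usage sketch (not from the talk), an RDD built from Python is cached and recovered by Spark's existing Java-side machinery:

lines = sc.textFile("hdfs://...")   # records live in the JVM as byte arrays
lines.cache()                       # persistence handled by the Spark core
lines.count()                       # first action materializes the cache
lines.count()                       # later actions reuse the cached data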
PySpark has a small codebase:

-------------------------------------------------------------------------------
File                                                       blank  comment  code
-------------------------------------------------------------------------------
python/pyspark/rdd.py                                        115      345   302
core/src/main/scala/spark/api/python/PythonRDD.scala          33       45   231
python/pyspark/context.py                                     32      101   133
python/pyspark/tests.py                                       26       11    84
python/pyspark/accumulators.py                                37       91    70
python/pyspark/serializers.py                                 21        7    55
python/pyspark/join.py                                        15       27    50
python/pyspark/worker.py                                       8        7    44
core/src/main/scala/spark/api/python/PythonPartitioner.scala   5        9    34
pyspark                                                        9        8    27
python/pyspark/java_gateway.py                                 5        7    26
python/pyspark/files.py                                        7       14    17
python/pyspark/broadcast.py                                    8       16    15
python/pyspark/shell.py                                        4        6     8
python/pyspark/__init__.py                                     6       14     7
-------------------------------------------------------------------------------
SUM:                                                         331      708  1103
-------------------------------------------------------------------------------
[Data flow diagram, built up over several slides: on the local machine, the Python driver's SparkContext talks to a JVM SparkContext through Py4J, with a local socket and the local filesystem used to move data between the two processes; on the cluster, the JVM SparkContext ships work to Spark Workers, and each Spark Worker pipes data to Python worker processes.]
Data is stored as pickled objects in an RDD[Array[Byte]].
Storing batches of Python objects in one Scala object reduces overhead.
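A minimal sketch of the batching idea (illustrative only; PySpark's real serialization code lives in python/pyspark/serializers.py and worker.py):

import pickle

def batched_dumps(items, batch_size=1024):
    # Pickle objects in groups so the JVM stores a few large byte arrays
    # instead of one small Array[Byte] element per Python object.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield pickle.dumps(batch, 2)
            batch = []
    if batch:
        yield pickle.dumps(batch, 2)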
When possible, RDD transformations are pipelined:

lines.flatMap(lambda x: x.split(' ')) \
     .map(lambda x: (x, 1))

[Diagram: a MappedRDD with func(x) = x.split(' ') followed by a MappedRDD with func(x) = (x, 1) are fused into a single MappedRDD that applies both functions in one pass.]
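A sketch of the idea behind pipelining (hypothetical helper names, not PySpark's actual PipelinedRDD code): instead of materializing the intermediate RDD, the two functions are composed and applied in a single pass over each partition.

def compose(first, second):
    # Chain two per-partition functions into one, avoiding an intermediate RDD.
    return lambda iterator: second(first(iterator))

split_words = lambda it: (word for line in it for word in line.split(' '))
to_pairs = lambda it: ((word, 1) for word in it)

process_partition = compose(split_words, to_pairs)
list(process_partition(iter(["hello world"])))   # [('hello', 1), ('world', 1)]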
Python functions and closures are serialized using PiCloud's CloudPickle module.
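Unlike the standard pickle module, CloudPickle can serialize lambdas and closures by value. A small illustration (using the standalone cloudpickle package as a stand-in for PiCloud's module):

import pickle
import cloudpickle   # assumption: the standalone cloudpickle package is installed

factor = 3
triple = lambda x: x * factor            # a closure over `factor`
payload = cloudpickle.dumps(triple)      # plain pickle.dumps(triple) would fail
restored = pickle.loads(payload)
restored(4)                              # -> 12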
Roadmap
Available in Spark 0.7
Thanks!
Bonus Slides
Pickle is a miniature stack language.

>>> x = ["Hello", "World!"]
>>> pickletools.dis(cPickle.dumps(x, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     1
    5: (    MARK
    6: U        SHORT_BINSTRING 'Hello'
   13: q        BINPUT     2
   15: U        SHORT_BINSTRING 'World!'
   23: q        BINPUT     3
   25: e        APPENDS    (MARK at 5)
   26: .    STOP
highest protocol among opcodes = 2
You can do crazy stuff, like converting a collection of pickled objects into a pickled collection.
https://gist.github.com/JoshRosen/3384191
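A hedged sketch of the trick (simplified relative to the gist; assumes protocol-2 pickles and ignores memo/shared-reference edge cases):

import pickle

def pickles_to_pickled_list(pickled_items):
    # Splice individually pickled objects into one pickle of a list,
    # without unpickling them first.
    parts = [b'\x80\x02',   # PROTO 2
             b']',          # EMPTY_LIST
             b'(']          # MARK
    for p in pickled_items:
        body = p[2:] if p.startswith(b'\x80\x02') else p          # drop PROTO
        parts.append(body[:-1] if body.endswith(b'.') else body)  # drop STOP
    parts.append(b'e')      # APPENDS everything since the MARK onto the list
    parts.append(b'.')      # STOP
    return b''.join(parts)

items = [pickle.dumps(i, 2) for i in range(5)]
pickle.loads(pickles_to_pickled_list(items))   # -> [0, 1, 2, 3, 4]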
Bulk depickling can be faster even if it involves Pickle opcode manipulation:

10000 integers:
  Bulk depickle (chunk size = 2):    0.266709804535
  Bulk depickle (chunk size = 10):   0.0797798633575
  Bulk depickle (chunk size = 100):  0.0388460159302
  Bulk depickle (chunk size = 1000): 0.0333180427551
  Individual depickle:               0.0540158748627

https://gist.github.com/JoshRosen/3401373