Day1 Main
http://databricks.com/
download slides:
training.databricks.com/workshop/itas_workshop.pdf
Introduction
Installation
• Homebrew on MacOSX
• Cygwin on Windows
Step 1: Install Java JDK 6/7 on MacOSX or Windows
oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
create an RDD in the Spark shell to verify the install:
val distData = sc.parallelize(data)
for reference, see the REPL transcript:
gist.github.com/ceteri/f2c3486062c9610eac1d#file-01-repl-txt
Installation: Optional Downloads: Python
Spark Deconstructed
lecture: 20 min
Spark Deconstructed:
(diagram: a Driver program coordinating three Workers)
Spark Deconstructed: Log Mining Example
// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
Spark Deconstructed: Log Mining Example
scala> messages.toDebugString
res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
  MappedRDD[3] at map at <console>:16 (3 partitions)
    FilteredRDD[2] at filter at <console>:14 (3 partitions)
      MappedRDD[1] at textFile at <console>:12 (3 partitions)
        HadoopRDD[0] at textFile at <console>:12 (3 partitions)
Spark Deconstructed: Log Mining Example
(diagram sequence, repeated over the same code: the Driver submits the job to three Workers; each Worker reads its HDFS block (block 1, 2, 3); each Worker processes its block and caches the messages partition (cache 1, 2, 3); action 1 is computed from those caches, and action 2 is then processed straight from cache)
Spark Deconstructed:
// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
Spark Deconstructed:
(diagram: a single RDD)
// base RDD
val lines = sc.textFile("hdfs://...")
Spark Deconstructed:
(diagram: RDDs chained together by transformations)
// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()
Spark Deconstructed:
(diagram: RDDs chained by transformations, then an action producing a value)
// action 1
messages.filter(_.contains("mysql")).count()
04: Getting Started
lab: 20 min
Simple Spark Apps: WordCount
Definition:
count how often each word appears in a collection of text documents

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

A distributed computing framework that can run WordCount
efficiently in parallel at scale can likely handle much larger
and more interesting compute problems
Simple Spark Apps: WordCount
Scala:
val f = sc.textFile("README.md")!
val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)!
wc.saveAsTextFile("wc_out")
Python:
from operator import add!
f = sc.textFile("README.md")!
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)!
wc.saveAsTextFile("wc_out")
Simple Spark Apps: WordCount
Checkpoint: how many "Spark" keywords? (one way to check is sketched below)
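One way to check the answer from the Scala shell (a sketch; wc is the pair RDD built above, and lookup returns the counts recorded for that key):

val f = sc.textFile("README.md")

// count occurrences of the literal token "Spark"
f.flatMap(_.split(" ")).filter(_ == "Spark").count()

// or read it off the WordCount result computed above
wc.lookup("Spark")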
Simple Spark Apps: Code + Data
(diagram: in stage 1, RDDs A and B are map()ped into C and D; in stage 2, join() combines them into E, whose partition is cached)
scala> reg.join(clk).toDebugString
res5: String =
FlatMappedValuesRDD[46] at join at <console>:23 (1 partitions)
  MappedValuesRDD[45] at join at <console>:23 (1 partitions)
    CoGroupedRDD[44] at join at <console>:23 (1 partitions)
      MappedRDD[36] at map at <console>:16 (1 partitions)
        MappedRDD[35] at map at <console>:16 (1 partitions)
          MappedRDD[34] at textFile at <console>:16 (1 partitions)
            HadoopRDD[33] at textFile at <console>:16 (1 partitions)
      MappedRDD[40] at map at <console>:16 (1 partitions)
        MappedRDD[39] at map at <console>:16 (1 partitions)
          MappedRDD[38] at textFile at <console>:16 (1 partitions)
            HadoopRDD[37] at textFile at <console>:16 (1 partitions)
Using the text files in the Spark directory:
1. create RDDs to filter each line for the keyword "Spark"
2. perform a WordCount on each, i.e., so the results are (K, V) pairs of (word, count)
3. join the two RDDs (one possible solution is sketched below)
Checkpoint: how many "Spark" keywords?
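A sketch of one possible solution in Scala (the file names README.md and CHANGES.txt are assumptions; any two text files in the Spark directory will do):

// 1. create RDDs that filter each file for the keyword "Spark"
val a = sc.textFile("README.md").filter(_.contains("Spark"))
val b = sc.textFile("CHANGES.txt").filter(_.contains("Spark"))

// 2. WordCount each filtered RDD into (word, count) pairs
val wcA = a.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
val wcB = b.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// 3. join the two RDDs on the word key
wcA.join(wcB).take(10)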
05: Getting Started
A Brief History
lecture: 35 min
A Brief History:
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2010: Spark paper
A Brief History: MapReduce
Open Discussion:
Enumerate several changes in data center
technologies since 2002…
A Brief History: MapReduce
pistoncloud.com/2013/04/storage-and-the-mobility-gap/
meanwhile, spinny
disks haven’t changed
all that much…
storagenewsletter.com/rubriques/hard-disk-drives/hdd-technology-trends-ibm/
A Brief History: MapReduce
(diagram: specialized systems built around MapReduce – Pregel, Giraph, Impala, GraphLab, Storm, S4)
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2010: Spark paper
break: 15 min
03: Intro Spark Apps
Spark Essentials
lecture/lab: 45 min
Spark Essentials:
launch the interactive Scala or Python shell using, respectively:
./bin/spark-shell
./bin/pyspark
Scala:
scala> sc
res: spark.SparkContext = spark.SparkContext@470d1f30
Python:
>>> sc
<pyspark.context.SparkContext object at 0x7f7570783350>
Spark Essentials: Master
local – run Spark locally with one worker thread (no parallelism)
local[K] – run Spark locally with K worker threads (ideally set to # cores)
spark://HOST:PORT – connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT – connect to a Mesos cluster; PORT depends on config (5050 by default)
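The master setting can also be passed programmatically when building the SparkContext; a minimal sketch (the app name and the "local[4]" value are arbitrary examples – in the shells the master is normally supplied on the command line instead):

import org.apache.spark.{SparkConf, SparkContext}

// "local[4]" runs four worker threads on this machine;
// a URL such as spark://HOST:7077 would point at a standalone cluster instead
val conf = new SparkConf().setAppName("MasterExample").setMaster("local[4]")
val sc = new SparkContext(conf)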
Spark Essentials: Master
spark.apache.org/docs/latest/cluster-overview.html
Spark Essentials: Clusters
(diagram: Worker Nodes, each running an Executor that holds a cache and executes tasks)
Spark Essentials: RDD
Scala:
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
Python:
>>> data = [1, 2, 3, 4, 5]
>>> data
[1, 2, 3, 4, 5]

>>> distData = sc.parallelize(data)
>>> distData
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229
Spark Essentials: RDD
(diagram: RDDs flow through transformations; an action produces a value)
Spark Essentials: RDD
Scala:
scala> val distFile = sc.textFile("README.md")!
distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
Python:
>>> distFile = sc.textFile("README.md")!
14/04/19 23:42:40 INFO storage.MemoryStore: ensureFreeSpace(36827) called
with curMem=0, maxMem=318111744!
14/04/19 23:42:40 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 36.0 KB, free 303.3 MB)!
>>> distFile!
MappedRDD[2] at textFile at NativeMethodAccessorImpl.java:-2
Spark Essentials: Transformations
map(func) – return a new distributed dataset formed by passing each element of the source through a function func
groupByKey([numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs
Scala:
val distFile = sc.textFile("README.md")!
distFile.map(l => l.split(" ")).collect()!
distFile is a collection of lines
distFile.flatMap(l => l.split(" ")).collect()
Python:
distFile = sc.textFile("README.md")!
distFile.map(lambda x: x.split(' ')).collect()!
distFile.flatMap(lambda x: x.split(' ')).collect()
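To see the difference between the two, a small sketch (the two-line dataset is made up): map keeps one array of words per line, while flatMap flattens everything into a single collection of words.

val lines = sc.parallelize(Seq("Apache Spark", "fast and general"))

lines.map(l => l.split(" ")).collect()
// => Array(Array(Apache, Spark), Array(fast, and, general))

lines.flatMap(l => l.split(" ")).collect()
// => Array(Apache, Spark, fast, and, general)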
Spark Essentials: Transformations
closures: the functions passed to map and flatMap above are closures, serialized and shipped out to the workers for execution
(diagram: RDDs flow through transformations; an action produces a value)
Spark Essentials: Transformations
Java 7:
JavaRDD<String> distFile = sc.textFile("README.md");!
!
// Map each line to multiple words!
JavaRDD<String> words = distFile.flatMap(!
new FlatMapFunction<String, String>() {!
public Iterable<String> call(String line) {!
return Arrays.asList(line.split(" "));!
}!
});
Java 8:
JavaRDD<String> distFile = sc.textFile("README.md");!
JavaRDD<String> words =!
distFile.flatMap(line -> Arrays.asList(line.split(" ")));
Spark Essentials: Actions
reduce(func) – aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should be commutative and associative so that it can be computed correctly in parallel
collect() – return all the elements of the dataset as an array at the driver program – usually useful after a filter or other operation that returns a sufficiently small subset of the data
count() – return the number of elements in the dataset
first() – return the first element of the dataset – similar to take(1)
take(n) – return an array with the first n elements of the dataset – currently not executed in parallel; instead the driver program computes all the elements
saveAsTextFile(path) – write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system; Spark will call toString on each element to convert it to a line of text in the file
saveAsSequenceFile(path) – write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system; only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.)
countByKey() – only available on RDDs of type (K, V); returns a Map of (K, Int) pairs with the count of each key
foreach(func) – run a function func on each element of the dataset – usually done for side effects such as updating an accumulator variable or interacting with external storage systems
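A few of these actions exercised together in the Scala shell (a sketch with toy data):

val nums = sc.parallelize(Seq(3, 1, 4, 1, 5, 9))

nums.reduce(_ + _)    // => 23
nums.count()          // => 6
nums.first()          // => 3
nums.take(3)          // => Array(3, 1, 4)

// countByKey() works on pair RDDs
nums.map(n => (n % 2, n)).countByKey()   // => Map(1 -> 5, 0 -> 1)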
Spark Essentials: Actions
Scala:
val f = sc.textFile("README.md")!
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))!
words.reduceByKey(_ + _).collect.foreach(println)
Python:
from operator import add!
f = sc.textFile("README.md")!
words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))!
words.reduceByKey(add).collect()
Spark Essentials: Persistence
MEMORY_ONLY – store RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed (this is the default level)
MEMORY_AND_DISK – store RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed
MEMORY_ONLY_SER – store RDD as serialized Java objects (one byte array per partition); generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read
MEMORY_AND_DISK_SER – similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed
DISK_ONLY – store the RDD partitions only on disk
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc – same as the levels above, but replicate each partition on two cluster nodes
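cache() is shorthand for persist with the default MEMORY_ONLY level; to pick another level explicitly, a sketch:

import org.apache.spark.storage.StorageLevel

val f = sc.textFile("README.md")
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1))

w.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk rather than recompute
w.reduceByKey(_ + _).count()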
Spark Essentials: Persistence
Scala:
val f = sc.textFile("README.md")!
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()!
w.reduceByKey(_ + _).collect.foreach(println)
Python:
from operator import add!
f = sc.textFile("README.md")!
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()!
w.reduceByKey(add).collect()
Spark Essentials: Broadcast Variables
Scala:
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
Python:
broadcastVar = sc.broadcast(list(range(1, 4)))
broadcastVar.value
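Broadcast variables are typically read inside tasks; a sketch with a hypothetical lookup table shows the pattern:

// ship a small lookup table to every node once, rather than with every task
val codes = sc.broadcast(Map(1 -> "mysql", 2 -> "php"))

val ids = sc.parallelize(Seq(1, 2, 1))
ids.map(i => codes.value.getOrElse(i, "unknown")).collect()
// => Array(mysql, php, mysql)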
Spark Essentials: Accumulators
Scala:
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

accum.value
Python:
accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])
def f(x):
    global accum
    accum += x

rdd.foreach(f)

accum.value
Spark Essentials: Accumulators
driver-side: only the driver program can read an accumulator's value, via accum.value; tasks running on the workers can only add to it
Spark Essentials: (K,V) pairs
Scala:
val pair = (a, b)

pair._1 // => a
pair._2 // => b
Python:
pair = (a, b)

pair[0] # => a
pair[1] # => b
Java:
Tuple2 pair = new Tuple2(a, b);

pair._1 // => a
pair._2 // => b
Spark Essentials: API Details
Spark Examples
lecture/lab: 10 min
Spark Examples: Estimate Pi
wikipedia.org/wiki/Monte_Carlo_method
Spark Examples: Estimate Pi
import scala.math.random
import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)

    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices

    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)

    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
Spark Examples: Estimate Pi
(diagram: RDDs flow through transformations; an action produces a value)
Spark Examples: Estimate Pi
(annotated: parallelize creates the base RDD, the .map block is the transformed RDD, and .reduce is the action that returns a value)
Checkpoint: what estimate do you get for Pi?
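To answer the checkpoint without packaging the full SparkPi app, roughly the same computation can be pasted into the Spark shell (a sketch; n and the partition count are arbitrary, and the result varies from run to run):

import scala.math.random

val n = 1000000
val count = sc.parallelize(1 to n, 2).map { _ =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)

println("Pi is roughly " + 4.0 * count / n)   // typically close to 3.14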
Spark Examples: K-Means
lecture/demo: 45 min
Data Workflows:
(diagram: Spark and Tachyon as a unified platform alongside specialized systems – MapReduce, Impala, GraphLab, Storm, S4)
Data Workflows: Spark SQL
// code elided on this slide; see the sketch below
// Define the schema using a case class.
// Create an RDD of Person objects and register it as a table (people.txt).
// SQL statements can be run by using the sql methods provided by sqlContext.
// The results of SQL queries are SchemaRDDs and support all the
// normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.

Checkpoint: what name do you get?
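The elided code follows the Spark SQL programming guide example of that era; a sketch under that assumption (Spark 1.x SQLContext and SchemaRDD API, with people.txt as shipped in examples/src/main/resources):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)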
Data Workflows: Spark Streaming
Comparisons:
• Twitter Storm
• Yahoo! S4
• Google MillWheel
Data Workflows: Spark Streaming
// http://spark.apache.org/docs/latest/streaming-programming-guide.html

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()             // start the computation
ssc.awaitTermination()  // wait for the computation to terminate
Data Workflows: MLlib
// http://spark.apache.org/docs/latest/mllib-guide.html

val train_data = // RDD of Vector
val model = KMeans.train(train_data, k=10)

// evaluate the model
val test_data = // RDD of Vector
test_data.map(t => model.predict(t)).collect().foreach(println)
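A self-contained variant that actually runs in the shell (a sketch with toy vectors; k and the iteration count are arbitrary):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// two obvious clusters of 2-D points
val trainData = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

val model = KMeans.train(trainData, 2, 20)   // k = 2, maxIterations = 20

// predict the cluster index for each training point
trainData.map(v => model.predict(v)).collect().foreach(println)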
demo:
Twitter Streaming Language Classifier
databricks.gitbooks.io/databricks-spark-reference-applications/twitter_classifier/README.html
Data Workflows: GraphX
GraphX amplab.github.io/graphx/
spark.apache.org/docs/latest/graphx-programming-guide.html
ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
Data Workflows: GraphX
// http://spark.apache.org/docs/latest/graphx-programming-guide.html

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val vertexArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)

for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
Data Workflows: GraphX
demo:
Simple Graph Query
gist.github.com/ceteri/c2a692b5161b23d92ed1
Data Workflows: GraphX
Introduction to GraphX
Joseph Gonzalez, Reynold Xin
youtu.be/mKEn9C5bRck
(break)
break: 15 min
05: Spark in Production
lecture/lab: 75 min
Spark in Production:
builds:
<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.1.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.2.0</version>
    </dependency>
  </dependencies>
</project>
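For reference, a minimal sbt equivalent of this Maven build (a sketch; the Scala version is an assumption, and the coordinates mirror the pom above):

// build.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.hadoop" % "hadoop-client" % "2.2.0"
)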
Spark in Production: Build: Java
builds:
builds:
spark.apache.org/docs/latest/running-on-yarn.html
# http://spark.apache.org/docs/latest/ec2-scripts.html
cd $SPARK_HOME/ec2

export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_KEY
./spark-ec2 -k spark -i ~/spark.pem -s 2 -z us-east-1b launch foo

# can review EC2 instances and their security groups to identify master
# ssh into master
./spark-ec2 -k spark -i ~/spark.pem -s 2 -z us-east-1b login foo

# use ./ephemeral-hdfs/bin/hadoop to access HDFS
/root/ephemeral-hdfs/bin/hadoop fs -mkdir /tmp
/root/ephemeral-hdfs/bin/hadoop fs -put CHANGES.txt /tmp

# now is the time when we Spark
cd /root/spark
export SPARK_HOME=$(pwd)

SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly

/root/ephemeral-hdfs/bin/hadoop fs -put CHANGES.txt /tmp
./bin/spark-shell
Spark in Production: Deploy: HDFS examples
review UI features
spark.apache.org/docs/latest/monitoring.html
http://<master>:8080/
http://<master>:50070/
Case Studies
discussion: 30 min
Summary: Spark has lots of activity!
• github.com/ooyala/spark-jobserver
Follow-Up
discussion: 20 min
certification:
spark.apache.org/community.html
video+slide archives: spark-summit.org
local events: Spark Meetups Worldwide
workshops: databricks.com/training
books:
Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O'Reilly (2015*)
shop.oreilly.com/product/0636920028512.do

Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do
Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/
events: Strata NY + Hadoop World
NYC, Oct 15-17
strataconf.com/stratany2014
Strata EU
Barcelona, Nov 19-21
strataconf.com/strataeu2014
Strata CA
San Jose, Feb 18-20
strataconf.com/strata2015