Scala PDF
Functional Programming
Functional operations create new data structures; they do not modify
existing ones (see the sketch after this list)
After an operation, the original data still exists in unmodified form
The program design implicitly captures data flows
The order of the operations is not significant
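A minimal sketch of these points (the list nums is made up for illustration): every operation returns a new collection, and the original stays intact.
val nums = List(3, 1, 4, 1, 5)
val doubled = nums.map(_ * 2)            // new list: List(6, 2, 8, 2, 10)
val evens = doubled.filter(_ % 2 == 0)   // another new list
println(nums)    // List(3, 1, 4, 1, 5) -- the original is unchanged
println(evens)   // List(6, 2, 8, 2, 10)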
About Scala
Scala is a statically typed language
Support for generics:
All the variables and functions have types that are defined at compile time
The compiler will find many unintended programming errors
The compiler will try to infer the type, e.g. val x = 2 is implicitly of type Int (see the sketch below)
Use an IDE for complex types: http://scala-ide.org or IDEA with the Scala plugin
Everything is an object
Functions defined using the def keyword
Support for laziness: values can be declared lazy, so objects are created only when they are actually needed
Online Scala coding: http://www.simplyscala.com
A Scala Tutorial for Java Programmers
http://www.scala-lang.org/docu/files/ScalaTutorial.pdf
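A short sketch of the features above (type inference, functions with def, and lazy values); the names are made up for illustration.
val x = 2                  // inferred as Int
val msg = "hello"          // inferred as String
def add(a: Int, b: Int): Int = a + b              // functions are defined with def
lazy val expensive = { println("computed"); 42 }  // evaluated only on first use
println(add(x, 3))         // 5
println(expensive)         // prints "computed", then 42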
Scala Notation
_ is the wildcard / placeholder, e.g. a default value or an ignored parameter
=> is used to separate a match case (or the parameters of a function literal) from the expression to be evaluated
The anonymous function (x,y) => x+y can be replaced by _+_
The v => v.method can be replaced by _.method
-> constructs a two-element tuple (pair), e.g. 2 -> 3
Iteration with for:
for (i <- 0 until 10) { // with 0 to 10, 10 is included
println(s"Item: $i")
}
Examples (collected into a runnable snippet after this list):
import scala.collection.immutable._
lsts.filter(v=>v.length>2) is the same as lsts.filter(_.length>2)
(2, 3) is equal to 2 -> 3
2 -> (3 -> 4) == (2,(3,4))
2 -> 3 -> 4 == ((2,3),4)
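Putting the notation above together in a small, self-contained snippet (the list lsts is made up for illustration):
import scala.collection.immutable._

val lsts = List(List(1, 2, 3), List(4), List(5, 6, 7, 8))
println(lsts.filter(_.length > 2))          // same as lsts.filter(v => v.length > 2)
println(List(1, 2, 3).reduce(_ + _))        // same as reduce((x, y) => x + y)
println((2, 3) == (2 -> 3))                 // true
for (i <- 0 until 3) println(s"Item: $i")   // Item: 0, Item: 1, Item: 2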
Scala Examples
map: lsts.map(x => x * 4)
Creates a new list by applying the given function to each element of the input list.
flatMap: lsts.flatMap(_.toList) uses the given function to create a new list, then places the resulting list
elements at the top level of the collection
lsts.sortWith(_<_): sorts in ascending order
fold and reduce functions combine adjacent list elements using a function. Processes the list starting
from left or right:
lst.foldLeft(0)(_+_) starts from 0 and adds the list values to it iteratively starting from left
tuples: a fixed-size group of values enclosed in parentheses, e.g. (2, z, 3); access elements by position with ._1, ._2, ..., e.g. (2, z, 3)._2
Notice above: single-statement functions do not need curly braces { }
Arrays are indexed with ( ), not [ ]. [ ] is used for type parameters (like Java's < >)
REMEMBER: these do not modify the collection, but create a new one
(you need to assign the return value), for example:
val sorted = lsts.sortWith(_ < _)
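The snippet below collects the operations above into a runnable example (the nested list lsts is made up for illustration):
val lsts = List(List(3, 1, 2), List(7), List(6, 5, 4, 8))

println(lsts.map(_.length))         // List(3, 1, 4)
println(lsts.flatMap(_.toList))     // List(3, 1, 2, 7, 6, 5, 4, 8)

val nums = lsts.flatMap(_.toList)
println(nums.sortWith(_ < _))       // ascending order
println(nums.foldLeft(0)(_ + _))    // 36

val t = (2, "z", 3)
println(t._2)                       // z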
Implicit parallelism
The map function has implicit parallelism as we saw before
This is because the function is applied to each element independently of
the others, so the applications do not depend on each other
We can parallelize or reorder the execution (see the sketch below)
MapReduce and Spark build on this parallelism
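As a small illustration with Scala's parallel collections (a sketch only; Spark's RDD API is separate but builds on the same idea):
val nums = (1 to 10000).toList

val seqResult = nums.map(_ * 2)       // sequential map
val parResult = nums.par.map(_ * 2)   // parallel map over the same data; execution order is not fixed

println(seqResult.sum == parResult.sum)   // true: the result does not depend on the order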
Apache Spark
Spark is a general-purpose computing framework for iterative tasks
API is provided for Java, Scala and Python
The model is based on MapReduce enhanced with new operations
and an engine that supports execution graphs
Tools include Spark SQL, MLlib for machine learning, GraphX for
graph processing and Spark Streaming
Obtaining Spark
Spark can be obtained from the spark.apache.org site
Spark packages are available for many different HDFS versions
Spark runs on Windows and UNIX-like systems such as Linux and MacOS
The easiest setup is local, but the real power of the system comes from
distributed operation
Spark runs on Java 6+, Python 2.6+, Scala 2.10+
The newest version works best with Java 7+ and Scala 2.10.4
Installing Spark
We use Spark 1.2.1 or newer in this course
For local installation:
Download http://is.gd/spark121
Extract it to a folder of your choice and run bin/spark-shell in a terminal
(or double click bin/spark-shell.cmd on Windows)
For the IDE, take the assembly jar from spark-1.2.1/assembly/target/scala-2.10 OR
spark-1.2.1/lib
You need to have
Java 6+
For pySpark: Python 2.6+
First examples
# Running the shell with your own classes, a given amount of memory, and
# the local computer with two threads as slaves
./bin/spark-shell --driver-memory 1G \
  --jars your-project-jar-here.jar \
  --master "local[2]"
SparkContext sc
A Spark program creates a SparkContext object, which is available as the sc
variable in the Scala and Python shells
Outside the shell, a constructor is used to instantiate a SparkContext:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("Hello").setMaster("local[2]")
val sc = new SparkContext(conf)
Spark overview
[Architecture diagram: the Driver Program's SparkContext connects through a Cluster Manager to Worker Nodes; each Worker Node runs an Executor holding Tasks and a Cache, backed by Distributed Storage.]
WordCounting
/* When giving Spark file paths, those files need to be accessible
with the same path from all slaves */
val file = sc.textFile("README.md")
val wc = file.flatMap(l => l.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
wc.saveAsTextFile("wc_out.txt")
wc.collect.foreach(println)
Join
val f1 = sc.textFile("README.md")
val sparks = f1.filter(_.startsWith("Spark"))
val wc1 = sparks.flatMap(l => l.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

val f2 = sc.textFile("CHANGES.txt")
val sparks2 = f2.filter(_.startsWith("Spark"))
val wc2 = sparks2.flatMap(l => l.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wc1.join(wc2).collect.foreach(println)
Transformations
Create a new dataset from an existing dataset
All transformations are lazy and computed when the results are needed
Transformation history is retained in RDDs
calculations can be optimized
data can be recovered
Some operations can be given the number of tasks as a parameter. This can be very important
for performance: Spark and Hadoop prefer larger files and a smaller number of tasks when the
data is small. However, the number of tasks should always be at least the number of CPU cores
in the computer / cluster running Spark (see the sketch below).
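A small sketch of lazy transformations and of passing an explicit task count (the file name and the count of 4 are arbitrary choices for illustration):
// Nothing is computed yet: transformations only record the lineage
val lines = sc.textFile("README.md")
val pairs = lines.flatMap(_.split(" ")).map(w => (w, 1))

// reduceByKey can be given an explicit number of tasks (here 4)
val counts = pairs.reduceByKey(_ + _, 4)

// The whole chain is executed only when an action such as count() is called
println(counts.count())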
Transformation: Description
map(func): returns a new dataset formed by passing each element of the source through the function func
filter(func): returns a new dataset formed by selecting the elements on which func returns true
flatMap(func): similar to map, but each input item can be mapped to zero or more output items (func returns a sequence)
mapPartitions(func): similar to map, but runs separately on each partition (block) of the RDD
Transformations II/IV
Transformation: Description
sample(withReplacement, fraction, seed): samples a fraction of the data, with or without replacement, using the given random seed
union(other): returns a new dataset containing the union of the elements of the source dataset and the argument
intersection(other): returns a new dataset containing the elements common to the source dataset and the argument
distinct([numTasks]): returns a new dataset containing the distinct elements of the source dataset
Transformations III/IV
Transformation: Description
groupByKey([numTasks]): on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs
reduceByKey(func, [numTasks]): on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using func
join(otherDataset, [numTasks]): on datasets of (K, V) and (K, W) pairs, returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key
cogroup(otherDataset, [numTasks]): on datasets of (K, V) and (K, W) pairs, returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples
cartesian(otherDataset): on datasets of types T and U, returns a dataset of all (T, U) pairs
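A small sketch of the key-value transformations above, with made-up data:
val sales = sc.parallelize(Seq(("fi", 3), ("se", 5), ("fi", 2)))
val names = sc.parallelize(Seq(("fi", "Finland"), ("se", "Sweden")))

sales.reduceByKey(_ + _).collect.foreach(println)   // (fi,5), (se,5)
sales.groupByKey().collect.foreach(println)         // fi with values 3 and 2, se with value 5
sales.join(names).collect.foreach(println)          // (fi,(3,Finland)), (fi,(2,Finland)), (se,(5,Sweden))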
Spark Transformations IV
Transformation: Description
pipe(command, [envVars]): pipes each partition of the RDD through a shell command
coalesce(numPartitions): decreases the number of partitions in the RDD
repartition(numPartitions): reshuffles the data randomly to create either more or fewer partitions
repartitionAndSortWithinPartitions(partitioner): repartitions the RDD according to the given partitioner and sorts records by key within each partition
Spark Actions
Action: Description
reduce(func): aggregates the elements of the dataset using func (which must be commutative and associative)
collect(): returns all the elements of the dataset as an array to the driver program
count(): returns the number of elements in the dataset
first(): returns the first element of the dataset
take(n): returns an array with the first n elements of the dataset
takeSample(withReplacement, num, [seed]): returns an array with a random sample of num elements
takeOrdered(n, [ordering]): returns the first n elements of the RDD using either their natural order or a custom comparator
Spark Actions II
Action: Description
saveAsTextFile(path): writes the elements of the dataset as a text file (or set of text files) in the given directory
saveAsSequenceFile(path): writes the elements of the dataset as a Hadoop SequenceFile
saveAsObjectFile(path): writes the elements of the dataset using Java serialization
countByKey(): on a dataset of (K, V) pairs, returns a map of (K, count) pairs with the count of each key
foreach(func): runs func on each element of the dataset
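A short sketch of some of the actions above on a small, made-up RDD:
val nums = sc.parallelize(1 to 10)

println(nums.reduce(_ + _))     // 55
println(nums.count())           // 10
println(nums.first())           // 1
println(nums.take(3).toList)    // List(1, 2, 3)
nums.map(n => (n % 2, n)).countByKey().foreach(println)   // (1,5), (0,5)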
Spark API
https://spark.apache.org/docs/1.2.1/api/scala/index.html
For Python
https://spark.apache.org/docs/latest/api/python/
Spark Programming Guide:
https://spark.apache.org/docs/1.2.1/programming-guide.html
Check which version's documentation (Stack Overflow, blogs, etc.) you
are looking at; the API had big changes after version 1.0.0.
More information
These slides: http://is.gd/bigdatascala
Intro to Apache Spark: http://databricks.com
A project that can be used as a starting point (if using Maven):
https://github.com/Kauhsa/spark-code-camp-example-project
This is for Spark 1.0.2, so change the version in pom.xml.