Scala and Spark Overview PDF

This document provides an overview of Scala, Apache Spark, and the big data ecosystem. It discusses how Scala was designed as a general purpose language that compiles to Java bytecode and can use Java libraries. It then explains key concepts in big data including Hadoop, MapReduce, and how Spark improves on MapReduce by keeping more data in memory and being faster. Resilient Distributed Datasets (RDDs) are introduced as the core of Spark, including transformations and actions. Finally, it briefly mentions Spark DataFrames.


Scala and Spark Overview

Scala and Spark

● In this lecture we will give an overview of the Scala programming language
● Then we will discuss the general Big Data Ecosystem
● Afterwards we will show how Apache Spark fits into all of this
Scala

● Scala is a general purpose programming language
● It was designed by Martin Odersky in the early 2000s at EPFL (École Polytechnique Fédérale de Lausanne)
● It was designed to overcome criticism of Java’s shortcomings
Scala

● Scala source code is intended to be compiled to Java bytecode to run on a Java Virtual Machine (JVM)
● Java libraries may be used directly in Scala
● Unlike Java, Scala has many features of functional programming
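
As a small illustration of these points, here is a minimal sketch (not from the slides, assuming Scala 2.12 or later): it calls a plain Java standard-library class directly and uses functional-style collection operations.

    // Calling a Java library class directly from Scala, plus functional-style collections
    import java.time.LocalDate   // a plain Java standard-library class

    object ScalaTaste extends App {
      val today: LocalDate = LocalDate.now()              // Java API, used as-is
      println(s"Today is $today")

      // Functional features: immutable values, first-class functions, filter/map
      val squaresOfEvens = (1 to 10).filter(_ % 2 == 0).map(n => n * n)
      println(squaresOfEvens)                             // Vector(4, 16, 36, 64, 100)
    }
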
Scala

● A large reason demand for Scala has risen dramatically in recent years is Apache Spark
● Let’s discuss what Spark is in the context of Big Data
● We’ll begin with a general explanation of what Big Data is and related technologies
Big Data Overview

● Explanation of Hadoop, MapReduce, and Spark
● Local versus Distributed Systems
● Overview of the Hadoop Ecosystem
● Overview of Spark
Big Data

● Data on the scale of 0-32 GB (depending on RAM) can fit on a local computer
● But what can we do if we have a larger set of data?
  ○ Try using a SQL database to move storage onto the hard drive instead of RAM
  ○ Or use a distributed system that distributes the data to multiple machines/computers
Local versus Distributed

[Diagram: a local system uses the cores of a single machine, while a distributed system spreads cores across several machines connected over a network]
Big Data

● A local process will use the computation resources of a single machine
● A distributed process has access to the computational resources across a number of machines connected through a network
● After a certain point, it is easier to scale out to many lower-CPU machines than to try to scale up to a single machine with a high-end CPU
Big Data

● Distributed machines also have the advantage of scaling easily: you can just add more machines
● They also provide fault tolerance: if one machine fails, the whole network can still carry on
● Let’s discuss the typical format of a distributed architecture that uses Hadoop
Hadoop

● Hadoop is a way to distribute very large files across multiple machines
● It uses the Hadoop Distributed File System (HDFS)
● HDFS allows a user to work with large data sets
● HDFS also duplicates blocks of data for fault tolerance
● It also then uses MapReduce
● MapReduce allows computations on that data
Distributed Storage - HDFS

[Diagram: a Name Node (CPU, RAM) coordinating several Data Nodes, each with its own CPU and RAM]
Distributed Storage - HDFS

● HDFS will use blocks of data, with a size of 128 MB by default
● Each of these blocks is replicated 3 times
● The blocks are distributed in a way to support fault tolerance
Distributed Storage - HDFS

● Smaller blocks provide more parallelization during processing
● Multiple copies of a block prevent loss of data due to the failure of a node
MapReduce

● MapReduce is a way of splitting a computation task across a distributed set of files (such as HDFS)
● It consists of a Job Tracker and multiple Task Trackers

[Diagram: a Job Tracker (CPU, RAM) coordinating several Task Trackers, each with its own CPU and RAM]
MapReduce

● The Job Tracker sends code to run on the Task Trackers
● The Task Trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes
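
Hadoop's own MapReduce API is Java and is not covered in these slides, but the map-then-reduce idea itself can be sketched with plain Scala collections. The snippet below is illustrative only (the sample lines are made up): a "map" step emits (word, 1) pairs, a shuffle groups them by key, and a "reduce" step sums each group; Hadoop would run these phases across many Task Trackers.

    // Illustrative word count using the map-then-reduce pattern on plain Scala collections
    val lines = Seq("spark is fast", "hadoop is distributed", "spark is popular")

    val counts = lines
      .flatMap(_.split("\\s+"))               // map phase: emit one word per token
      .map(word => (word, 1))                 // emit (key, value) pairs
      .groupBy(_._1)                          // shuffle: group the pairs by key
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }   // reduce phase: sum per key

    counts.foreach(println)                   // e.g. (spark,2), (is,3), ...
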
Big Data

● What we covered can be thought of in two distinct parts:
  ○ Using HDFS to distribute large data sets
  ○ Using MapReduce to distribute a computational task to a distributed data set
● Next we will learn about the latest technology in this space, known as Spark
● Spark improves on these concepts of distribution
Spark

● This lecture will be an abstract overview; we will discuss:
  ○ Spark
  ○ Spark vs MapReduce
  ○ Spark RDDs
  ○ Spark DataFrames
Spark

● Spark is one of the latest technologies being used to quickly and easily handle Big Data
● It is an open source Apache project
● It was first released in February 2013 and has exploded in popularity due to its ease of use and speed
● It was created at the AMPLab at UC Berkeley
Spark

● You can think of Spark as a flexible alternative to MapReduce
● Spark can use data stored in a variety of formats:
  ○ Cassandra
  ○ AWS S3
  ○ HDFS
  ○ And more
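
As a hedged sketch of this flexibility (the hostnames, bucket, and paths below are made up, and the relevant connectors are assumed to be configured): the same Spark call can point at different storage systems just by changing the URI scheme. In spark-shell, `sc` (the SparkContext) already exists.

    // Reading the same kind of data from different storage back-ends
    val fromHdfs  = sc.textFile("hdfs://namenode:9000/data/events.txt")   // HDFS
    val fromS3    = sc.textFile("s3a://my-example-bucket/events.txt")     // AWS S3
    val fromLocal = sc.textFile("file:///tmp/events.txt")                 // local file system

    println(fromLocal.count())   // an action; triggers the actual read
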
Spark vs MapReduce

● MapReduce requires files to be stored in HDFS, Spark does not!
● Spark can also perform operations up to 100x faster than MapReduce
● So how does it achieve this speed?
Spark vs MapReduce

● MapReduce writes most data to disk after each map and reduce operation
● Spark keeps most of the data in memory after each transformation
● Spark can spill over to disk if the memory is filled
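
A minimal spark-shell sketch of this in-memory behaviour (the log path is hypothetical): persisting an RDD keeps it in memory between actions, and MEMORY_AND_DISK lets it spill to disk if memory fills up.

    import org.apache.spark.storage.StorageLevel

    val logs   = sc.textFile("file:///tmp/server.log")      // hypothetical path
    val errors = logs.filter(_.contains("ERROR"))
    errors.persist(StorageLevel.MEMORY_AND_DISK)            // keep in memory, spill if needed

    println(errors.count())   // first action: reads the file and caches the result
    println(errors.count())   // second action: served from memory (or disk if spilled)
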
Spark RDDs

● At the core of Spark is the idea of a Resilient Distributed Dataset (RDD)
● A Resilient Distributed Dataset (RDD) has 4 main features:
  ○ Distributed collection of data
  ○ Fault-tolerant
  ○ Parallel operation - partitioned
  ○ Ability to use many data sources
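
A small spark-shell sketch of the "distributed, partitioned" idea (the numbers and partition count are arbitrary): a local collection is turned into an RDD split across partitions that can be processed in parallel.

    // Build an RDD from a local collection, split into 8 partitions
    val numbers = sc.parallelize(1 to 1000000, 8)

    println(numbers.getNumPartitions)   // 8
    println(numbers.sum())              // computed in parallel across the partitions
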
Spark RDDs

● RDDs are immutable, lazily evaluated, and cacheable
● There are two types of RDD operations:
  ○ Transformations
  ○ Actions
● Transformations are basically a recipe to follow
● Actions actually perform what the recipe says to do and return something back
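
A short spark-shell sketch of the recipe idea (sample values are made up): the transformations build up a plan but run nothing; the action at the end is what triggers the computation.

    val nums = sc.parallelize(1 to 10)

    val doubled = nums.map(_ * 2)           // transformation: nothing runs yet
    val bigOnes = doubled.filter(_ > 10)    // transformation: still just a recipe

    // Action: the recipe finally executes and results come back to the driver
    println(bigOnes.collect().mkString(", "))   // 12, 14, 16, 18, 20
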
Spark RDDs

● When discussing Spark syntax you will see RDD versus DataFrame syntax show up
● With the release of Spark 2.0, Spark is moving towards a DataFrame-based syntax; keep in mind that the way files are being distributed can still be thought of as RDDs, it is just the typed-out syntax that is changing
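
To make the syntax difference concrete, here is a hedged side-by-side sketch (the file path and the "name,age" columns are made up; assumes spark-shell on Spark 2.x or later, where `sc`, `spark`, and the `$"col"` syntax are already available):

    // RDD syntax: functions over raw text lines
    val peopleRdd = sc.textFile("file:///tmp/people.csv")
    val adultsRdd = peopleRdd
      .filter(line => !line.startsWith("name"))                  // skip the header row
      .filter(line => line.split(",")(1).trim.toInt >= 18)       // second column assumed to be age

    // DataFrame syntax: named columns and a schema
    val peopleDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file:///tmp/people.csv")
    val adultsDf = peopleDf.filter($"age" >= 18)
    adultsDf.show()
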
Spark RDDs

● We’ve covered a lot!
● Don’t worry if you didn’t memorize all these details; a lot of this will be covered again as we learn how to actually code out and utilize these ideas!
Spark RDDs

● Basic Actions
○ First
○ Collect
○ Count
○ Take
Spark RDDs

● Collect - Return all the elements of the RDD as an array at the driver program
● Count - Return the number of elements in the RDD
● First - Return the first element in the RDD
● Take - Return an array with the first n elements of the RDD
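
A quick spark-shell sketch of these actions (the sample numbers are arbitrary):

    val rdd = sc.parallelize(Seq(10, 20, 30, 40, 50))

    rdd.collect()   // Array(10, 20, 30, 40, 50) - everything back at the driver
    rdd.count()     // 5
    rdd.first()     // 10
    rdd.take(3)     // Array(10, 20, 30)
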
Spark RDDs

● Basic Transformations
○ Filter
○ Map
○ FlatMap
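
A quick spark-shell sketch of these transformations (sample strings are made up); remember they are lazy, so each line ends with collect() as the action that actually runs the recipe:

    val words = sc.parallelize(Seq("Spark is fast", "Scala is fun"))

    words.map(_.toUpperCase).collect()            // Array("SPARK IS FAST", "SCALA IS FUN")
    words.flatMap(_.split(" ")).collect()         // Array("Spark", "is", "fast", "Scala", "is", "fun")
    words.filter(_.contains("Spark")).collect()   // Array("Spark is fast")
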
Spark DataFrames

● Spark DataFrames are also now the standard way of using Spark’s Machine Learning capabilities
● Spark DataFrame documentation is still pretty new and can be sparse
● Let’s get a brief tour of the documentation!
  http://spark.apache.org/
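
As a small taste of the DataFrame style the documentation covers (the file path and the "region"/"amount" columns are made up; assumes spark-shell on Spark 2.x or later):

    // Create a DataFrame from a CSV file and work with named columns
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file:///tmp/sales.csv")

    df.printSchema()                            // column names and inferred types
    df.groupBy("region").sum("amount").show()   // aggregate by a column
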
[Diagram: Data Science Venn diagram - Computer Science, Machine Learning, Math & Statistics, Software, Research, and Domain Knowledge overlapping around DS]