Scala and Spark Overview PDF

This document provides an overview of Scala, Apache Spark, and the big data ecosystem. It discusses how Scala was designed as a general purpose language that compiles to Java bytecode and can use Java libraries. It then explains key concepts in big data including Hadoop, MapReduce, and how Spark improves on MapReduce by keeping more data in memory and being faster. Resilient Distributed Datasets (RDDs) are introduced as the core of Spark, including transformations and actions. Finally, it briefly mentions Spark DataFrames.


Scala and Spark Overview

Scala and Spark

● In this lecture we will give an overview of the Scala programming language
● Then we will discuss the general Big Data Ecosystem
● Afterwards we will show how Apache Spark fits into all of this
Scala

● Scala is a general purpose programming language
● It was designed by Martin Odersky in the early 2000s at EPFL (École Polytechnique Fédérale de Lausanne)
● It was designed to overcome criticism of Java’s shortcomings
Scala

● Scala source code is intended to be compiled to Java bytecode to run on a Java Virtual Machine (JVM)
● Java libraries may be used directly in Scala
● Unlike Java, Scala has many features of functional programming
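
As a small illustration of these points, here is a minimal sketch (not from the slides, assuming Scala 2.12 or later): it calls a plain Java standard-library class directly and uses functional-style collection operations.

    // Calling a Java library class directly from Scala, plus functional-style collections
    import java.time.LocalDate   // a plain Java standard-library class

    object ScalaTaste extends App {
      val today: LocalDate = LocalDate.now()              // Java API, used as-is
      println(s"Today is $today")

      // Functional features: immutable values, first-class functions, filter/map
      val squaresOfEvens = (1 to 10).filter(_ % 2 == 0).map(n => n * n)
      println(squaresOfEvens)                             // Vector(4, 16, 36, 64, 100)
    }
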
Scala

● A large reason demand for Scala has risen dramatically in recent years is Apache Spark
● Let’s discuss what Spark is in the context of Big Data
● We’ll begin with a general explanation of what Big Data is and related technologies
Big Data Overview

● Explanation of Hadoop, MapReduce, and Spark
● Local versus Distributed Systems
● Overview of the Hadoop Ecosystem
● Overview of Spark
Big Data

● Data on the scale of 0-32 GB (depending on RAM) can fit on a local computer
● But what can we do if we have a larger set of data?
  ○ Try using a SQL database to move storage onto the hard drive instead of RAM
  ○ Or use a distributed system that distributes the data to multiple machines/computers
Local versus Distributed

[Diagram: a local system uses the cores of a single machine, while a distributed system spreads cores across several machines connected over a network]
Big Data

● A local process will use the computation resources of a single machine
● A distributed process has access to the computational resources across a number of machines connected through a network
● After a certain point, it is easier to scale out to many lower-CPU machines than to try to scale up to a single machine with a high-end CPU
Big Data

● Distributed machines also have the advantage of scaling easily: you can just add more machines
● They also provide fault tolerance: if one machine fails, the whole network can still carry on
● Let’s discuss the typical format of a distributed architecture that uses Hadoop
Hadoop

● Hadoop is a way to distribute very large files across multiple machines
● It uses the Hadoop Distributed File System (HDFS)
● HDFS allows a user to work with large data sets
● HDFS also duplicates blocks of data for fault tolerance
● It also then uses MapReduce
● MapReduce allows computations on that data
Distributed Storage - HDFS

[Diagram: a Name Node (CPU, RAM) coordinating several Data Nodes, each with its own CPU and RAM]
Distributed Storage - HDFS

● HDFS will use blocks of data, with a size of 128 MB by default
● Each of these blocks is replicated 3 times
● The blocks are distributed in a way to support fault tolerance
Distributed Storage - HDFS

● Smaller blocks provide more parallelization during processing
● Multiple copies of a block prevent loss of data due to the failure of a node
MapReduce

● MapReduce is a way of splitting a computation task across a distributed set of files (such as HDFS)
● It consists of a Job Tracker and multiple Task Trackers

[Diagram: a Job Tracker (CPU, RAM) coordinating several Task Trackers, each with its own CPU and RAM]
MapReduce

● The Job Tracker sends code to run on the Task Trackers
● The Task Trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes
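
Hadoop's own MapReduce API is Java and is not covered in these slides, but the map-then-reduce idea itself can be sketched with plain Scala collections. The snippet below is illustrative only (the sample lines are made up): a "map" step emits (word, 1) pairs, a shuffle groups them by key, and a "reduce" step sums each group; Hadoop would run these phases across many Task Trackers.

    // Illustrative word count using the map-then-reduce pattern on plain Scala collections
    val lines = Seq("spark is fast", "hadoop is distributed", "spark is popular")

    val counts = lines
      .flatMap(_.split("\\s+"))               // map phase: emit one word per token
      .map(word => (word, 1))                 // emit (key, value) pairs
      .groupBy(_._1)                          // shuffle: group the pairs by key
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }   // reduce phase: sum per key

    counts.foreach(println)                   // e.g. (spark,2), (is,3), ...
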
Big Data

● What we covered can be thought of in two distinct parts:
  ○ Using HDFS to distribute large data sets
  ○ Using MapReduce to distribute a computational task to a distributed data set
● Next we will learn about the latest technology in this space, known as Spark
● Spark improves on these concepts of distribution
Spark

● This lecture will be an abstract overview; we will discuss:
  ○ Spark
  ○ Spark vs MapReduce
  ○ Spark RDDs
  ○ Spark DataFrames
Spark

● Spark is one of the latest technologies being used to quickly and easily handle Big Data
● It is an open source Apache project
● It was first released in February 2013 and has exploded in popularity due to its ease of use and speed
● It was created at the AMPLab at UC Berkeley
Spark

● You can think of Spark as a flexible alternative to MapReduce
● Spark can use data stored in a variety of formats:
  ○ Cassandra
  ○ AWS S3
  ○ HDFS
  ○ And more
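
As a hedged sketch of this flexibility (the hostnames, bucket, and paths below are made up, and the relevant connectors are assumed to be configured): the same Spark call can point at different storage systems just by changing the URI scheme. In spark-shell, `sc` (the SparkContext) already exists.

    // Reading the same kind of data from different storage back-ends
    val fromHdfs  = sc.textFile("hdfs://namenode:9000/data/events.txt")   // HDFS
    val fromS3    = sc.textFile("s3a://my-example-bucket/events.txt")     // AWS S3
    val fromLocal = sc.textFile("file:///tmp/events.txt")                 // local file system

    println(fromLocal.count())   // an action; triggers the actual read
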
Spark vs MapReduce

● MapReduce requires files to be stored in HDFS, Spark does not!
● Spark can also perform operations up to 100x faster than MapReduce
● So how does it achieve this speed?
Spark vs MapReduce

● MapReduce writes most data to disk after each map and reduce operation
● Spark keeps most of the data in memory after each transformation
● Spark can spill over to disk if the memory is filled
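
A minimal spark-shell sketch of this in-memory behaviour (the log path is hypothetical): persisting an RDD keeps it in memory between actions, and MEMORY_AND_DISK lets it spill to disk if memory fills up.

    import org.apache.spark.storage.StorageLevel

    val logs   = sc.textFile("file:///tmp/server.log")      // hypothetical path
    val errors = logs.filter(_.contains("ERROR"))
    errors.persist(StorageLevel.MEMORY_AND_DISK)            // keep in memory, spill if needed

    println(errors.count())   // first action: reads the file and caches the result
    println(errors.count())   // second action: served from memory (or disk if spilled)
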
Spark RDDs

● At the core of Spark is the idea of a Resilient Distributed Dataset (RDD)
● A Resilient Distributed Dataset (RDD) has 4 main features:
  ○ Distributed collection of data
  ○ Fault-tolerant
  ○ Parallel operation - partitioned
  ○ Ability to use many data sources
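
A small spark-shell sketch of the "distributed, partitioned" idea (the numbers and partition count are arbitrary): a local collection is turned into an RDD split across partitions that can be processed in parallel.

    // Build an RDD from a local collection, split into 8 partitions
    val numbers = sc.parallelize(1 to 1000000, 8)

    println(numbers.getNumPartitions)   // 8
    println(numbers.sum())              // computed in parallel across the partitions
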
Spark RDDs

● RDDs are immutable, lazily evaluated, and cacheable
● There are two types of RDD operations:
  ○ Transformations
  ○ Actions
● Transformations are basically a recipe to follow
● Actions actually perform what the recipe says to do and return something back
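
A short spark-shell sketch of the recipe idea (sample values are made up): the transformations build up a plan but run nothing; the action at the end is what triggers the computation.

    val nums = sc.parallelize(1 to 10)

    val doubled = nums.map(_ * 2)           // transformation: nothing runs yet
    val bigOnes = doubled.filter(_ > 10)    // transformation: still just a recipe

    // Action: the recipe finally executes and results come back to the driver
    println(bigOnes.collect().mkString(", "))   // 12, 14, 16, 18, 20
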
Spark RDDs

● When discussing Spark syntax you will see RDD versus DataFrame syntax show up
● With the release of Spark 2.0, Spark is moving towards a DataFrame-based syntax; keep in mind that the way files are being distributed can still be thought of as RDDs, it is just the typed-out syntax that is changing
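
To make the syntax difference concrete, here is a hedged side-by-side sketch (the file path and the "name,age" columns are made up; assumes spark-shell on Spark 2.x or later, where `sc`, `spark`, and the `$"col"` syntax are already available):

    // RDD syntax: functions over raw text lines
    val peopleRdd = sc.textFile("file:///tmp/people.csv")
    val adultsRdd = peopleRdd
      .filter(line => !line.startsWith("name"))                  // skip the header row
      .filter(line => line.split(",")(1).trim.toInt >= 18)       // second column assumed to be age

    // DataFrame syntax: named columns and a schema
    val peopleDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file:///tmp/people.csv")
    val adultsDf = peopleDf.filter($"age" >= 18)
    adultsDf.show()
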
Spark RDDs

● We’ve covered a lot!
● Don’t worry if you didn’t memorize all these details; a lot of this will be covered again as we learn how to actually code out and utilize these ideas!
Spark RDDs

● Basic Actions
○ First
○ Collect
○ Count
○ Take
Spark RDDs

● Collect - Return all the elements of the RDD as an array at the driver program
● Count - Return the number of elements in the RDD
● First - Return the first element in the RDD
● Take - Return an array with the first n elements of the RDD
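
A quick spark-shell sketch of these actions (the sample numbers are arbitrary):

    val rdd = sc.parallelize(Seq(10, 20, 30, 40, 50))

    rdd.collect()   // Array(10, 20, 30, 40, 50) - everything back at the driver
    rdd.count()     // 5
    rdd.first()     // 10
    rdd.take(3)     // Array(10, 20, 30)
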
Spark RDDs

● Basic Transformations
○ Filter
○ Map
○ FlatMap
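
A quick spark-shell sketch of these transformations (sample strings are made up); remember they are lazy, so each line ends with collect() as the action that actually runs the recipe:

    val words = sc.parallelize(Seq("Spark is fast", "Scala is fun"))

    words.map(_.toUpperCase).collect()            // Array("SPARK IS FAST", "SCALA IS FUN")
    words.flatMap(_.split(" ")).collect()         // Array("Spark", "is", "fast", "Scala", "is", "fun")
    words.filter(_.contains("Spark")).collect()   // Array("Spark is fast")
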
Spark DataFrames

● Spark DataFrames are also now the standard way of using Spark’s Machine Learning capabilities
● Spark DataFrame documentation is still pretty new and can be sparse
● Let’s get a brief tour of the documentation!
  http://spark.apache.org/
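
As a small taste of the DataFrame style the documentation covers (the file path and the "region"/"amount" columns are made up; assumes spark-shell on Spark 2.x or later):

    // Create a DataFrame from a CSV file and work with named columns
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file:///tmp/sales.csv")

    df.printSchema()                            // column names and inferred types
    df.groupBy("region").sum("amount").show()   // aggregate by a column
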
[Diagram: Data Science Venn diagram - Computer Science, Machine Learning, Math & Statistics, Software, Research, and Domain Knowledge overlapping around DS]