Introduction to Spark

Spark is a framework for processing big data in a distributed manner across clusters of machines. It provides abstractions such as resilient distributed datasets (RDDs) that make it easy to program distributed applications in which processes on different machines communicate with one another. RDDs can be created from data sources such as HDFS files or in-memory collections and transformed with operations like map, filter, and reduceByKey. Transformations create new RDDs lazily; actions trigger job execution by returning a result or writing data to external storage. Spark runs in standalone mode, on YARN, or on Mesos, and supports running on AWS.


SPARK

FOR BIG DATA


DISTRIBUTED
PROGRAMMING
• PROBLEM: BIG DATA.

• SOLUTION: Provide abstractions that make it easy to program
systems in which processes (potentially on different machines)
communicate with one another.
streaming
[Diagram: Netflix viewing events such as {event:001, user:xt10, app:netflix, serie:3%, episode:2, genre:drama} are (1) aggregated per user and (2) passed to buildReco, which emits recommendations per user (userxt10 → reco:{narcos}, userxt11 → reco:mindhunter) and a top list: 1. orange is the new black, 2. narcos, 3. money heist.]
FLOW
[Diagram: the Spark driver sends work to cluster nodes; each node runs an executor that applies tasks (T1–T4) to RDD partitions (P1–P3), producing the partitions of new RDDs.]
PROGRAMMING
• RDDS - operations on RDDs form a DAG

• DAG - split into STAGES

• TASK SCHEDULER - EMITS tasks

• EXECUTOR - EXECUTES tasks
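A minimal sketch of this pipeline, assuming the Spark 2.x Java API and a local master (class and method names are illustrative): `toDebugString()` prints the lineage (DAG) that the scheduler splits into stages at shuffle boundaries such as `reduceByKey`.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class DagDemo {
    // Build a small two-stage pipeline and return its lineage (DAG) as text.
    static String lineage() {
        SparkConf conf = new SparkConf().setAppName("dag").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<Integer, Integer> pairs = sc
                .parallelize(Arrays.asList(1, 2, 3, 4))
                .mapToPair(n -> new Tuple2<>(n % 2, n)) // narrow: stays in stage 1
                .reduceByKey(Integer::sum);             // shuffle boundary -> new stage
            return pairs.toDebugString();
        }
    }

    public static void main(String[] args) {
        System.out.println(lineage());
    }
}
```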
RDD

Spark's main abstraction is the
Resilient Distributed
Dataset.
1. List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

JavaRDD<Integer> numbersRDD = sc.parallelize(numbers);

2. JavaRDD<String> lines = sc.textFile("s3n://srt-development1/file.txt");



TRANSFORMATIONS
• map( )

• flatMap( )

• filter( )

• distinct( )

• reduceByKey( )
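As a sketch of how these transformations chain, here is a word count assuming the Spark 2.x Java API and a local master (class and method names are illustrative). Note that the transformations themselves run nothing; the job only executes when `collectAsMap()` is called.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class TransformationsDemo {
    // Count word occurrences: flatMap -> mapToPair -> reduceByKey.
    static Map<String, Integer> wordCounts(List<String> lines) {
        SparkConf conf = new SparkConf().setAppName("transformations").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> linesRDD = sc.parallelize(lines);

            // flatMap: one line -> many words (lazy, nothing runs yet)
            JavaRDD<String> words = linesRDD.flatMap(l -> Arrays.asList(l.split(" ")).iterator());

            // map each word to (word, 1), then reduceByKey sums the counts per word
            JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey(Integer::sum);

            // collectAsMap is the action that finally triggers execution
            return new HashMap<>(counts.collectAsMap());
        }
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(Arrays.asList("a b", "b c", "a a")));
    }
}
```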
ACTIONS
• count( )

• collect( )

• saveAsTextFile( )

• cache( ) and persist( ) (strictly persistence operations, not
actions: they mark an RDD to be kept in memory but do not
trigger execution by themselves)
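A minimal sketch of actions triggering execution, assuming the Spark 2.x Java API and a local master (names are illustrative). The `filter` is lazy; each action (`count`, `collect`) launches a job, and `cache()` lets the second action reuse the first one's result.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ActionsDemo {
    // Build a lazy pipeline, then run two actions on it.
    static long countEvens(List<Integer> data) {
        SparkConf conf = new SparkConf().setAppName("actions").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(data);
            JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0); // lazy: no job yet
            evens.cache();                       // keep this RDD in memory across actions
            long c = evens.count();              // action #1: triggers a job
            List<Integer> out = evens.collect(); // action #2: reuses the cached RDD
            System.out.println(out);
            return c;
        }
    }

    public static void main(String[] args) {
        System.out.println(countEvens(Arrays.asList(1, 2, 3, 4, 5, 6)));
    }
}
```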
SHARED
VARIABLES
• Broadcast: a read-only variable shared with
every executor.

• Accumulators: used for distributed counters and sums.
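The two shared-variable types can be sketched together, assuming the Spark 2.x Java API and a local master (names are illustrative): a broadcast stop-word set is read by every executor, while an accumulator counts, across all partitions, how many words were dropped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

public class SharedVariablesDemo {
    // Filter out stop words (broadcast) while counting drops (accumulator).
    static long countDropped(List<String> words, Set<String> stopWords) {
        SparkConf conf = new SparkConf().setAppName("shared").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Broadcast: read-only copy shipped once to every executor
            Broadcast<Set<String>> stop = sc.broadcast(stopWords);
            // Accumulator: executors add to it; only the driver reads its value
            LongAccumulator dropped = sc.sc().longAccumulator("dropped");

            JavaRDD<String> kept = sc.parallelize(words).filter(w -> {
                if (stop.value().contains(w)) {
                    dropped.add(1);
                    return false;
                }
                return true;
            });
            kept.count(); // action: runs the job so the accumulator gets populated
            return dropped.value();
        }
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a"));
        System.out.println(countDropped(Arrays.asList("the", "cat", "a", "dog"), stop));
    }
}
```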


RUNNING SPARK
• LOCAL

• STANDALONE

• YARN

• MESOS
SPARK ON AWS
• mvn clean install

• aws s3 cp spark_apps-0.0.1-SNAPSHOT.jar s3://srt-development1/jars/spark_apps-0.0.1-SNAPSHOT.jar
spark-submit
nohup spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --executor-memory 1g \
  --driver-memory 1g \
  --class com.srt.spark.main.WordCount s3://srt-development1/jars/sparks_apps.jar &


PROPERTIES
• spark.executor.cores: the number of cores used by each executor.
Default: 1 on YARN; all available cores on the worker in
standalone mode.

• spark.driver.memory: the amount of memory that can be used
by the driver.

• spark.executor.memory: the amount of memory that can be used
by each executor.

• spark.cores.max: the maximum number of CPU cores to request for
the application across the whole cluster.
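These properties can also be set per job at submit time with `--conf` flags, as in this sketch (the values and jar/class names are illustrative, reusing the deck's earlier example):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=2g \
  --conf spark.driver.memory=1g \
  --conf spark.cores.max=8 \
  --class com.srt.spark.main.WordCount s3://srt-development1/jars/sparks_apps.jar
```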
REFERENCES
• Learning Spark: Lightning-Fast Big Data Analysis

• Apache Spark 2.x for Java Developers

• https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A
