Introduction to Spark

Spark is a framework for processing big data in a distributed manner across clusters of machines. It provides abstractions such as resilient distributed datasets (RDDs) that make it easy to program distributed applications in which processes on different machines communicate with one another. RDDs can be created from data sources such as HDFS files or in-memory collections and transformed with operations like map, filter, and reduceByKey. Transformations create new RDDs lazily; actions trigger job execution by returning a result or writing data to external storage. Spark runs in standalone mode, on YARN, or on Mesos, and supports running on AWS.


SPARK

FOR BIG DATA


DISTRIBUTED
PROGRAMMING
• PROBLEM: BIG DATA.

• SOLUTION: Provide abstractions that make it easy to program
systems in which processes (potentially on different machines)
communicate with one another.
streaming
[Diagram: Netflix viewing events such as {event:001, user:xt10, app:netflix, serie:3%, episode:2, genre:drama} are (1) aggregated per user and (2) passed to buildReco, which emits recommendations per user (userxt10 → reco:{narcos}, userxt11 → reco:mindhunter) and a top list: 1. orange is the new black, 2. narcos, 3. money heist.]
FLOW
[Diagram: the Spark driver sends work to cluster nodes; each node runs an executor that applies tasks (T1–T4) to RDD partitions (P1–P3), producing the partitions of new RDDs.]
PROGRAMMING
• RDDS - operations on RDDs form a DAG

• DAG - split into STAGES

• TASK SCHEDULER - EMITS tasks

• EXECUTOR - EXECUTES tasks
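A minimal sketch of this pipeline, assuming the Spark 2.x Java API and a local master (class and method names are illustrative): `toDebugString()` prints the lineage (DAG) that the scheduler splits into stages at shuffle boundaries such as `reduceByKey`.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class DagDemo {
    // Build a small two-stage pipeline and return its lineage (DAG) as text.
    static String lineage() {
        SparkConf conf = new SparkConf().setAppName("dag").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<Integer, Integer> pairs = sc
                .parallelize(Arrays.asList(1, 2, 3, 4))
                .mapToPair(n -> new Tuple2<>(n % 2, n)) // narrow: stays in stage 1
                .reduceByKey(Integer::sum);             // shuffle boundary -> new stage
            return pairs.toDebugString();
        }
    }

    public static void main(String[] args) {
        System.out.println(lineage());
    }
}
```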
RDD

Spark's main abstraction is the
Resilient Distributed
Dataset.
1. List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

JavaRDD<Integer> numbersRDD = sc.parallelize(numbers);

2. JavaRDD<String> lines = sc.textFile("s3n://srt-development1/file.txt");



TRANSFORMATIONS
• map( )

• flatMap( )

• filter( )

• distinct( )

• reduceByKey( )
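As a sketch of how these transformations chain, here is a word count assuming the Spark 2.x Java API and a local master (class and method names are illustrative). Note that the transformations themselves run nothing; the job only executes when `collectAsMap()` is called.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class TransformationsDemo {
    // Count word occurrences: flatMap -> mapToPair -> reduceByKey.
    static Map<String, Integer> wordCounts(List<String> lines) {
        SparkConf conf = new SparkConf().setAppName("transformations").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> linesRDD = sc.parallelize(lines);

            // flatMap: one line -> many words (lazy, nothing runs yet)
            JavaRDD<String> words = linesRDD.flatMap(l -> Arrays.asList(l.split(" ")).iterator());

            // map each word to (word, 1), then reduceByKey sums the counts per word
            JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey(Integer::sum);

            // collectAsMap is the action that finally triggers execution
            return new HashMap<>(counts.collectAsMap());
        }
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(Arrays.asList("a b", "b c", "a a")));
    }
}
```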
ACTIONS
• count( )

• collect( )

• saveAsTextFile( )

• cache( ) and persist( ) (strictly persistence operations, not
actions: they mark an RDD to be kept in memory but do not
trigger execution by themselves)
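A minimal sketch of actions triggering execution, assuming the Spark 2.x Java API and a local master (names are illustrative). The `filter` is lazy; each action (`count`, `collect`) launches a job, and `cache()` lets the second action reuse the first one's result.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ActionsDemo {
    // Build a lazy pipeline, then run two actions on it.
    static long countEvens(List<Integer> data) {
        SparkConf conf = new SparkConf().setAppName("actions").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(data);
            JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0); // lazy: no job yet
            evens.cache();                       // keep this RDD in memory across actions
            long c = evens.count();              // action #1: triggers a job
            List<Integer> out = evens.collect(); // action #2: reuses the cached RDD
            System.out.println(out);
            return c;
        }
    }

    public static void main(String[] args) {
        System.out.println(countEvens(Arrays.asList(1, 2, 3, 4, 5, 6)));
    }
}
```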
SHARED
VARIABLES
• Broadcast: a read-only variable shared with
every executor.

• Accumulators: used for distributed counters and sums.
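The two shared-variable types can be sketched together, assuming the Spark 2.x Java API and a local master (names are illustrative): a broadcast stop-word set is read by every executor, while an accumulator counts, across all partitions, how many words were dropped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

public class SharedVariablesDemo {
    // Filter out stop words (broadcast) while counting drops (accumulator).
    static long countDropped(List<String> words, Set<String> stopWords) {
        SparkConf conf = new SparkConf().setAppName("shared").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Broadcast: read-only copy shipped once to every executor
            Broadcast<Set<String>> stop = sc.broadcast(stopWords);
            // Accumulator: executors add to it; only the driver reads its value
            LongAccumulator dropped = sc.sc().longAccumulator("dropped");

            JavaRDD<String> kept = sc.parallelize(words).filter(w -> {
                if (stop.value().contains(w)) {
                    dropped.add(1);
                    return false;
                }
                return true;
            });
            kept.count(); // action: runs the job so the accumulator gets populated
            return dropped.value();
        }
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a"));
        System.out.println(countDropped(Arrays.asList("the", "cat", "a", "dog"), stop));
    }
}
```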


RUNNING SPARK
• LOCAL

• STANDALONE

• YARN

• MESOS
SPARK ON AWS
• mvn clean install

• aws s3 cp spark_apps-0.0.1-SNAPSHOT.jar s3://srt-development1/jars/spark_apps-0.0.1-SNAPSHOT.jar
spark-submit
nohup spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --executor-memory 1g \
  --driver-memory 1g \
  --class com.srt.spark.main.WordCount s3://srt-development1/jars/sparks_apps.jar &


PROPERTIES
• spark.executor.cores: the number of cores used by each executor.
Default: 1 on YARN; all available cores on the worker in
standalone mode.

• spark.driver.memory: the amount of memory that can be used
by the driver.

• spark.executor.memory: the amount of memory that can be used
by each executor.

• spark.cores.max: the maximum number of CPU cores to request for
the application across the whole cluster.
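These properties can also be set per job at submit time with `--conf` flags, as in this sketch (the values and jar/class names are illustrative, reusing the deck's earlier example):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=2g \
  --conf spark.driver.memory=1g \
  --conf spark.cores.max=8 \
  --class com.srt.spark.main.WordCount s3://srt-development1/jars/sparks_apps.jar
```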
REFERENCES
• Learning Spark: Lightning-Fast Big Data Analysis

• Apache Spark 2.x for Java Developers

• https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A
