Introduction to Spark
Input events:
{event:001, user:xt10, app:netflix, serie:3%, episode:2, genre:drama}
{event:002, user:xt11, app:netflix, serie:blackmirror, episode:1, genre:sci-fi}
(1) Aggregate: events grouped per user
userxt10, [ {},{},{},{} ]
userxt11, [ {},{} ]
(2) buildReco: recommendations per user
userxt10, reco:{narcos}
userxt11, reco:mindhunter
top
1. orange is the new black
2. narcos
3. money heist
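The Aggregate and buildReco steps from the diagram could look roughly like this with the Java RDD API. This is only a sketch: the `Event` class, the rule inside `buildReco` (pick the genre the user watched most), the sample events in `main` and the `local[*]` master are all assumptions made for illustration, not something the slides specify.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RecoPipeline {

    // Hypothetical event type with a few of the fields shown in the diagram.
    public static class Event implements Serializable {
        public final String user;
        public final String serie;
        public final String genre;

        public Event(String user, String serie, String genre) {
            this.user = user;
            this.serie = serie;
            this.genre = genre;
        }
    }

    // Placeholder rule (an assumption): return the genre the user watched most.
    static String buildReco(Iterable<Event> events) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (Event e : events) {
            int n = counts.merge(e.genre, 1, Integer::sum);
            if (best == null || n > counts.get(best)) {
                best = e.genre;
            }
        }
        return best;
    }

    // (1) Aggregate the events per user, (2) build one recommendation per user.
    public static Map<String, String> recommend(JavaSparkContext sc, List<Event> input) {
        JavaRDD<Event> events = sc.parallelize(input);
        JavaPairRDD<String, Event> keyed = events.mapToPair(e -> new Tuple2<>(e.user, e));
        JavaPairRDD<String, Iterable<Event>> byUser = keyed.groupByKey();
        JavaPairRDD<String, String> recos = byUser.mapValues(RecoPipeline::buildReco);
        return recos.collectAsMap();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RecoPipeline").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Sample events; the third one is made up so that xt10 has more than one event.
            List<Event> input = Arrays.asList(
                    new Event("xt10", "3%", "drama"),
                    new Event("xt11", "blackmirror", "sci-fi"),
                    new Event("xt10", "narcos", "drama"));
            System.out.println(recommend(sc, input));
        }
    }
}
```

`groupByKey` is used here because it mirrors the "list of events per user" shape in the diagram; for large inputs an aggregation such as `reduceByKey` usually shuffles less data.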
FLOW
[Diagram: the Spark driver and a cluster node running an executor; RDDs connected by steps labelled T1 to T4, with parts labelled P1 to P3.]
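SCHEDULING
• RDDs - DAG: the chain of RDD operations forms a DAG
• DAG - STAGES: the DAG is split into stages
• EXECUTOR - EXECUTES: executors run the tasks of each stage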
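One way to look at the RDD-to-DAG step is to print the lineage of an RDD with `toDebugString()`. The sketch below assumes Spark 2.x or later running with a local master; the class name, method name and sample words are illustrative.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LineageDemo {

    // Builds a small chain of RDDs and returns its lineage as text.
    // No action is called, so no job is actually run here.
    public static String lineage(JavaSparkContext sc) {
        JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "a", "c"));
        JavaPairRDD<String, Integer> ones = words.mapToPair(w -> new Tuple2<>(w, 1));
        // reduceByKey needs a shuffle, which is where the DAG is cut into stages.
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(Integer::sum);
        return counts.toDebugString();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LineageDemo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(lineage(sc));
        }
    }
}
```

The indentation in the printed lineage marks the shuffle boundary between the two stages.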
RDD
The main abstraction in Spark is the Resilient Distributed Dataset (RDD).
List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> numbersRDD = sc.parallelize(numbers);
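The snippet above relies on an existing `sc` object. A minimal, self-contained version might look like the following; the application name, the `local[*]` master and the `roundTrip` helper are assumptions added for the example.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeDemo {

    // Distributes a local list as an RDD and brings the elements back to the driver.
    public static List<Integer> roundTrip(JavaSparkContext sc, List<Integer> numbers) {
        JavaRDD<Integer> numbersRDD = sc.parallelize(numbers);
        return numbersRDD.collect();
    }

    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM, using all available cores.
        SparkConf conf = new SparkConf().setAppName("ParallelizeDemo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
            System.out.println(roundTrip(sc, numbers));
        }
    }
}
```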
TRANSFORMATIONS
• flatMap( )
• filter( )
• distinct( )
• reduceByKey( )
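The four operations listed above can be combined in a small word-count style sketch. It assumes Spark 2.x or later, where `flatMap` expects an `Iterator`; the class, the method names and the "longer than 3 characters" filter are illustrative, and `mapToPair` is an extra helper step that is not in the list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class TransformationsDemo {

    // flatMap: one line becomes many words.
    static JavaRDD<String> words(JavaSparkContext sc, List<String> lines) {
        return sc.parallelize(lines)
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator());
    }

    // filter + reduceByKey: count only the words longer than 3 characters.
    public static Map<String, Integer> longWordCounts(JavaSparkContext sc, List<String> lines) {
        JavaRDD<String> longWords = words(sc, lines).filter(w -> w.length() > 3);
        JavaPairRDD<String, Integer> ones = longWords.mapToPair(w -> new Tuple2<>(w, 1));
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(Integer::sum);
        return counts.collectAsMap();
    }

    // distinct: number of different words across all lines.
    public static long distinctWordCount(JavaSparkContext sc, List<String> lines) {
        return words(sc, lines).distinct().count();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("TransformationsDemo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<String> lines = Arrays.asList("spark makes rdds", "spark runs on yarn");
            System.out.println(longWordCounts(sc, lines));
            System.out.println(distinctWordCount(sc, lines));
        }
    }
}
```

Transformations are lazy: nothing is computed until `collectAsMap()` or `count()` is called at the end of each method.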
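ACTIONS
• count( )
• collect( )
• saveAsTextFile( )
• cache( ): not an action; marks the RDD to be kept in memory after it is first computed
• persist( ): like cache( ), with a choice of storage level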
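A short sketch of how these calls interact, assuming a local master; the class name and the squared numbers are illustrative. `saveAsTextFile` is left out because it needs an output path that does not exist yet.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class ActionsDemo {

    // Runs two actions over the same persisted RDD and returns the collected values.
    public static List<Integer> squares(JavaSparkContext sc, List<Integer> numbers) {
        JavaRDD<Integer> squares = sc.parallelize(numbers).map(n -> n * n);

        // persist() only marks the RDD; MEMORY_ONLY is the level cache() uses for RDDs.
        squares.persist(StorageLevel.MEMORY_ONLY());

        // First action: computes the RDD and stores its partitions in memory.
        long total = squares.count();
        System.out.println("elements: " + total);

        // Second action: can reuse the stored partitions instead of recomputing them.
        List<Integer> values = squares.collect();

        // Release the cached partitions once they are no longer needed.
        squares.unpersist();
        return values;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ActionsDemo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(squares(sc, Arrays.asList(1, 2, 3, 4, 5)));
        }
    }
}
```

`collect()` brings every element to the driver, so it is only suitable for results that fit in the driver's memory.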
SHARED VARIABLES
• Broadcast: a broadcast variable is a read-only variable shared across all executors.
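As an illustration of a broadcast variable, the sketch below ships a small lookup table to the executors once and reads it inside a `map`. The table contents, the class and method names and the "unknown" fallback are assumptions for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastDemo {

    // Looks up the genre of each series using a broadcast, read-only map.
    public static List<String> resolveGenres(JavaSparkContext sc, List<String> series) {
        Map<String, String> genreBySerie = new HashMap<>();
        genreBySerie.put("3%", "drama");
        genreBySerie.put("blackmirror", "sci-fi");

        // The map is sent to each executor once instead of with every task.
        Broadcast<Map<String, String>> lookup = sc.broadcast(genreBySerie);

        // Tasks read the shared value through value(); they must not modify it.
        JavaRDD<String> genres = sc.parallelize(series)
                .map(s -> lookup.value().getOrDefault(s, "unknown"));
        return genres.collect();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("BroadcastDemo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(resolveGenres(sc, Arrays.asList("3%", "blackmirror", "narcos")));
        }
    }
}
```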
CLUSTER MANAGERS
• STANDALONE
• YARN
• MESOS
SPARK ON AWS
• mvn clean install
• spark-submit \
--deploy-mode cluster \
--master yarn \
--executor-memory 1g \
--driver-memory 1g \