Presentation On Apache Spark
High-Productivity Language Support
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();
Apache Spark is best suited to iterative jobs such as machine learning.
Spark keeps working data in memory using RDDs (Resilient Distributed Datasets).
Once a dataset has been loaded into memory, subsequent iterations reuse the
in-memory copy instead of reloading the data from disk on every pass, which
gives a tremendous increase in speed.
Frequently used datasets are held in this in-memory cache.
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster
on disk.
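The in-memory reuse described above can be sketched in PySpark. This is a sketch, not code from the slides: it assumes a PySpark installation, an existing SparkContext `sc`, and an illustrative input path.

```python
# Sketch only: assumes a PySpark installation and an existing SparkContext `sc`.
lines = sc.textFile("hdfs://...")              # illustrative path
errors = lines.filter(lambda s: "ERROR" in s)
errors.cache()                                 # ask Spark to keep this RDD in memory
errors.count()   # first action: computed from disk, result cached in memory
errors.count()   # later actions reuse the in-memory copy, avoiding the disk reload
```

`cache()` marks the RDD for in-memory storage; it is this reuse across repeated actions that avoids loading the same data from disk again and again.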
Spark libraries
Spark SQL:
It is Spark's module for working with structured data.
Spark Streaming:
It makes it easy to build scalable, fault-tolerant streaming
applications.
MLlib:
It is Apache Spark's scalable machine learning library.
GraphX:
It is Apache Spark's API for graphs and graph-parallel computation.
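As a minimal, hedged sketch of the Spark SQL module listed above (assuming a local PySpark installation; the table name and data are made up for illustration):

```python
# Sketch: requires PySpark; the names and data are illustrative, not from the slides.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()
spark.stop()
```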
Thank You!