Microsoft C+E Technology Training

Data Platform and Analytics Foundational Training

Solution Area: Data Analytics
Solution: Big Data
Technology: Apache Spark

[Speaker Name]
Apache Spark: A unified framework
A unified, open-source, parallel data processing framework for big data analytics

• Spark SQL: interactive queries
• Spark Streaming: stream processing
• Spark MLlib: machine learning
• GraphX: graph computation

All of these libraries run on the Spark core engine, which can be scheduled by YARN, Mesos, or the standalone scheduler. A minimal code sketch of the unified model follows.
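A minimal PySpark sketch of the unified model, assuming a SparkSession and illustrative file path and column names (not from the deck): the same session drives SQL queries and MLlib training without moving data to a separate engine.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# One SparkSession exposes SQL, DataFrames, and MLlib on the same engine
spark = SparkSession.builder.appName("unified-demo").getOrCreate()

# Spark SQL: load and query data (path and columns are hypothetical)
events = spark.read.json("/data/events.json")
events.createOrReplaceTempView("events")
training = spark.sql("SELECT amount, clicks, label FROM events")

# Spark MLlib: train a model on the same DataFrame, with no export step
assembler = VectorAssembler(inputCols=["amount", "clicks"], outputCol="features")
model = LogisticRegression(labelCol="label").fit(assembler.transform(training))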


Apache Spark benefits
• Performance
• Developer productivity
• Unified engine
• Ecosystem


Advantages of a unified platform
In many pipelines, data exchange between engines is the dominant cost.

With Spark, one pipeline can carry input streams of events through Spark Streaming, Spark SQL, and machine learning, and out to a NoSQL database, without handing data between separate engines; a sketch follows.
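As an illustration only (the socket source, console sink, and grouping column are assumptions, not from the deck), a minimal Structured Streaming sketch of such a pipeline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-pipeline").getOrCreate()

# Input stream of events (socket source used purely for illustration)
events = (spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load())

# Spark SQL-style transformation applied directly to the stream
counts = events.groupBy("value").count()

# Sink: console here; in practice this could feed a NoSQL store via foreachBatch
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()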
Spark integrates well with Hadoop

• Primary resource managers: Hadoop 1.0+ or Hadoop YARN
• Alternative resource managers: Mesos or the built-in Spark resource manager

A minimal session-setup sketch showing how the resource manager is selected follows.
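A minimal sketch, assuming the resource manager is selected through the master URL when the session is created (host names and ports below are placeholders):

from pyspark.sql import SparkSession

# The master URL selects the resource manager:
#   "yarn"               -> Hadoop YARN
#   "mesos://host:5050"  -> Apache Mesos
#   "spark://host:7077"  -> Spark standalone resource manager
#   "local[*]"           -> local mode for development
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("yarn")  # placeholder; pick the manager used by your cluster
         .getOrCreate())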
Faster data, faster results
140 50400 Spark is the 2014 Sort Benchmark
winner.
120 Hadoop 2100
3x faster than 2013 winner
(Hadoop).
100
Spark is fast not just for in-memory,
Running time(s)

80 but for on-disk computation too

60 102.5 100

40 72

20
Spark 0.9
6592
23 206
0
Logistic regression
1 2

Logistic regression on a 100-node cluster


with 100 GB of data. tinyurl.com/spark-sort
What makes Spark fast?
Data sharing between steps of a job

• In traditional MapReduce, each step reads its input from HDFS and writes its output back to HDFS, so every step pays the cost of disk I/O.
• In Spark, the job reads from and writes to HDFS only at the edges; intermediate data is shared between steps in memory, as sketched below.
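A minimal sketch of in-memory data sharing between steps (the HDFS path and filter conditions are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

# Single read from HDFS (path is a placeholder)
logs = spark.read.text("hdfs:///data/logs")

# Keep the filtered data in memory so later steps can reuse it
errors = logs.filter(logs.value.contains("ERROR")).cache()

# Step 1 and Step 2 both run against the cached in-memory data
# instead of re-reading from (or re-writing to) HDFS between steps
step1 = errors.filter(errors.value.contains("disk")).count()
step2 = errors.filter(errors.value.contains("network")).count()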
Spark cluster architecture
Driver program (SparkContext) -> Cluster manager -> Worker nodes (cache, tasks) -> reads/writes against HDFS

• The driver runs the user’s main function and executes the various parallel operations on the worker nodes
• The driver collects the results of the operations
• Worker nodes read and write data from/to HDFS
• Worker nodes also cache transformed data in-memory as RDDs

A small RDD sketch that walks through this flow follows.
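A small sketch of this flow using the RDD API (the HDFS path is a placeholder): the driver's main program defines the parallel operations, the workers execute and cache them as tasks, and the results come back to the driver.

from pyspark import SparkConf, SparkContext

# Driver program: creates the SparkContext, which talks to the cluster manager
sc = SparkContext(conf=SparkConf().setAppName("architecture-demo"))

# Worker nodes read the data from HDFS (path is a placeholder)
lines = sc.textFile("hdfs:///data/input.txt")

# Transformed data is cached in memory on the workers as an RDD
words = lines.flatMap(lambda line: line.split()).cache()

# Parallel operations run as tasks on the worker nodes;
# take() returns the first results to the driver
word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(word_counts.take(10))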
[Diagram: Spark cluster with notebook front ends]
• A browser connects through a gateway to Zeppelin and Jupyter notebooks, which submit Spark jobs to the cluster
• The head node runs the Spark master, which manages the running applications (App 0, App 1, App 2)
• The Spark driver holds the Spark context and RDD definitions and submits jobs
• Worker nodes 1-4 each run a worker (Worker 1-4) that executes the jobs' tasks

A hedged notebook sketch follows.
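In notebook environments such as Zeppelin and Jupyter the Spark session is typically created for you; below is a hedged sketch of a notebook cell, assuming a preconfigured spark session and an illustrative file path and column name (both assumptions, not from the deck):

# On a managed cluster the notebook usually provides spark (SparkSession)
# and sc (SparkContext) already connected to the Spark master (assumption;
# this varies by cluster setup).
df = spark.read.csv("/example/data/sample.csv", header=True)
df.groupBy("category").count().show()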
Use Cases
Apache Spark use cases
• High-performance batch computation
• Interactive analytics
• Machine learning
• Real-time stream processing
• Data integration and ETL
Azure HDInsight supports Spark
Microsoft delivers interactive analytics on Big Data with Azure HDInsight.
Power BI supports Spark
Power BI includes an out-of-the-box connector for Spark, enabling the creation and sharing of interactive reports and dashboards to any device.
© 2016 Microsoft Corporation. All rights reserved. Microsoft, Windows, Microsoft Azure, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The
information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions,
it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION
