APACHE® SPARK™ SURVEY 2016 REPORT

Table of Contents

Introduction
Foreword: Matei Zaharia
REPORT HIGHLIGHTS
APACHE SPARK’S GROWTH CONTINUES
    The Apache Spark Community is Growing
    Spark’s Fastest Growing Areas from 2015 to 2016
    Spark Users are Growing
    Spark Users Employ Multiple Languages
    Spark Components Used in Production
    Spark is Used Widely in Organizations
    Users Solve Complex Problems
    Users Employ Multiple Components
    What Users Consider Important
    Top Three Storage Technologies
    Section Summary
APACHE SPARK IN THE CLOUD IS GROWING
    Trend: Increase in Public Cloud Deployments
    Trend: Percentage Decrease in On-Premises Deployments
    Section Summary
APACHE SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE
    Apache Spark Streaming is Growing
    Apache Spark Streaming Engine is the Preferred Choice
    Section Summary
Afterword: Reynold Xin
About Databricks

Introduction

In July 2016, Databricks conducted an Apache® Spark™ Survey to identify
how organizations are using Spark and to highlight growth trends since
the Spark Survey 2015. The results in this report reflect answers from
1615 respondents at over 900 distinct organizations, who were
predominantly Apache Spark users.

2015 was a year of tremendous growth for Apache Spark, and this year its
growth remains unabated, not only in areas like the public cloud but
also in the increased use of Spark Streaming and Machine Learning. 2016
also shows Spark’s robust adoption across a variety of organizations, by
users in many functional roles who build complex solutions using
multiple Spark components. Of the roles represented in the survey, 41%
of respondents identified themselves as data engineers, 23% as data
scientists and 21% as architects; the remaining respondents came from
technical management (10%) and academia (5%).

Foreword: Matei Zaharia

I’m delighted to share the results of this year’s Databricks Apache
Spark Survey. As I noted in the previous Spark Survey 2015, we witnessed
a rapid adoption of Spark and the precipitous growth of the Spark
community. This year, Spark’s growth trajectory and trends continue. In
particular, I’m excited to see more Spark deployments in the cloud and
more interest in building real-time applications using Spark Streaming
with multiple components, such as Machine Learning. Given that Apache
Spark 2.0 lays the foundational steps for Structured Streaming, by
providing simplified and unified APIs to write end-to-end streaming
applications called continuous applications, I anticipate this interest
will surge further in the coming months with subsequent releases of
Spark.

Since its inception, Spark’s core mission has been to make Big Data
simple and accessible for everyone, for organizations of all sizes and
across all industries. And we have not deviated from that mission. In
Apache Spark 2.0, we strived to make Spark easier, faster and smarter.
And we remain committed to our vision of simplicity. Seventy-six percent
of respondents in this survey indicate ease of programming as one of the
most important features of Spark.

Spark’s growth continues across various industries, with people in
various functional roles building complex data solutions. It has moved
well beyond the early-adopter phase at tech companies and is now
mainstream in large data-driven enterprises.

MATEI ZAHARIA
Chief Technologist at Databricks,
VP of Apache Spark at the Apache Software Foundation
@matei_zaharia

REPORT HIGHLIGHTS

TOP THREE APACHE SPARK TAKEAWAYS

SPARK’S GROWTH CONTINUES
SPARK IN THE CLOUD IS GROWING
SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE

This year the growth trend continues in the community. Increased growth
of Apache Spark Meetup members, a jump in Spark Summit attendees, more
code contributors, and a surge in companies represented at the Spark
Summit (from several vertical industries) suggest a growing and thriving
Spark community.

Metric                              2015      2016      Growth
Spark Meetup members                66,000    225,000   240%
Code contributors                   600       1000      67%
Spark Summit attendees              3912      5100      30%
Companies represented at Summits    1144      1800      57%

[Figure: logos of notable Spark users who presented at Spark Summit 2016]

Asked what Apache Spark components developers use to build complex
solutions for their use cases, 74% of respondents said they use two or
more components to build different types of products.

TYPES OF PRODUCTS BUILT
% of respondents who use Spark to create each product (more than one product could be selected)

68%    Business / customer intelligence
52%    Data warehousing
45%    Real-time / streaming solutions
40%    Recommendation engines
37%    Log processing
36%    User-facing services
29%    Fraud detection / security

NUMBER OF COMPONENTS USED

74% of respondents use two or more components
64% of respondents use three or more components

In addition to using multiple Apache Spark components, many respondents
indicated that they use multiple programming languages in Spark. They
are also using multiple components in production, including increased
use of Spark Streaming and MLlib.

LANGUAGES USED IN SPARK YEAR-OVER-YEAR
% of respondents who use each language (more than one language could be selected)

Language    2015    2016
Scala       71%     65%
Python      58%     62%
SQL         36%     44%
Java        31%     29%
R           18%     20%

SPARK COMPONENTS USED IN PRODUCTION YEAR-OVER-YEAR
% of respondents who use each component in production (more than one component could be selected)

Component                     2015    2016
DataFrames                    15%     38%
SQL                           24%     40%
Streaming                     14%     22%
Advanced analytics (MLlib)    13%     18%

APACHE SPARK’S FASTEST GROWING AREAS IN 2016

Component*                    2015    2016    Growth
DataFrames                    15%     38%     153%
Spark SQL                     24%     40%     67%
Streaming                     14%     22%     57%
Advanced analytics (MLlib)    13%     18%     38%

2015 and 2016 columns show % of respondents.
*component used in production

51% of users in the 2015 Spark Survey said they deployed Apache Spark in
the public cloud, compared with 61% of users in 2016: an increase of 10
percentage points, or relative growth of about 20%.

While Apache Spark deployments in the public cloud increased in 2016,
the percentage of Spark deployments on-premises decreased. For example,
48% of users in the 2015 Spark Survey and 42% in the 2016 survey said
they used Standalone cluster managers for their on-premises Spark
deployments, a relative decrease of 13%. Similarly, YARN and Mesos show
relative decreases of 10% and 36% respectively.

APACHE SPARK DEPLOYMENT IN PUBLIC CLOUDS
% of respondents who deployed in a public cloud

2015    2016    Change
51%     61%     +10 percentage points

ON-PREMISES DEPLOYMENTS YEAR-OVER-YEAR
% of respondents who use each (more than one deployment could be selected)

Cluster manager    2015    2016    Relative decrease
Standalone         48%     42%     13%
YARN               40%     36%     10%
Mesos              11%     7%      36%
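
The figures above mix two kinds of change: a difference in percentage
points and a change relative to the 2015 value. The following is a
minimal sketch in Python, not part of the survey itself, that computes
both. The function names point_change and relative_change are
hypothetical, and the inputs are the rounded percentages reported above,
so the results are approximate.

```python
def point_change(before: float, after: float) -> float:
    """Absolute change, in percentage points."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change relative to the starting value, in percent."""
    return (after - before) / before * 100


# Rounded shares of respondents reported in the survey: (2015, 2016).
deployments = {
    "Public cloud": (51, 61),
    "Standalone": (48, 42),
    "YARN": (40, 36),
    "Mesos": (11, 7),
}

for name, (y2015, y2016) in deployments.items():
    points = point_change(y2015, y2016)
    relative = relative_change(y2015, y2016)
    print(f"{name}: {points:+.0f} points, {relative:+.1f}% relative")
```

With these inputs, the public cloud share rises by 10 points, which is
close to 20% relative growth, while Standalone, YARN and Mesos come out
near the 13%, 10% and 36% relative decreases quoted in the report.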

Investments in fast data analytics have surged, according to Datanami.
As companies shift investments from batch to real-time applications,
respondents in this survey show an affinity toward building real-time
applications using the Spark Streaming framework.

IMPORTANCE OF APACHE SPARK STREAMING
51%    Very important
35%    Somewhat important
14%    Not important

Among all the streaming engines, 33% of respondents said they were heavy
users of Spark Streaming.

33% of respondents use Apache Spark Streaming a lot

Respondents indicated that Spark Streaming is very important for
building real-time streaming, recommendation engines, and fraud
detection applications.

Q: WHICH KINDS OF PRODUCTS DOES YOUR ORGANIZATION DEVELOP? Select all that apply.

45% of respondents develop real-time streaming products
40% of respondents develop recommendation engine products
29% of respondents develop fraud detection / security products

Machine Learning has seen an increase in production usage.

MLlib USE IN PRODUCTION
% of respondents who use the component in production

2015    2016    Growth
13%     18%     38%

APACHE SPARK’S GROWTH CONTINUES

The Apache Spark Community is Growing

This section identifies key growth areas in all aspects of Spark that
are propelling this uptake. Both 2015 and 2016 have seen tremendous
growth in the Spark community and in Spark usage across many vertical
industries.

Spark today remains the most active open source project in Big Data.
Today, there are over 1000 Spark contributors, compared to 600 in 2015,
from 250+ organizations. With such large numbers of contributors and
organizations investing in Spark’s future development, it has engaged a
community of developers globally. The Apache Spark Meetup groups’
membership continues to flourish, both nationally and internationally.

Metric                  2015      2016      Growth
Code contributors       600       1000      67%
Spark Meetup members    66,000    225,000   240%

Every year, more users attend Spark Summit, the largest conference
dedicated to the Apache Spark project. In 2016, more attendees from a
broad range of organizations came to this event, ranging from
developers, data scientists and engineers to business users, analysts,
and executive-level decision makers. A number of notable users presented
how they use Spark at the Spark Summit San Francisco 2016.

Metric                              2015    2016    Growth
Spark Summit attendees              3912    5100    30%
Companies represented at Summits    1144    1800    57%

[Figure: logos of notable Spark users who presented at Spark Summit 2016]

In just two years, the Spark community has shipped six Spark releases.
When asked which version of Apache Spark they are using, 75% responded
that they are using Spark 1.6, while 18% are using Spark 2.0
(respondents could choose multiple releases, such as 1.3, 1.4 or 1.5).

4 releases in 2015: 1.2, 1.3, 1.4, 1.5
2 major releases in 2016: 1.6, 2.0 (as of September 2016)

75%    Use Spark 1.6
18%    Use Spark 2.0
7%     Other

Spark’s Fastest Growing Areas from 2015 to 2016

Spark Streaming, in particular, has seen a notable increase in usage
since 2015, as have SQL, MLlib, and Windows users.

% of respondents in each year, with year-over-year growth

Area                                        2015    2016    Growth
DataFrame users in production               15%     38%     153%
Spark SQL users in production               24%     40%     67%
Streaming users in production               14%     22%     57%
Windows users in development                23%     32%     39%
Advanced analytics (MLlib) in production    13%     18%     38%
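
The growth percentages above follow from the 2015 and 2016 shares. As a
rough illustration (a sketch, not the survey’s own method), the Python
snippet below recomputes each growth figure and checks that it agrees
with the reported one to within one point. The helper names growth_pct
and check, and the one-point tolerance, are assumptions made for this
example; the inputs are the rounded shares shown in the report.

```python
# (area, 2015 share %, 2016 share %, growth % as reported in the survey)
reported = [
    ("DataFrames", 15, 38, 153),
    ("Spark SQL", 24, 40, 67),
    ("Streaming", 14, 22, 57),
    ("Windows (development)", 23, 32, 39),
    ("Advanced analytics (MLlib)", 13, 18, 38),
]


def growth_pct(before: float, after: float) -> float:
    """Year-over-year growth relative to the 2015 share, in percent."""
    return (after - before) / before * 100


def check(rows, tolerance=1.0):
    """Return (name, computed growth, within tolerance?) for each row."""
    results = []
    for name, y2015, y2016, stated in rows:
        computed = growth_pct(y2015, y2016)
        results.append((name, computed, abs(computed - stated) <= tolerance))
    return results


for name, computed, ok in check(reported):
    status = "matches" if ok else "differs from"
    print(f"{name}: computed {computed:.1f}%, {status} the reported figure")
```

Because the inputs are already rounded, small differences from the
published growth numbers are expected; the tolerance allows for that.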

Spark Users are Growing

Spark is attractive not only to highly skilled and technically advanced
users. It crosses barriers: other users, such as business analysts,
increasingly use Spark and develop Spark-based applications in
environments other than Linux. Since last year, the percentage of
Windows users employing Spark has increased by 39%.

DEVELOPMENT ENVIRONMENTS
% of respondents who use each development environment (more than one environment could be selected)

Environment     2015    2016
Linux / Unix    75%     74%
Windows         23%     32%
Mac OSX         14%     22%

Spark Users Employ Multiple Languages

Spark is becoming the key data processing and computing platform used by
a broad range of users. These users span many vertical industries and
use a variety of programming languages. One reason for this broad
adoption is that Spark is easy to use and supports familiar programming
APIs across these languages.

Usage of Spark in Python, SQL, and R increased, while Scala and Java
usage decreased. This indicates that more data analysts are drawn to
Spark from areas other than pure data engineering, suggesting that Spark
usage is expanding to new and diverse users.

Q: WHICH LANGUAGES DO YOU USE SPARK IN?
% of respondents who use each language (more than one language could be selected)

Language    2015    2016
Scala       71%     65%
SQL         36%     44%
Python      58%     62%
R           18%     20%
Java        31%     29%

Spark Components Used in Production

Since last year, the use of Spark components in production has
increased, especially Spark Streaming and advanced analytics with Apache
Spark MLlib (machine learning). This corroborates the observation in
this report about increased interest among Spark users in building
real-time streaming applications with Spark Streaming, using multiple
components, including MLlib.

Q: WHICH COMPONENTS OF THE APACHE SPARK STACK ARE YOU USING?
% of respondents who use each component in production (more than one component could be selected)

Component                     2015    2016    Growth in users
DataFrames                    15%     38%     153%
SQL                           24%     40%     67%
Streaming                     14%     22%     57%
Advanced analytics (MLlib)    13%     18%     38%

Spark is Used Widely in Organizations

Spark’s adoption continues to grow across varied industries because of
its unified engine and its proven performance and versatility, which
enable it to process diverse workloads.

The banking sector saw the highest percentage change in the usage of
Spark since 2015, followed by the Health, Medical, Biotech and Pharmacy
verticals.

Q: WHAT INDUSTRY VERTICAL BEST DESCRIBES YOUR ORGANIZATION?
Percentages rounded to the nearest integer.

[Pie chart: share of respondents by industry vertical. Segments:
Software (SaaS, web, mobile); Consulting (IT); Banking / Finance;
Education; Health / Medical / Pharmacy / Biotech; Advertising /
Marketing / PR; Publishing / Media; Carriers / Telecom; Computers /
Hardware; Ecommerce / Retail; Other.]

Industry                                 2015      2016      Growth in users
Banking                                  6.48%     10.58%    63%
Health / Medical / Pharmacy / Biotech    3.89%     5.42%     39%
Consulting (IT)                          13.98%    18.09%    29%

Users Solve Complex Problems

Users are solving complex data problems across varied industry
verticals, as Spark’s unified platform enables users to build complex
solutions using multiple Spark components for their multiple data
workloads.

Q: WHICH KINDS OF PRODUCTS DOES YOUR ORGANIZATION DEVELOP? Select all that apply.

68%    Business / customer intelligence
52%    Data warehousing
45%    Real-time / streaming solutions
40%    Recommendation engines
37%    Log processing
36%    User-facing services
29%    Fraud detection / security

Users Employ Multiple Components

Because of Spark’s unified engine and its ability to process multiple
workloads within the same cluster, many Spark users within organizations
use multiple components of Spark for their use cases and their
respective workloads.

Spark components are not only used separately; two or more are often
used together in prototyping and production. This unification blurs the
barriers between data scientists, data engineers, and data analysts, all
of whom use the same unified compute engine.

COMPONENTS USED IN PROTOTYPING AND PRODUCTION
More than one component could be selected.

67%    Spark SQL
67%    DataFrames
43%    MLlib
43%    Spark Streaming
31%    Datasets
14%    GraphX

74% of Spark users use two or more components
64% of Spark users use three or more components

What Users Consider Important

Users are drawn to Spark for a number of reasons: it’s easier to get
started quickly because of simple and consistent APIs; it’s faster
because of improvements in Apache Spark 2.0; and it’s smarter because of
simplified Structured Streaming APIs, allowing users to build end-to-end
continuous applications.

According to our 2015 Spark Survey, 91% of users consider performance
the most important aspect of Apache Spark, along with ease of
programming, real-time streaming and advanced analytics. In this year’s
survey, Spark users rate these as equally important.

% OF RESPONDENTS WHO CONSIDERED THE FEATURE VERY IMPORTANT
More than one feature could be selected.

91%    Performance
82%    Advanced analytics
76%    Ease of programming
69%    Ease of deployment
51%    Real-time streaming

At the time of this survey, Apache Spark 2.0 had just been officially
released, and users displayed a keen interest in using it. Even though
most users run Spark 1.6, the 2016 survey results suggest they had
quickly started using Spark 2.0.

75%    Run Spark 1.6
18%    Run Spark 2.0

Top Three Storage Technologies

A large number of Spark users use storage technologies other than
Apache® Hadoop®, such as Cassandra, MongoDB and other NoSQL stores, as
well as open-source and proprietary SQL data stores.

Q: WHICH OF THESE TECHNOLOGIES DO YOU CURRENTLY USE? Select all that apply.

82% of respondents use open-source SQL databases
73% of respondents use key-value stores (NoSQL)
58% of respondents use proprietary SQL databases

Section Summary

Apache Spark’s growth and adoption continue as users across industries,
development environments, disciplines, and programming languages embrace
its ease of use and programming, its unified compute engine, and its
performance to solve complex data problems at scale. Spark allows
multiple components to work on multiple workloads and access data from
multiple data sources. All of these factors make Spark an attractive
choice as a unified compute data platform.

APACHE SPARK IN THE CLOUD IS GROWING

Trend: Increase in Public Cloud Deployments

The rise of cloud computing is rapid, inexorable and causing a huge
upheaval in the tech industry, writes The Economist. “Gartner estimates
that about $205 billion, or 6% of the world’s IT budget of $3.4
trillion, will be spent on cloud computing in 2016, a number it expects
to grow to $240 billion next year,” according to another article in The
Economist.

This survey reflects this trend, as many respondents are electing to
deploy Spark in the public cloud, mitigating both cost and
infrastructure headaches.

Since 2015, we have seen a 20% growth of users deploying Spark in the
public cloud. That is, 61% of users in the 2016 survey said they
deployed Spark in the public cloud, compared to 51% in 2015.

SPARK DEPLOYMENT IN PUBLIC CLOUDS HAS INCREASED BY 10 PERCENTAGE POINTS SINCE 2015
% of respondents who deploy Spark in a public cloud

2015    2016
51%     61%

Trend: Percentage Decrease in On-Premises Deployments

Although many Spark users run Spark on-premises alongside Hadoop and
other data sources, some deployment modes in 2016 have seen a percentage
decrease.

Q: WHERE DO YOU RUN SPARK? Select all that apply.

Cluster manager    2015    2016    Decrease in deployments
Mesos              11%     7%      36%
YARN               40%     36%     10%
Standalone         48%     42%     13%

Section Summary

Not only do cloud deployments have lower deployment costs and fewer
management headaches, they also offer proven performance benefits.

Using Apache Spark on 206 EC2 machines, we sorted 100TB of data on disk in
23 minutes. In comparison, the previous world record set by Hadoop MapReduce
used 2100 machines and took 72 minutes. This means that Spark sorted the
same data 3X faster using 10X fewer machines.

REYNOLD XIN
Chief Architect & Co-Founder of Databricks
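
The “3X faster” and “10X fewer machines” figures in the quote above can
be reproduced directly from the numbers it gives. The short Python
sketch below does so; the machine-minutes comparison at the end is an
illustrative extra metric introduced here as an assumption, not
something reported in the survey, and it ignores any hardware
differences between the two clusters.

```python
# Figures quoted above for the 100TB sort.
spark = {"machines": 206, "minutes": 23}
hadoop = {"machines": 2100, "minutes": 72}

# How much faster the Spark run finished, and on how many fewer machines.
speedup = hadoop["minutes"] / spark["minutes"]
machine_ratio = hadoop["machines"] / spark["machines"]

# Machine-minutes: a rough proxy for total cluster time consumed
# (illustrative only; not a metric from the report).
spark_machine_minutes = spark["machines"] * spark["minutes"]
hadoop_machine_minutes = hadoop["machines"] * hadoop["minutes"]
resource_ratio = hadoop_machine_minutes / spark_machine_minutes

print(f"Speedup: {speedup:.1f}x")
print(f"Machine ratio: {machine_ratio:.1f}x")
print(f"Machine-minutes ratio: {resource_ratio:.1f}x")
```

The first two ratios round to the 3X and 10X stated in the quote.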

APACHE SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE

Apache Spark Streaming is Growing

Since its release, Spark Streaming has become one of the most widely
used distributed streaming engines. Interest in developing real-time
applications and advanced analytics is on the rise.

Over half of the survey respondents indicate that streaming is very
important for developing real-time streaming, recommendation engine, and
fraud-detection and security solutions.

Q: HOW IMPORTANT IS SPARK STREAMING TO YOUR USE CASE?

51%    Very important
35%    Somewhat important
14%    Not important

Q: WHICH KINDS OF PRODUCTS DOES YOUR ORGANIZATION DEVELOP? Select all that apply.

45% of respondents develop real-time streaming products
40% of respondents develop recommendation engine products
29% of respondents develop fraud detection / security products

Organizations use Spark Streaming along with Spark’s other components to
develop streaming applications. Both Spark Streaming and MLlib saw a
notable increase in production use.

SPARK STREAMING AND MLlib USE IN PRODUCTION
% of respondents who use the component in production (more than one component could be selected)

Component                     2015    2016    Growth in production cases
Streaming                     14%     22%     57%
Advanced analytics (MLlib)    13%     18%     38%

Apache Spark Streaming Engine is the Preferred Choice

Compared to other streaming engines, Spark is the preferred choice at
33%.

Q: WHICH OF THESE TECHNOLOGIES DO YOU CURRENTLY USE A LOT FOR STREAMING
AND/OR COMPLEX EVENT PROCESSING CASES? Select all that apply.

33%    Apache Spark
29%    Apache Kafka
6%     Apache Storm
4%     Kinesis
1%     Apache Flink
<1%    Apache Apex

Note: Respondents were predominantly Spark users.

When compared to other Spark components, Spark Streaming matches MLlib
at 71% in use, from evaluation to production.

APACHE SPARK COMPONENT POPULARITY
% of respondents who use the component anywhere from evaluation to production (more than one component could be selected)

71%    Spark Streaming
83%    RDDs
89%    DataFrames
71%    MLlib
88%    SQL

In the 2015 Spark Survey, 14% of users said they used Spark Streaming in
production, compared to 22% of users in 2016. Overall, we saw a 57%
growth of users using Spark Streaming in production.

Q: DO YOU CURRENTLY USE SPARK STREAMING IN PRODUCTION?

14% used it in 2015
22% are using it today
+57% Spark Streaming production cases

Section Summary

Spark Streaming is being used for real-time solutions, from evaluation
to production, and its usage is approaching that of Spark’s other
commonly used components. It is the preferred streaming engine among
respondents, and more organizations are building real-time streaming
solutions because they consider streaming an important Spark feature.

Afterword: Reynold Xin

2015 and 2016 have been exciting years for the adoption and increased
growth of Apache Spark and its community. Two releases, Spark 1.6 and
2.0, have seen major improvements in all the aspects of Spark that
respondents in this survey noted as important. I continue to look
forward to, and to work with the community toward, the exciting future
ahead for the Spark platform.

As Spark becomes easier, faster, and smarter, a newer audience outside
the predominant IT and consulting industries is adopting it, as results
from the survey suggest. Performance, ease-of-use, streaming, and
reliability top the list as most important features. At the time of this
survey, we released Apache Spark 2.0. Ongoing performance improvements,
with Project Tungsten, started in earlier releases and culminated in
Spark 2.0. In addition, Spark 2.0 delivered unified DataFrames and
Datasets APIs and simplified Structured Streaming APIs. All these make
Spark an attractive engine for performing advanced analytics across
industry verticals in solving complex data problems, by users from
different functional roles.

Your voice matters. We got an insightful glimpse into the growth and
trends from this year’s survey: who’s using Spark, how they are using
it, what’s important, what new features they use, and what they are
using it for. Just as the feedback from last year’s survey did, these
insights will drive major updates and help shape the future of the Spark
platform.

Thank you to everyone who participated in Databricks’ Apache Spark
Survey 2016!

REYNOLD XIN
Chief Architect & Co-Founder of Databricks
@rxin

Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache® Spark™,
a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache
Spark project providing 10x more code than any other company. The company has also trained over 20,000 users on Apache Spark, and has the largest number of
customers deploying Spark to date. Databricks provides a just-in-time data platform, to simplify data integration, real-time experimentation, and robust deployment of
production applications. Databricks is venture-backed by Andreessen Horowitz and NEA. For more information, contact [email protected].

TRY DATABRICKS FOR FREE: databricks.com/try-databricks
CONTACT US FOR A PERSONALIZED DEMO: databricks.com/contact-databricks

© Databricks 2016. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
