2016 Spark Survey
SURVEY 2016
REPORT
Table of Contents
Introduction
Foreword: Matei Zaharia
REPORT HIGHLIGHTS
APACHE SPARK’S GROWTH CONTINUES
  The Apache Spark Community is Growing
  Spark’s Fastest Growing Areas from 2015 to 2016
  Spark Users are Growing
  Spark Users Employ Multiple Languages
  Spark Components Used in Production
  Spark is Used Widely in Organizations
  Users Solve Complex Problems
  Users Employ Multiple Components
  What Users Consider Important
  Top Three Storage Technologies
  Section Summary
APACHE SPARK IN THE CLOUD IS GROWING
  Trend: Increase in Public Cloud Deployments
  Trend: Percentage Decrease in On-Premises Deployments
  Section Summary
APACHE SPARK STREAMING AND MACHINE LEARNING SURGE IN USAGE
  Apache Spark Streaming is Growing
  Apache Spark Streaming Engine is the Preferred Choice
  Section Summary
Afterword: Reynold Xin
About Databricks
Introduction
In July 2016, Databricks conducted an Apache® Spark™ Survey to identify how organizations are using Spark and to highlight growth trends since the Spark Survey 2015. The results in this report reflect answers from 1,615 respondents at over 900 distinct organizations; respondents were predominantly Apache Spark users.

[Figure: respondent roles. Data engineers: 41%. Architects and data scientists: 23% and 21%.]
Foreword: Matei Zaharia
I’m delighted to share the results of this year’s Databricks Apache Spark Survey. As I noted in the previous Spark Survey 2015, we witnessed a rapid adoption of Spark and the precipitous growth of the Spark community. This year, Spark’s growth trajectory and trends continue. In particular, I’m excited to see more Spark deployments in the cloud and more interest in building real-time applications using Spark Streaming with multiple components, such as Machine Learning. Given that Apache Spark 2.0 lays the foundation for Structured Streaming by providing simplified and unified APIs for writing end-to-end streaming applications, called continuous applications, I anticipate this interest will surge further in the coming months with subsequent releases of Spark.

Since its inception, Spark’s core mission has been to make Big Data simple and accessible for everyone, for organizations of all sizes and across all industries. And we have not deviated from that mission. Spark’s growth continues across various industries, with people in various functional roles building complex data solutions. It has moved well beyond the early-adopter phase at tech companies and is now mainstream in large data-driven enterprises.

MATEI ZAHARIA
Chief Technologist at Databricks,
VP of Apache Spark at the Apache Software Foundation
@matei_zaharia
REPORT HIGHLIGHTS
SPARK’S GROWTH
CONTINUES
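attendees, more code contributors, and a surge

[Figure: community growth, 2015 to 2016. Spark Meetup members: 66,000 to 225,000. Code contributors: 600 to 1000. Spark Summit attendees: 3912 to 5100. Companies represented at Summits: 1144 to 1800.]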
Asked what Apache Spark components developers use to build complex solutions
for their use cases, 74% of respondents said they use two or more components
to build different types of products.
[Figure: 74% of respondents use two or more components; 64% of respondents use three or more components.]
[Figure: types of products built. Business / customer intelligence: 68%. Data warehousing: 52%. Real-time / streaming solutions: 45%.]
In addition to using multiple Apache Spark components, many respondents indicated that they
use multiple programming languages in Spark. They are also using multiple components in
production, including increased use of Spark Streaming and MLlib.
[Figure: languages used in Spark, year over year (2015 vs. 2016). % of respondents who use each language (more than one language could be selected).]
[Figure: Spark components used in production, year over year (2015 vs. 2016). % of respondents who use each component in production (more than one component could be selected).]
[Figure: growth in production use from 2015 to 2016. DataFrames: 153%. SQL: 67%. Streaming: 57%. Advanced analytics: 38%.]
51% of users in the 2015 Spark Survey said they deployed Apache Spark in the public cloud, compared with 61% of users in 2016. That is an increase of 10 percentage points, or a relative growth of about 20%.

[Figure: Apache Spark deployment in public clouds since 2015. 2015: 51% of respondents deployed in a public cloud. 2016: 61% of respondents deploy in a public cloud.]
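The 10-point increase and the 20% growth quoted for public cloud deployments describe the same change in two ways. The short Python sketch below shows both calculations from the survey's 51% and 61% figures; the function names are illustrative and not part of the report.

```python
def percentage_point_change(old_pct: float, new_pct: float) -> float:
    """Absolute change between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_growth(old_pct: float, new_pct: float) -> float:
    """Change from old to new, expressed as a percentage of the old value."""
    return (new_pct - old_pct) / old_pct * 100


# Share of respondents deploying Spark in a public cloud (from the survey).
cloud_2015, cloud_2016 = 51, 61

points = percentage_point_change(cloud_2015, cloud_2016)
growth = relative_growth(cloud_2015, cloud_2016)
print(f"{points} percentage points, {growth:.1f}% relative growth")
```

The relative figure comes out slightly under 20%, which matches the report's rounded value.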
real-time applications, respondents in this survey show an affinity toward building real-time applications using the Spark Streaming framework. Among all the streaming engines, 33% of respondents said they were heavy users of Spark Streaming.

[Figure: respondents who consider Apache Spark Streaming very important or somewhat important (35%); 33% of respondents use Apache Spark Streaming a lot.]
[Figure: advanced analytics production cases grew 38% from 2015 to 2016.]
APACHE SPARK’S
GROWTH
CONTINUES
Spark contributors, compared to 600 in 2015 from 250+ organizations. With such large numbers of contributors and organizations investing in Spark’s future development, it has engaged a community of developers globally. The Apache Spark Meetup groups’ membership continues to flourish, both nationally and internationally.

[Figure: Spark Meetup members grew 240%, from 66,000 in 2015 to 225,000 in 2016.]
Every year, more users attend Spark Summit, the largest dedicated conference to the Apache Spark

[Figure: Spark Summit, 2015 vs. 2016. Attendees at Summits: 3912 to 5100, up 30%. Companies represented: 1144 to 1800, up 57%.]
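The community growth percentages can be checked against the 2015 and 2016 counts. This is a minimal Python sketch: `growth_pct` is an illustrative helper, and pairing the 2016 count of 1000 with code contributors is an assumption based on the report's highlights figures.

```python
def growth_pct(before: float, after: float) -> float:
    """Growth from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100


# (2015 count, 2016 count) as quoted in the report.
community = {
    "Spark Meetup members": (66_000, 225_000),
    "Code contributors": (600, 1000),  # assumed pairing, see lead-in
    "Spark Summit attendees": (3912, 5100),
    "Companies represented at Summits": (1144, 1800),
}

growth = {name: growth_pct(a, b) for name, (a, b) in community.items()}
for name, pct in growth.items():
    print(f"{name}: {pct:.1f}%")
```

The computed values sit within one point of the quoted 240%, 67%, 30%, and 57%. The Meetup figure works out to just under 241%, so the report seems to have rounded it down.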
In just two years, the Spark community has shipped six releases. When asked which version of Apache Spark they are using, 75% responded that they are using Spark 1.6, while 18% are using Spark 2.0.

[Figure: 4 releases in 2015 (1.2, 1.3, 1.4, 1.5); 2 major releases in 2016 (1.6, 2.0) as of September 2016. Versions in use: Spark 1.6, 75%; Spark 2.0, 18%; other, 7%.]
[Figure: Spark’s fastest growing areas from 2015 to 2016, as % of respondents.
Streaming users in production: 14% in 2015, 22% in 2016, up 57%.
Windows users in development: 23% in 2015, 32% in 2016, up 39%.
Advanced analytics users (MLlib) in production: 13% in 2015, 18% in 2016, up 38%.]
Usage of Spark in Python, SQL, and R increased, while Scala and Java usage decreased. This indicates that more data analysts are drawn to Spark from areas other than pure data engineering, suggesting that Spark usage is expanding to new and diverse users.

[Figure: languages used in Spark, 2015 vs. 2016: Scala, SQL, Python, R, Java.]
[Figure: growth by component. DataFrames: 153%. SQL: 67%. Streaming: 57%. Advanced analytics: 11%.]
Spark is Used Widely in Organizations

enables it to process diverse workloads.
The banking sector saw the highest percentage

[Figure: respondents by industry, percentages rounded to the nearest integer. Categories: Software (SaaS, Web, Mobile); Banking / Finance; Publishing / Media; Carriers / Telecom; Health / Medical / Pharmacy / Biotech; Consulting (IT); Computers / Hardware; Ecommerce / Retail; Other. Highlighted user groups include Banking users and Health / Medical / Pharmacy / Biotech users; highlighted values: 63%, 39%, 29%.]
[Figure: types of products respondents build.
Recommendation engines: 40%
Log processing: 37%
User-facing services: 36%
Fraud detection / security: 29%]
Users are drawn to Spark for a number of reasons: it’s easier to get started quickly because of simple and consistent APIs; it’s faster because of improvements

[Figure: what users consider important. Features shown: performance, ease of programming, ease of deployment, real-time streaming. Values shown: 91%, 76%, 51%.]
open-source and proprietary SQL data stores.

[Figure: top three storage technologies. Open-source SQL databases: 82% of respondents. Key-value stores (NoSQL): 73% of respondents. Proprietary SQL databases: 58% of respondents.]
Section Summary
APACHE SPARK IN THE
CLOUD IS GROWING
[Figure: percentage decrease in on-premises Spark deployments. Mesos: 36%. YARN: 10%. Standalone: 13%.]
Section Summary
Cloud deployments not only have lower deployment costs and fewer management headaches; they also offer proven performance benefits.
Using Apache Spark on 206 EC2 machines, we sorted 100TB of data on disk in
23 minutes. In comparison, the previous world record set by Hadoop MapReduce
used 2100 machines and took 72 minutes. This means that Spark sorted the
same data 3X faster using 10X fewer machines.
REYNOLD XIN
Chief Architect & Co-Founder of Databricks
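The "3X faster using 10X fewer machines" summary follows directly from the figures in the quote. A minimal Python sketch of that arithmetic is shown here; the combined machine-minutes ratio is an extra illustrative measure that the report does not state.

```python
# Figures quoted above for sorting 100TB of data on disk.
spark = {"machines": 206, "minutes": 23}
hadoop = {"machines": 2100, "minutes": 72}

# How many times faster the Spark run finished.
speedup = hadoop["minutes"] / spark["minutes"]

# How many times fewer machines the Spark run used.
machine_ratio = hadoop["machines"] / spark["machines"]

# Illustrative only: total machine-minutes consumed by each run.
resource_ratio = (hadoop["machines"] * hadoop["minutes"]) / (
    spark["machines"] * spark["minutes"]
)

print(f"{speedup:.1f}X faster, {machine_ratio:.1f}X fewer machines")
print(f"about {resource_ratio:.0f}X fewer machine-minutes")
```

Both headline ratios round to the 3X and 10X quoted in the report.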
APACHE SPARK STREAMING
AND MACHINE LEARNING
SURGE IN USAGE
Since its release, Spark Streaming has become

[Figure: importance of Spark Streaming to respondents’ use cases, from important to not important.]
[Figure: 29% of respondents develop fraud detection / security products; 45% develop real-time streaming products; 40% develop recommendation engine products.]
Organizations use Spark Streaming along with Spark’s other components to develop streaming applications. Both Spark Streaming and MLlib saw a notable increase in production use.

[Figure: Spark Streaming and MLlib use in production. % of respondents who use the component in production (more than one component could be selected). Streaming: 14% in 2015, 22% in 2016, 57% growth in production cases. Advanced analytics (MLlib): 13% in 2015, 18% in 2016, 38% growth in production cases.]
Apache Spark Streaming Engine is the Preferred Choice

Q: Which of these technologies do you currently use a lot for streaming and/or complex event processing cases? Select all that apply.
Apache Spark: 33%
Apache Storm: 6%
Kinesis: 4%
Note: Respondents were predominantly Spark users.

When compared to other Spark components, Spark Streaming matches MLlib at 71% in use, from evaluation to production.

In the 2015 Spark survey, 14% of users said they

[Figure: Apache Spark component popularity. % of respondents who use the component anywhere from evaluation to production (more than one component could be selected). Spark Streaming: 71%. RDDs: 83%. DataFrames: 89%. MLlib: 71%. SQL: 88%.]
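Because respondents could select more than one option, the percentages in these charts are shares of respondents and need not add up to 100%. The Python sketch below tallies a "select all that apply" question; the responses are hypothetical and are not survey data.

```python
from collections import Counter

# Hypothetical responses (NOT survey data): each respondent may pick
# several streaming engines, or none at all.
responses = [
    {"Apache Spark", "Apache Storm"},
    {"Apache Spark"},
    {"Kinesis"},
    {"Apache Spark", "Kinesis"},
    set(),
]

# Count how many respondents selected each engine.
counts = Counter(engine for picks in responses for engine in picks)

# Share of all respondents who selected each engine.
shares = {engine: 100 * n / len(responses) for engine, n in counts.items()}

for engine, pct in sorted(shares.items()):
    print(f"{engine}: {pct:.0f}%")
```

In this made-up sample the shares add up to more than 100%, because two respondents picked two engines each.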
Section Summary
Spark Streaming is being used for real-time solutions, from evaluation to production, at usage levels close to those of Spark’s other commonly used components. It is the preferred streaming engine among respondents, and more organizations are building real-time streaming solutions because they consider streaming an important Spark feature.
Afterword: Reynold Xin
2015 and 2016 have been exciting years for the adoption and increased growth of Apache Spark and its community. Two releases, Spark 1.6 and 2.0, have seen major improvements in all aspects of Spark noted by respondents in this survey as important. I continue to look forward, and work with the community, to the exciting future ahead for the Spark platform.

As Spark becomes easier, faster, and smarter, outside the predominantly IT and Consulting Industry, a newer audience is adopting it, as results from the survey suggest. Performance, ease-of-use, streaming, and reliability top the list as most important features. At the time of this survey, we released Apache Spark 2.0. Ongoing performance improvements, with

Your voice matters. We got an insightful glimpse into the growth and trends from this year’s survey: who’s using Spark, how they are using it, what’s important, what new features they use, and what they are using it for. Just as the feedback from last year’s survey did, these insights will drive major updates and help shape the future of the Spark platform.

Thank you to everyone who participated in Databricks’ Apache Spark Survey 2016!

REYNOLD XIN
Chief Architect & Co-Founder of Databricks
@rxin
Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache® Spark™, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache Spark project, providing 10x more code than any other company. The company has also trained over 20,000 users on Apache Spark and has the largest number of customers deploying Spark to date. Databricks provides a just-in-time data platform to simplify data integration, real-time experimentation, and robust deployment of production applications. Databricks is venture-backed by Andreessen Horowitz and NEA. For more information, contact [email protected].
© Databricks 2016. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.