Anaconda-Journey To ODS


ANACONDA WHITEPAPER

THE JOURNEY TO
OPEN DATA SCIENCE
By: Christine Doig, Data Scientist and Product Marketing Manager
March 2016
IN THIS WHITEPAPER
Migrating from traditional analytics to modern Open Data Science (ODS) is one of the most important
trends of the decade. However, although this migration is critical to the core mission of many enterprises,
practicing Data Science effectively has remained a thorny challenge. Part of this is due to the use of
older proprietary data technologies that do not provide the flexibility and power needed to channel
the full potential of a Data Science team — not just Data Scientists and Developers, but business
experts and other stakeholders. These proprietary technologies are not innovative, transparent or
community supported, and they typically cannot quickly adapt to changes as projects develop.
Fortunately, modern technologies, by leveraging developer talent from across the world through the
open source paradigm, now provide comprehensive, best-of-breed and cost effective ecosystems. Not
only do these technologies meet current challenges, but they also address potential future struggles,
since they can rapidly adapt in ways that proprietary Data Science technologies cannot.
In this paper, you’ll learn why Open Data Science is the foundation for modernizing data analytics, and:
• Why availability, interoperability, transparency and innovation are
some of the most important benefits of the ODS approach
• The best path and important considerations when moving to ODS
• Why the Anaconda platform is the best solution for ODS
• How you can leverage your current investment in analytics while moving to ODS

OPEN DATA SCIENCE


Open Data Science is not a single technology, but a revolution within the Data Science community. ODS is an inclusive movement that makes open source tools for Data Science (data, analytics and computation) easily work together as a connected ecosystem. Data is everywhere, and ODS has emerged to meet the demands of modern analytics. The promise of ODS rests on four fundamental principles:

· Availability
· Innovation
· Interoperability
· Transparency

Availability. ODS avoids the drawbacks of former approaches to Data Science that limited access to holistic tools for all members of the Data Science team. Traditional approaches relied on proprietary technologies that were expensive (sometimes prohibitively so), designed for a single isolated purpose (such as visualization but not machine learning) and did not easily interoperate. For these reasons, data analytics teams were forced to commit to long evaluation periods and to work with monolithic tools that had to be integrated together and were less than optimal for their needs.

With ODS, all of these concerns evaporate. Technologies are nonproprietary, free and open source, making powerful tools available to both individuals and enterprise teams. This access allows anyone to use a large and flourishing interconnected ecosystem of analytic technologies, giving the Data Scientist the best tool for his or her problem today and tomorrow as technology advances. This future proofing is underscored by the open source guarantee that source code is available no matter what the circumstances, which permits the Data Science team to confidently build solutions that will endure the test of time.

2 · continuum.io
Innovation. Data Science thrives on innovation, but traditional approaches tied to proprietary technologies are typically slow and rigid. When changes to proprietary products are made, they are made on vendors’ schedules, and it has sometimes taken years for innovations to reach the market.

Data Scientists have become impatient with proprietary software that, because of its limitations, moves at a glacial pace and forces them into time-consuming, complex workarounds.

Meanwhile, the open source community delivers a simple, graceful way to deal with today’s issues. With ODS, innovation comes from many different open source communities, including Python, R, Java, Hadoop, Scala, Julia and others. In these communities, broad-based peer review brings fresh suggestions that quickly optimize the research trajectory, making the latest science available through the open source paradigm. In contrast to the slow-moving proprietary approach, ODS invites experimentation, quick-to-fail approaches and rapid prototyping because it is flexible, customizable and inexpensive. This spreads the fast-paced innovation velocity of open source to the Data Science team. Ideas and contributors from across your Data Science team, including domain experts such as biologists, physicists, economists or others, are introduced to new ways of looking at the data. This leads to new models that deliver cutting-edge insights and unlock value that has been trapped in your data.

Interoperability. Monolithic tools typically integrate with their own suite but are either closed to integrating with outside tools or provide an inferior, often slower method when doing so. Many times the suite is a composite of tools acquired by the vendor over time, and these tools are clunky and sluggish when they do interoperate.

ODS excels at integrating tools from across the open source ecosystem. By the very nature of open source, contributors look to leverage existing know-how, code and tools. This type of approach eradicates the typical boundaries established by proprietary vendors that prevent interoperability. Python, in particular, is known as the “glue language” that makes it easy to create code to connect software components into a seamless ecosystem. This includes the ability to embrace new open source technologies, such as Hadoop and Julia, and to bring legacy code, including Fortran and C/C++, into modern solutions.

Transparency. Proprietary software is a “black box” whose internal structure and processing are confidential. The algorithms and their implementations are encapsulated in the proprietary software. In the past this was acceptable, since the software technology era was in its infancy and there were few choices for solving Data Science problems. Thanks to decades of academic research, there is an abundance of ODS tools that disclose their algorithms and processing techniques to the public via open source, so that Data Scientists can ensure the technique is appropriate for the problem at hand. Additionally, Data Scientists can take open source code and improve it to suit their problems and environments. This flexibility makes it easier and faster for the Data Science team to deliver higher value solutions.

Transparency also feeds back into innovation, since anyone from the community can check the aptness of the analytics. It also greatly stimulates future contributions, allowing students in universities to view and engage with open source. This results in new ideas, research and innovation that continuously advance the ODS ecosystem.

While proprietary vendors are aware of the sea change in Data Science and are trying to embrace it, their attempts to do so are typically out of sync with market needs and appear overhyped and disappointing. In short, traditional proprietary analytic platforms are not meeting the needs of modern Data Science teams that expect four fundamental principles: 1) Availability, 2) Innovation, 3) Interoperability and 4) Transparency. As more businesses try to rapidly unlock the value of their data in modern architectures, ODS becomes essential to their strategy.
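The “glue language” role described under Interoperability can be illustrated with a minimal sketch. It uses only Python's standard library to connect three independent components: a CSV reader, an embedded SQL engine (SQLite) and a JSON encoder. The sample data, the table name and the function name `summarize_sales` are hypothetical and chosen only for illustration; a real project would more likely glue together larger tools from the ecosystem.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data standing in for a file exported by another tool.
CSV_TEXT = """region,sales
north,120
south,80
north,30
"""


def summarize_sales(csv_text):
    """Load CSV rows into an in-memory SQLite table and return totals as JSON."""
    # Component 1: parse the CSV text into dictionaries.
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Component 2: let a SQL engine do the aggregation.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?)",
            [(row["region"], int(row["sales"])) for row in rows],
        )
        totals = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
        ).fetchall()
    finally:
        conn.close()

    # Component 3: hand the result to the next tool as JSON.
    return json.dumps(dict(totals), sort_keys=True)


if __name__ == "__main__":
    print(summarize_sales(CSV_TEXT))
```

Each step could be swapped for a different tool without changing the others, which is the practical meaning of interoperability in this context.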

[Figure: Traditional analytics versus modern analytics, compared by technology (pandas is among the labels shown) and by teams (labels include Developer, Data Scientist, DevOps, Statistician, DBA and Ops).]

MOVING TO OPEN DATA SCIENCE


Moving to any new technology has an impact on your team, IT infrastructure, development process and workload. Because of this, proper planning is essential. The drivers for change are different in every organization, so the speed and approach to the transition will also vary.

Team. Shifting to an ODS paradigm requires changes. Successful projects begin with people, and ODS is no different. New organizational structures (e.g., a center of excellence, lab teams or emerging technology teams) are a way to dedicate personnel to jumpstart the changes. These groups are typically chartered with actively seeking out new ODS technologies and determining their fit and value to the organization. This facilitates adoption of ODS and bridges the gap between traditional IT and lines of business. Additionally, roles may shift (e.g., from statistician to Data Scientist and from database administrator to Data Engineer), and new roles, such as Computational Scientist, will emerge.

With these changes, the team will need additional training to become proficient in ODS tools. While instructor-led training is still the norm, there are also many learning opportunities for ODS available online where the team can self-teach using ODS tools. With ODS, recruiting knowledgeable resources is much easier across disciplines (scientists, mathematicians, engineers, business and more), as open source is the de facto standard in most universities worldwide. This results in a new generation of talent that can be brought onboard for Data Science projects.

Whether trained at university or on the job, the Data Science team needs the ability to integrate multiple tools into their workflow quickly and easily in order to be effective and highly productive. Most skills-ready university graduates are very familiar with collaborating with colleagues across geographies from their university experience. Many are also familiar with Notebooks, an ODS tool that facilitates the sharing of code, data, narratives and visualizations. This familiarity matters because collaboration is crucial to the success of Data Science.

Research shows that the highest indicator of success for Data Scientists is curiosity. ODS satisfies their curiosity and makes them happy as they are constantly learning new and innovative ways to deliver Data Science solutions. Moving to ODS increases morale as Data Scientists get to build on the shoulders of giants who created the foundation for modern analytics. They feel empowered by being able to use their choice of tools, algorithms and compute environments to get the job done in a productive and impactful way that satisfies their natural curiosity and desire to make a meaningful impact with their work.

Technology. Selecting technology with ODS is significantly easier than with proprietary software, because the software is freely available for download. This allows the Data Science team to self-serve their own proof of concept, trying out the ODS technology against the specific needs of their organization. For Data Science, there is no shortage of choices. Open source languages such as Python, R, Scala and Julia are frontrunners in ODS, and each of these languages offers many different open source libraries for data analysis, mathematics and data presentation, such as NumPy, SciPy, pandas, matplotlib and others, available at no cost and with open source licensing. No matter what your goals are in Data Science, there will be an open source project that meets your needs.

Some open source software only works effectively on local client machines, while other open source software supports scale-out architectures, such as Hadoop. Typically, a commercial vendor fills the gap by supporting a wider variety of modern architectures.

Migration. The strategy for migrating to ODS should align with the business objectives and risk tolerance of the organization. It is not necessary to commit to a full recoding to ODS from the start. There is a range of strategies from completely risk averse (do nothing) to higher risk (recode), each with its own pros and cons.

A coexistence strategy is fairly risk averse and allows the team to learn the new technology, typically on greenfield projects, while keeping the legacy proprietary technology in place. This minimizes disruption and, when the Data Science team is familiar and comfortable with the ODS tools, existing projects start to migrate to ODS when limits are reached with the proprietary technology. The proprietary technologies are then phased out over time.

A migration strategy is slightly riskier and moves existing solutions into ODS by reproducing the solution as-is with any and all limitations. This is often accomplished by outsourcing the migration to a knowledgeable third party who is proficient in the proprietary technology as well as ODS. A migration strategy can take place over time by targeting low-risk projects of limited scope until all existing Data Science code has been migrated to ODS. Migration strategies can also migrate all the legacy code via a “big bang” cutover. The Data Science solutions are then improved to remove the legacy limitations over time, usually using a continuous integration and continuous delivery (CI/CD) methodology.

A recoding strategy is higher risk and takes advantage of the entire modern analytics stack to reduce cost, streamline code efficiency, decrease maintenance and create higher impact business value, often through faster performance or from adding new data to drive better results and value. The objective of recoding is to remove limitations and constraints of legacy code by taking full advantage of ODS on modern compute infrastructure. With this strategy, a full risk assessment is often completed to determine the prioritization of projects for recoding. The full risk assessment includes estimates for cost reduction and improved results to determine the risk.

The introduction of Big Data projects has become an ideal scenario for many companies to use a coexistence strategy for their Big Data projects, leaving legacy environments as-is and using ODS on Hadoop.
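Because a migration strategy reproduces an existing solution as-is, one practical safeguard is a parity check that runs the legacy logic and the migrated logic on the same inputs and flags any disagreement. The sketch below is an assumption about how such a check might look, not a method prescribed by this paper. Both payment functions, the test cases and the tolerance are hypothetical stand-ins; in practice the legacy side would usually be outputs recorded from the proprietary system.

```python
import math


def legacy_monthly_payment(principal, annual_rate, months):
    """Hypothetical stand-in for a calculation performed by the legacy system."""
    r = annual_rate / 12
    return principal * r / (1 - (1 + r) ** -months)


def migrated_monthly_payment(principal, annual_rate, months):
    """Hypothetical reimplementation of the same calculation after migration."""
    r = annual_rate / 12
    growth = (1 + r) ** months
    return principal * r * growth / (growth - 1)


def find_mismatches(legacy_fn, migrated_fn, cases, rel_tol=1e-9):
    """Return (case, legacy, migrated) for every case where results differ."""
    mismatches = []
    for case in cases:
        expected = legacy_fn(*case)
        actual = migrated_fn(*case)
        if not math.isclose(expected, actual, rel_tol=rel_tol):
            mismatches.append((case, expected, actual))
    return mismatches


# Hypothetical inputs: (principal, annual interest rate, number of months).
CASES = [(200000, 0.05, 360), (15000, 0.03, 48), (500, 0.12, 6)]

if __name__ == "__main__":
    problems = find_mismatches(
        legacy_monthly_payment, migrated_monthly_payment, CASES
    )
    print("mismatches:", problems)
```

A check like this fits naturally into the CI/CD process mentioned above: it can run on every change while legacy limitations are removed incrementally.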

[Figure: Migration strategies for the journey to Open Data Science: Do Nothing, Co-exist, Migrate, Recode.]

THE ANACONDA PLATFORM: OPEN DATA SCIENCE’S ONE-STOP SOLUTION
Anaconda is the leading modern open source analytics platform powered by Python. Anaconda enables modernization to Open Data Science as a platform that embraces the entire ODS ecosystem (Python, R, Java, Scala, Hadoop and more) across the entire modern analytics stack, from desktop to server to clusters and cloud for enterprises. Anaconda makes it easy to:

· Create, collaborate and deploy Data Science solutions
· Leverage modern architectures and frameworks
· Set up and manage open source

Create, Collaborate and Deploy Data Science Solutions. Anaconda is an inclusive platform that brings all Data Science roles together to easily collaborate on high impact Data Science solutions. Data Science teams create models with their favorite languages and tools, including Jupyter Notebooks, IDEs, data exploration, data mining and Microsoft Excel, along with over 720 open source certified analytic Python and R packages. The Python and R packages are used for data prep, data mining, stats, machine learning, deep learning, simulation and optimization, text and natural language processing, geospatial, video/image/audio mining and graph and network optimization. Legacy analytics written in Fortran and C/C++ can also be brought into modern Data Science solutions with Anaconda. These solutions can be intelligent web apps or interactive visualizations, embedded into processes via RESTful APIs or dashboards for ad hoc analysis or production deployments. The breadth and depth of the Anaconda platform make it feasible to move all your legacy analytics to ODS without making sacrifices.

Leverage Modern Architectures and Frameworks. Anaconda is an enterprise-ready platform that delivers high performance, security and authentication to support the most demanding Data Science solutions. The compiler included in Anaconda delivers significant throughput that is comparable to the performance of C while making it much easier to create and maintain the Data Science solutions. Anaconda exploits modern architectures for both scaling up and scaling out processing onto clusters and multi-core CPUs and GPUs to deliver cost-effective Data Science solutions. Anaconda can co-exist alongside legacy and modern stacks, including Hadoop, Spark, Elasticsearch and others. Additionally, Anaconda can process inside Hadoop, eliminating data movement, easily parallelizing workloads to move computations to the data, and bypassing Hadoop overhead to read and write directly to HDFS files, all to deliver extremely high performance. These innovations allow migrated legacy analytics to achieve faster throughput while exploiting new data.

Set Up and Manage Open Source. Anaconda includes conda, an open source, multilanguage (e.g. Python, R, Java, Fortran, C/C++), cross-platform (e.g. Windows, Linux, OS X) package, dependency and environment manager that makes open source convenient and feasible for enterprises.

Conda promotes innovation and collaboration by giving teams access to the latest certified open source packages and by making installation a breeze. This supports innovation across the development cycle, as new packages or newer versions are needed. This same capability allows the Data Science team to package their own innovations to share with other teams. This fast and simple way to reproduce environments makes it easy to migrate Data Science solutions across the development, user acceptance testing (UAT) and production systems. With Anaconda, your Data Science solutions are effortlessly deployed into production. While migrating to open source may seem difficult and laborious, Anaconda makes it stress-free to set up and deploy ODS solutions.
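As a rough illustration of how environments can be reproduced across development, UAT and production, conda can recreate an environment from an `environment.yml` file that lists a name, channels and dependencies. The sketch below only renders the text of such a file. The helper `render_environment_yml`, the environment name `ods-demo` and the package pins are hypothetical and chosen for illustration.

```python
def render_environment_yml(name, channels, dependencies):
    """Return the text of a conda environment file with the given contents."""
    lines = ["name: {}".format(name), "channels:"]
    lines += ["  - {}".format(channel) for channel in channels]
    lines.append("dependencies:")
    lines += ["  - {}".format(dep) for dep in dependencies]
    return "\n".join(lines) + "\n"


# Hypothetical environment: the name and package pins are illustrative only.
ENV_TEXT = render_environment_yml(
    name="ods-demo",
    channels=["defaults"],
    dependencies=["python=3.5", "numpy", "pandas"],
)

if __name__ == "__main__":
    # Save this output as environment.yml, then recreate the environment
    # on another machine with:
    #   conda env create -f environment.yml
    print(ENV_TEXT)
```

An existing environment can also be captured with `conda env export > environment.yml`, so the same file can travel with the project from one system to the next.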

[Figure: The Anaconda workflow across three stages (Explore & Analyze, Collaborate & Publish, Deploy & Operate), covering interactive notebooks, data apps and visualization, and models, with Data Scientist, Developer and DevOps roles.]

SUMMARY
Open Data Science is the foundation to modernizing your data analytics. With Anaconda, you can now:

· Leverage the full power and innovation of open source for all your Data Science needs

· Collaborate in a secure fashion with teams across the globe

· Integrate with legacy analytics to maximize your investments

· Reduce costs while experiencing new freedom to select the best tool and
architecture that fits your enterprise now and in the future

While moving to ODS can seem intimidating, Anaconda delivers an enterprise platform that makes the journey to ODS simple and painless.

In an Open Data Science world, Anaconda is your key to unlocking the value in your data.

ABOUT CONTINUUM ANALYTICS

Continuum Analytics is the creator and driving force behind Anaconda, the leading modern open source analytics
platform powered by Python. We put superpowers into the hands of people who are changing the world.

With more than 2M downloads annually and growing, Anaconda is trusted by the world’s leading businesses across
industries – financial services, government, health & life sciences, technology, retail & CPG, oil & gas – to solve the world’s
most challenging problems. Anaconda does this by helping everyone in the Data Science team discover, analyze and
collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage their Open Data Science
environments without any hassles to harness the power of the latest open source analytic and technology innovations.

Our community loves Anaconda because it empowers the entire Data Science team – Data Scientists, Developers, DevOps,
Architects, and Business Analysts – to connect the dots in their data and accelerate the time-to-value that is required in
today’s world. To ensure our customers are successful, we offer comprehensive support, training and professional services.

Continuum Analytics’ Founders and Developers have created or contribute to some of the most popular
Open Data Science technologies, including NumPy, SciPy, Matplotlib, Pandas, Jupyter/IPython, Bokeh, Numba
and many others. Continuum Analytics is venture-backed by General Catalyst and BuildGroup.

To learn more, visit http://www.continuum.io
