
11 BEST PRACTICES FOR DATA ENGINEERS

How to Drive Profitability with Your Data
Introduction
#1: Enable your pipeline to handle concurrent workloads
#2: Tap into existing skills to get the job done
#3: Use data streaming instead of batch ingestion
#4: Streamline pipeline development processes
#5: Operationalize pipeline development
#6: Invest in tools with built-in connectivity
#7: Incorporate extensibility
#8: Enable data sharing in your pipelines
#9: Choose the right tool for data wrangling
#10: Build data cataloging into your engineering strategy
#11: Rely on data owners to set security policy
About Snowflake
INTRODUCTION
There’s never been a better time to be a data engineer. Less than a year ago, CNBC ranked data engineer as one of the 25 fastest-growing jobs in the U.S.¹ In fact, according to the real-time jobs feed Nova, data engineer was the fastest-growing job title for 2018.²

But the parameters of the job are changing quickly. Databases and data warehouses are moving to the cloud, and new tools and data pipelines are taking over traditional data engineering tasks such as manually writing ETL code and cleaning data. As a result, companies are asking engineers to provide guidance on data strategy and pipeline optimization. In addition, as information grows exponentially and as the sources and types of data become more complicated, engineers must know the latest strategies and tools to help the business leverage that data for increased profitability and growth.

“Data engineers have become valuable resources that can harness the value of data for business objectives, which ultimately plays a strategic role in a complex landscape that is essential to the entire organization,” says big-data news portal Datanami. “Understanding and navigating data needs has the ability to empower data engineers to propel an organization into a thriving data-first company.”³

If you’re a data engineer looking to make the right decisions about data strategies and tools for your organization, here are 11 best practices for data engineering that can mean the difference between profitability and loss.

¹ cnbc.com/2018/12/11/payscale-the-25-fastest-growing-jobs-of-2018.html
² insights.dice.com/2018/12/27/data-engineer-2018-hottest-tech-jobs/
³ datanami.com/2019/07/18/data-engineers-the-c-suites-savior/
11 BEST PRACTICES FOR DATA ENGINEERS
1. ENABLE YOUR PIPELINE TO HANDLE CONCURRENT WORKLOADS
To be profitable, businesses need to run many data analysis processes simultaneously, and they need systems that can keep up with the demand. Data comes into the enterprise 24 hours a day, seven days a week, from the web, mobile devices, and Internet of Things (IoT) devices. Your data pipeline has to load and process that data while scientists are analyzing the data and downstream applications are processing it for further use. A modern data pipeline that lives in the cloud features an elastic multi-cluster, shared data architecture that enables the handling of concurrent workloads. It can allocate multiple independent, isolated clusters for processing, data loading, transformation, and analytics while sharing the same data concurrently without resource contention.
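In Snowflake, for example, this elasticity is exposed as a multi-cluster virtual warehouse. A minimal sketch follows; the warehouse name and sizing values are illustrative, not a recommendation:

```sql
-- Hypothetical multi-cluster warehouse: Snowflake adds clusters as
-- concurrent queries queue up, and suspends them when demand drops.
CREATE WAREHOUSE etl_wh WITH
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4         -- scale out under concurrent load
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 300       -- seconds of inactivity before suspending
  AUTO_RESUME       = TRUE;
```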

2. TAP INTO EXISTING SKILLS TO GET THE JOB DONE
Many pipelines use complex algorithms that seemingly require data engineers to use Apache Spark, Apache Kafka, or Python. But you don’t have to learn new platforms to solve problems. Instead, find a way to use your current skills. For example, modern ETL enables you to accomplish your stream processing task using direct SQL statements rather than using Kafka. Maximize your current skills before you invest resources learning something new.
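As one sketch of SQL-only stream processing, Snowflake’s streams and tasks can do change-capture work without an external streaming system; the table, task, and warehouse names here are hypothetical:

```sql
-- Capture row-level changes to a raw table in plain SQL.
CREATE OR REPLACE STREAM orders_stream ON TABLE raw.orders;

-- A scheduled task drains the stream with an ordinary INSERT.
CREATE OR REPLACE TASK transform_orders
  WAREHOUSE = etl_wh
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
AS
  INSERT INTO analytics.orders_clean
  SELECT order_id, customer_id, amount
  FROM orders_stream
  WHERE METADATA$ACTION = 'INSERT';

ALTER TASK transform_orders RESUME;  -- tasks are created suspended
```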

3. USE DATA STREAMING INSTEAD OF BATCH INGESTION
Data comes into your business 24 hours a day, so periodic batch ingestion can miss recent events. This can have catastrophic consequences, such as failure to detect fraud or a data breach. Stale data can affect profitability as well. For example, a company running an online shopping event wants immediate insight into which products are most viewed, most purchased, and least popular, so it can quickly take actions such as changing the website’s layout to drive more sales. Set up continuous streaming ingestion to decrease pipeline latency and enable the business to use data from a few minutes ago instead of a day ago. Understand the available streaming capabilities and how they work with different architectures, and implement pipelines that can handle both batch and streaming data.
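One way to implement continuous ingestion is Snowflake’s Snowpipe, sketched below; the stage, target table, and file format are assumptions for illustration:

```sql
-- Hypothetical continuous ingestion: Snowpipe loads each file as it
-- lands in cloud storage, instead of waiting for a nightly batch run.
CREATE OR REPLACE PIPE clickstream_pipe
  AUTO_INGEST = TRUE            -- triggered by storage event notifications
AS
  COPY INTO raw.click_events
  FROM @raw.events_stage
  FILE_FORMAT = (TYPE = 'JSON');
```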
4. STREAMLINE PIPELINE DEVELOPMENT PROCESSES
To ensure the validity of production data, build pipelines in a test environment, where you can test code and algorithms iteratively until they are ready for a production environment. By using a cloud data platform as the foundation for running data pipelines, creating a test environment can be as simple as creating a clone of an existing environment, without the rigor of managing new databases and infrastructure. This takes you from development to test to production far faster than building the same pipelines on premises.
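In Snowflake, for instance, cloning is a metadata operation, so a test environment costs no extra storage up front; the database names below are illustrative:

```sql
-- Hypothetical zero-copy clone: an instant, isolated copy of production
-- for iterating on pipeline code without touching the real data.
CREATE DATABASE pipeline_dev CLONE pipeline_prod;
```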
5. OPERATIONALIZE PIPELINE DEVELOPMENT
After creating a pipeline, you may have to modify it or scale it to accommodate more data sources. Design your pipelines so they can be easily modified or scaled. The concept is known as “DataOps,” or DevOps for data, and it consists of building continuous integration, delivery, and deployment into the pipeline using automation and, in some cases, artificial intelligence (AI). Incorporating DataOps in your pipeline will make your data more reliable and more available.
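A minimal sketch of what one continuous-integration step might look like, assuming a clone-based test schema and illustrative object names:

```sql
-- Hypothetical CI step: deploy into an isolated clone, run a smoke
-- test, and promote to production only if the test passes.
CREATE OR REPLACE SCHEMA ci_run CLONE analytics;

-- Deploy the changed transformation (idempotent CREATE OR REPLACE).
CREATE OR REPLACE VIEW ci_run.orders_summary AS
  SELECT customer_id, SUM(amount) AS total_spend
  FROM ci_run.orders_clean
  GROUP BY customer_id;

-- The CI job fails the build if the deployed view returns no rows.
SELECT COUNT(*) FROM ci_run.orders_summary;
```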
6. INVEST IN TOOLS WITH BUILT-IN CONNECTIVITY
A modern, cloud-based data pipeline accommodates many tools and platforms that need to communicate with each other. Building connections between source systems, data warehouses, data lakes, and analytics applications takes time, labor, and money. Instead, invest in tools that have built-in connections to each other. If your tools don’t have connectivity, take the extra step of storing data in a generic format, such as files in Amazon Simple Storage Service (S3), so other tools can pick it up.
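For example, a Snowflake external stage can read and write such a shared S3 location directly; the bucket and table names below are made up, and authentication (a storage integration or credentials) is omitted for brevity:

```sql
-- Hypothetical handoff point: a stage over a shared S3 bucket that
-- Snowflake and other tools can both read from and write to.
CREATE OR REPLACE STAGE exchange_stage
  URL = 's3://acme-data-exchange/exports/'
  FILE_FORMAT = (TYPE = 'PARQUET');

-- Unload query results as Parquet files for downstream tools.
COPY INTO @exchange_stage/orders/
FROM analytics.orders_clean;
```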
7. INCORPORATE EXTENSIBILITY
Organizations use many disparate tools to derive meaning from their data. For example, an organization may write custom APIs to scan images and extract text from them. Another example of a custom algorithm is sentiment analysis of customer service chats. Make sure you build modern pipelines that can leverage this code. By using APIs and pipelining tools, you can create a data flow that uses outside code seamlessly.
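As one sketch of extensibility, custom logic can be wrapped in a user-defined function and called from the pipeline like any built-in; this toy keyword counter merely stands in for a real sentiment model:

```sql
-- Hypothetical UDF embedding custom code in the pipeline. Note that
-- JavaScript UDF arguments are referenced in uppercase (CHAT).
CREATE OR REPLACE FUNCTION naive_sentiment(chat STRING)
RETURNS FLOAT
LANGUAGE JAVASCRIPT
AS $$
  var pos = (CHAT.match(/great|thanks|love/gi) || []).length;
  var neg = (CHAT.match(/bad|angry|refund/gi) || []).length;
  return pos - neg;
$$;

-- Use it inline, like any built-in function.
SELECT chat_id, naive_sentiment(chat_text) AS score
FROM support.chats;
```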

8. ENABLE DATA SHARING IN YOUR PIPELINES
Often, multiple groups inside and outside of your organization need the same core data to perform their analyses. For example, a retailer may need to share sales data with three different suppliers. Building separate pipelines with the same data would take time and cost money. As an alternative, modern cloud tools enable you to create a shared pipeline and govern who can access the data. Shared pipelines get the right information to the right people quickly.
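In Snowflake, this kind of governed sharing looks roughly like the following; the database, table, and account identifiers are placeholders:

```sql
-- Hypothetical secure share: suppliers query the same live table;
-- no copies, no separate pipelines.
CREATE SHARE supplier_sales;
GRANT USAGE  ON DATABASE analytics                    TO SHARE supplier_sales;
GRANT USAGE  ON SCHEMA   analytics.public             TO SHARE supplier_sales;
GRANT SELECT ON TABLE    analytics.public.daily_sales TO SHARE supplier_sales;

-- Govern exactly which consumer accounts can access the share.
ALTER SHARE supplier_sales ADD ACCOUNTS = supplier_one_account;
```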
9. CHOOSE THE RIGHT TOOL FOR DATA WRANGLING
A data wrangling tool can fix inconsistencies in data, transforming distinct entities such as fields, rows, or data values within a data set so they’re easier to leverage. For example, the store name “Giantmart” might arrive in your pipeline from different sources in different ways, such as “Giant-Mart,” “Giantmart Megacenter,” and “Giant-mart Inc.” This can cause problems as the data is loaded and analyzed. Cleaner data equals better, more accurate insights for business decision-making.
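A wrangling rule for the “Giantmart” example might reduce to something like this; the pattern and table name are illustrative:

```sql
-- Hypothetical normalization: collapse the known spelling variants of
-- one store into a single canonical name before loading.
SELECT raw_store_name,
       CASE
         WHEN REGEXP_LIKE(raw_store_name, 'giant[- ]?mart.*', 'i')
           THEN 'Giantmart'
         ELSE raw_store_name
       END AS store_name
FROM staging.sales_feed;
```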
10. BUILD DATA CATALOGING INTO YOUR ENGINEERING STRATEGY
Analysts may have questions about the data in your pipeline, such as where it came from, who has accessed it, or which business process owns it. A data scientist may need to view the data in its raw form in order to ensure its veracity. End users may also want to know which data sets can be trusted and which are a work in progress. Build a data catalog that keeps track of data lineage so you can trace the data if needed. This will increase end users’ trust in the data and will also improve the data’s accuracy.
ENGINEERING STRATEGY
Analysts may have questions about the data in
your pipeline such as where it came from, who has
accessed it, or which business process owns it. A
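Once the owner decides the rule, the engineer can encode it, for example, as a masking policy; a sketch under assumed role and column names:

```sql
-- Hypothetical field obfuscation: only the owner-approved role sees
-- real email addresses; everyone else gets a masked value.
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_OWNER') THEN val
    ELSE '*** MASKED ***'
  END;

ALTER TABLE crm.customers MODIFY COLUMN email
  SET MASKING POLICY email_mask;
```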
The world of data engineering is changing quickly. Technologies such as IoT, AI, and the cloud are transforming data pipelines and upending traditional methods of data management. The decisions you make about your data pipeline, whether large or small, can have a significant impact on the business. The wrong choices mean increased costs and time spent on unnecessary tasks. The right decisions enable the business to harness the power of data to achieve profitability and growth for years to come.

ABOUT SNOWFLAKE
Snowflake Cloud Data Platform shatters the barriers that prevent organizations from unleashing the true value from their data.
Thousands of customers deploy Snowflake to advance their businesses beyond what was once possible by deriving all the insights
from all their data by all their business users. Snowflake equips organizations with a single, integrated platform that offers the only
data warehouse built for any cloud; instant, secure, and governed access to their entire network of data; and a core architecture to
enable many other types of data workloads, including a single platform for developing modern data applications.
Snowflake: Data without limits. Find out more at snowflake.com.

© 2019 Snowflake, Inc. All rights reserved. snowflake.com #YourDataNoLimits
