11 Best Practices For Data Engineers
1. ENABLE YOUR PIPELINE TO HANDLE CONCURRENT WORKLOADS

To be profitable, businesses need to run many data analysis processes simultaneously, and they need systems that can keep up with the demand. Data comes into the enterprise 24 hours a day, seven days a week, from the web, mobile devices, and Internet of Things (IoT) devices. Your data pipeline has to load and process that data while scientists are analyzing it and downstream applications are processing it for further use. A modern data pipeline that lives in the cloud features an elastic, multi-cluster, shared data architecture that can handle concurrent workloads: it can allocate multiple independent, isolated clusters for processing, data loading, transformation, and analytics, all sharing the same data concurrently without resource contention.
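
With a cloud data platform, this kind of workload isolation can be declared directly in SQL. Here is a minimal sketch in Snowflake-style SQL, assuming illustrative warehouse names and sizes not prescribed by this guide; each warehouse is an independent compute cluster working against the same shared data:

  -- Separate virtual warehouses isolate loading, transformation, and
  -- analytics so they never compete for the same compute resources.
  CREATE WAREHOUSE IF NOT EXISTS load_wh      WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60;
  CREATE WAREHOUSE IF NOT EXISTS transform_wh WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 60;
  CREATE WAREHOUSE IF NOT EXISTS analytics_wh WAREHOUSE_SIZE = 'LARGE'
    MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4;  -- scales out as concurrent queries grow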

2. MAXIMIZE YOUR CURRENT SKILLS

Often you can accomplish a data processing task using direct SQL statements rather than using Kafka. Maximize your current skills before you invest resources learning something new.
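
For instance, an incremental upsert that might otherwise prompt new streaming infrastructure can often be expressed as a single SQL statement. A sketch, with hypothetical table and column names:

  -- Fold newly staged records into the target table in one statement.
  MERGE INTO customers tgt
  USING staged_customers src
    ON tgt.customer_id = src.customer_id
  WHEN MATCHED THEN UPDATE SET email = src.email, updated_at = src.updated_at
  WHEN NOT MATCHED THEN INSERT (customer_id, email, updated_at)
    VALUES (src.customer_id, src.email, src.updated_at);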

3. USE DATA STREAMING INSTEAD OF BATCH INGESTION

Data comes into your business 24 hours a day, so periodic batch ingestion can miss recent events. This can have catastrophic consequences, such as failure to detect fraud or a data breach. Stale data can affect profitability as well. For example, a company running an online shopping event wants immediate insight into which products are most viewed, most purchased, and least popular, so it can quickly take actions such as changing the website's layout to drive more sales. Set up continuous streaming ingestion to decrease pipeline latency and enable the business to use data from a few minutes ago instead of a day ago. Understand the available streaming capabilities and how they work with different architectures, and implement pipelines that can handle both batch and streaming data.
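
On Snowflake, for example, continuous ingestion can be set up with Snowpipe, which loads files shortly after they land in a stage. A minimal sketch, assuming a pre-existing external stage and illustrative names:

  -- New files arriving in @events_stage are loaded automatically,
  -- so queries see data that is minutes old rather than a day old.
  CREATE PIPE IF NOT EXISTS events_pipe AUTO_INGEST = TRUE AS
    COPY INTO raw_events
    FROM @events_stage
    FILE_FORMAT = (TYPE = 'JSON');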

4. STREAMLINE PIPELINE DEVELOPMENT PROCESSES

To ensure the validity of production data, build pipelines in a test environment, where you can test code and algorithms iteratively until they are ready for production. By using a cloud data platform as the foundation for running data pipelines, creating a test environment can be as simple as creating a clone of an existing environment, without the overhead of managing new databases and infrastructure. This makes the path from development to test to production far faster than building the same pipelines on premises.
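
In Snowflake-style SQL, for example, a test environment can be a zero-copy clone of production, created and discarded in seconds (the database names here are illustrative):

  -- Clone production metadata only; no data is physically copied.
  CREATE DATABASE pipeline_test CLONE pipeline_prod;
  -- ...develop and test against the clone, then discard it.
  DROP DATABASE pipeline_test;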

5. OPERATIONALIZE PIPELINE DEVELOPMENT

After creating a pipeline, you may have to modify it or scale it to accommodate more data sources, so design your pipelines to be easily modified and scaled. This approach is known as "DataOps," or DevOps for data, and it consists of building continuous integration, delivery, and deployment into the pipeline using automation and, in some cases, artificial intelligence (AI). Incorporating DataOps in your pipeline will make your data more reliable and more available.
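
One concrete DataOps building block is an automated quality gate that runs on every change before deployment. A sketch of such a check, with hypothetical table and column names; the CI/CD job would fail the deployment if either count is nonzero:

  -- Basic data-quality assertions for a continuous integration run.
  SELECT
    COUNT_IF(customer_id IS NULL)            AS null_keys,
    COUNT(*) - COUNT(DISTINCT customer_id)   AS duplicate_keys
  FROM staged_customers;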

6. INVEST IN TOOLS WITH BUILT-IN CONNECTIVITY

A modern, cloud-based data pipeline accommodates many tools and platforms that need to communicate with each other. Building connections between source systems, data warehouses, data lakes, and analytics applications takes time, labor, and money. Instead, invest in tools that have built-in connections to each other. If your tools don't have connectivity, take the extra step of storing data in a generic form, such as the formats used with Amazon Simple Storage Service (S3), so other tools can pick it up.
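
A sketch of that fallback pattern in Snowflake-style SQL, with a hypothetical bucket and table, and with stage credentials omitted for brevity:

  -- Unload data to S3 in an open columnar format so tools without
  -- native connectors can still consume it.
  CREATE STAGE IF NOT EXISTS interchange_stage URL = 's3://my-bucket/interchange/';
  COPY INTO @interchange_stage/daily_sales/
  FROM (SELECT * FROM daily_sales)
  FILE_FORMAT = (TYPE = 'PARQUET');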

7. INCORPORATE EXTENSIBILITY

Organizations use many disparate tools to derive meaning from their data. For example, an organization may write custom APIs to scan images and extract text from them. Another example is a custom algorithm that performs sentiment analysis on customer service chats. Make sure you build modern pipelines that can leverage this custom code. By using APIs and pipelining tools, you can create a data flow that incorporates outside code seamlessly.
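
On Snowflake, for example, outside code can be wired into SQL through an external function. A sketch, assuming a hypothetical sentiment-scoring service and a pre-configured API integration; the endpoint, integration name, and tables are placeholders:

  -- Each row's transcript is scored by custom code running outside
  -- the platform, called transparently from SQL.
  CREATE EXTERNAL FUNCTION sentiment_score(chat_text STRING)
    RETURNS FLOAT
    API_INTEGRATION = my_api_integration
    AS 'https://example.com/api/sentiment';

  SELECT chat_id, sentiment_score(transcript) AS sentiment
  FROM support_chats;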

8. ENABLE DATA SHARING IN YOUR PIPELINES

Often, multiple groups inside and outside of your organization need the same core data to perform their analyses. For example, a retailer may need to share sales data with three different suppliers. Building separate pipelines with the same data would take time and cost money. As an alternative, modern cloud tools enable you to create a shared pipeline and govern who can access the data. Shared pipelines get the right information to the right people quickly.
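
In Snowflake, for example, governed sharing is expressed as a share object rather than a new pipeline. A minimal sketch with illustrative object names and a placeholder consumer account:

  -- Grant a supplier read access to one governed copy of the data.
  CREATE SHARE sales_share;
  GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
  GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
  GRANT SELECT ON TABLE sales_db.public.daily_sales TO SHARE sales_share;
  ALTER SHARE sales_share ADD ACCOUNTS = partner_org.supplier_account;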

9. CHOOSE THE RIGHT TOOL FOR DATA WRANGLING

A data wrangling tool can fix inconsistencies in data, transforming distinct entities such as fields, rows, or data values within a data set so they're easier to leverage. For example, the store name "Giantmart" might arrive in your pipeline from different sources in different forms, such as "Giant-Mart," "Giantmart Megacenter," and "Giant-mart Inc." This can cause problems as the data is loaded and analyzed. Cleaner data equals better, more accurate insights for business decision-making.
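
Whatever tool you choose, the underlying operation often reduces to mapping raw values to canonical ones. A sketch of that step in plain SQL, with hypothetical table names:

  -- store_name_map pairs each raw spelling ('Giant-Mart', 'Giant-mart Inc.')
  -- with the canonical name ('Giantmart').
  UPDATE raw_sales
  SET store_name = m.canonical_name
  FROM store_name_map m
  WHERE raw_sales.store_name = m.raw_name;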

10. BUILD DATA CATALOGING INTO YOUR ENGINEERING STRATEGY

Analysts may have questions about the data in your pipeline, such as where it came from, who has accessed it, or which business process owns it. A data scientist may need to view the data in its raw form to verify its veracity. End users may also want to know which data sets can be trusted and which are a work in progress. Build a data catalog that tracks data lineage so you can trace the data when needed. This will increase end users' trust in the data and also improve the data's accuracy.
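
Cataloging can start small. In Snowflake-style SQL, for instance, object tags and comments can record trust level and ownership directly on the data; the tag names and values below are illustrative:

  -- Mark a data set as trusted and record its owner and origin.
  CREATE TAG IF NOT EXISTS data_status;
  ALTER TABLE daily_sales SET TAG data_status = 'trusted';
  COMMENT ON TABLE daily_sales IS 'Owned by retail ops; sourced from the POS feed';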

11. RELY ON DATA OWNERS TO SET SECURITY POLICY

Data engineers may not know how to set the security policy for a data set: who can see the data and what kind of access they should have. For example, they might not realize that certain data fields need to be obfuscated before the data is sent to a particular user, potentially causing a security or regulatory issue. To prevent this scenario, the owner or producer of the data should set the security policy. Others can provide recommendations, but ultimately the owner is most aware of how the data needs to be secured before it is distributed.
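
In Snowflake, for example, an owner can encode such a policy as a masking rule that travels with the column. A minimal sketch, with an illustrative role, table, and column:

  -- Only the owner's role sees real email addresses; everyone else
  -- receives an obfuscated value.
  CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
    CASE WHEN CURRENT_ROLE() = 'DATA_OWNER' THEN val ELSE '*** MASKED ***' END;

  ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;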

The world of data engineering is changing quickly. Technologies such as IoT, AI, and the cloud are transforming data pipelines and upending traditional methods of data management. The decisions you make about your data pipeline, whether large or small, can have a significant impact on the business. The wrong choices mean increased costs and time spent on unnecessary tasks. The right decisions enable the business to harness the power of data to achieve profitability and growth for years to come.
ABOUT SNOWFLAKE
Snowflake Cloud Data Platform shatters the barriers that prevent organizations from unleashing the true value from their data.
Thousands of customers deploy Snowflake to advance their businesses beyond what was once possible by deriving all the insights
from all their data by all their business users. Snowflake equips organizations with a single, integrated platform that offers the only
data warehouse built for any cloud; instant, secure, and governed access to their entire network of data; and a core architecture to
enable many other types of data workloads, including a single platform for developing modern data applications.
Snowflake: Data without limits. Find out more at snowflake.com.