The Future of Data Engineering
The job of data engineers is to help an organization move and process data but not
to do that themselves.
Data architecture follows a predictable evolutionary path from monolith to
automated, decentralized, self-serve data microwarehouses.
Integration with connectors is a key step in this evolution but increases workload
and requires automation to correct for that.
Increasing regulatory scrutiny and privacy requirements will drive the need for
automated data management.
A lack of polished, non-technical tooling is currently preventing data ecosystems
from achieving full decentralization.
"The future of data engineering" is a fancy title for presenting stages of data pipeline
maturity and building out a sample architecture as I progress, until I land on a modern
data architecture and data pipeline. I will also hint at where things are headed for the
next couple of years.
It's important to know me and my perspective when I'm predicting the future, so that you can weigh my predictions against your own perspective and act accordingly.
We, at WePay, use Kafka a lot. I spent about seven years at LinkedIn, the birthplace of Kafka, which is a pub/sub, write-ahead log. Kafka has become the backbone of a log-based architecture. At LinkedIn, I spent a bunch of time doing everything from data science to service infrastructure, and so on. I also wrote Apache Samza, which is a stream-processing system, and helped build out their Hadoop ecosystem. Before that, I spent time as a data scientist at PayPal.
There are many definitions for "data engineering". I've seen people use it when talking
about business analytics and in the context of data science. I'm going to throw down my
definition: a data engineer's job is to help an organization move and process data. To
move data means streaming pipelines or data pipelines; to process data means data
warehouses and stream processing. Usually, we're focused on asynchronous, batch or
streaming stuff as opposed to synchronous real-time things.
I want to call out the key word here: "help". Data engineers are not supposed to be
moving and processing the data themselves but are supposed to be helping the
organization do that.
Maxime Beauchemin is a prolific engineer who started out, I think, at Yahoo and passed
through Facebook, Airbnb, and Lyft. Over the course of his adventures, he wrote Airflow,
which is the job scheduler that we and a bunch of other companies use. He also wrote
Superset. In his "The Rise of the Data Engineer" blog post a few years ago, Beauchemin
said that "... data engineers build tools, infrastructure, frameworks, and services." This is
how we go about helping the organization to move and process the data.
The reason that I put this presentation together was a 2019 blog post from a company
called Ada in which they talk about their journey to set up a data warehouse. They had a
MongoDB database and were starting to run up against its limits when it came to
reporting and some ad hoc query things. Eventually, they landed on Apache Airflow and
Redshift, which is AWS's data-warehousing solution.
What struck me about the Ada post was how much it looked like a post that I’d written
about three years earlier. When I landed at WePay, they didn't have much of a data
warehouse and so we went through almost the exact same exercise that Ada did. We
eventually landed on Airflow and BigQuery, which is Google Cloud's version of Redshift.
The Ada post and mine are almost identical, from the diagrams to even the structure and
sections of the post.
This was something we had done a few years earlier and so I threw down the gauntlet on Twitter and predicted Ada's future. I claimed to know how they would progress as they continued to build out their data warehouse: one step would be to go from batch to a real-time pipeline, and the next step would be to a fully self-serve or automated pipeline.
I'm not trying to pick on Ada. I think it's a perfectly reasonable solution. I just think that
there's a natural evolution of a data pipeline and a data warehouse and the modern data
ecosystem and that's really what I want to cover here.
I refined this idea with the tweet in Figure 1. The idea was that we initially land with
nothing, so we need to set up a data warehouse quickly. Then we expand as we do more
integrations, and maybe we go to real-time because we've got Kafka in our ecosystem.
Finally, we move to automation for on-demand stuff. That eventually led to my "The
Future of Data Engineering" post in which I discussed four future trends.
The first trend is timeliness, going from this batch-based periodic architecture to a more real-time architecture. The second is connectivity; once we go down the timeliness route, we start doing more integration with other systems. The last two tie together: automation and decentralization. On the automation front, I think we need to start thinking about automating not just our operations but our data management. And then there's decentralizing the data warehouse.
The reason I created this path is that everyone's future is different because everyone is at
a different point in their life cycle. The future at Ada looks very different than the future
at WePay because WePay may be farther along on some dimensions - and then there are
companies that are even farther along than WePay.
These stages let you find your current starting point and build your own roadmap from
there.
Stage 0: None
You're probably at this stage if you have no data warehouse. You probably have a monolithic architecture. You're maybe a smaller company and you need a warehouse up
and running now. You probably don't have too many data engineers and so you're doing
this on the side.
Stage 0 looks like Figure 3, with a lovely monolith and a database. You take a user and
you attach it to the database. This sounds crazy to people that have been in the data-
warehouse world for a while but it's a viable solution when you need to get things quickly
up and running. The data appears to the user as basically real-time data because you're
reading directly from the database. It's easy and cheap.
This is where WePay was when I landed there in 2014. We had a PHP monolith and a
monolithic MySQL database. The users we had, though, weren't happy and things were
starting to tip over. We had queries timing out. We had users impacting each other -
most OLTP systems that you're going to be using do not have strong isolation and multi-
tenancy so users can really get in each other's way. Because we were using MySQL, we
were missing some of the fancier analytic SQL stuff that our data-science and business-
analytics people wanted and report generation was starting to break. It was a pretty
normal story.
Stage 1: Batch
We started down the batch path, and this is where the Ada post and my earlier post
come in.
Going into Stage 1, you probably have a monolithic architecture. You might be starting to
lean away from that but usually it works best when you have relatively few sources. Data
engineering is now probably your part-time job. Queries are timing out because you're
exceeding the database capacity, whether in space, memory, or CPU.
The lack of complex analytical SQL functions is becoming an issue for your organization
as people need those for customer-facing or internal reports. People are asking for
charts, business intelligence, and all that kind of fun stuff.
This is where the classic batch-based approach comes in. Between the database and the
user, you stuff a data warehouse that can accomplish a lot more OLAP and fulfill analytic
needs. To get data from the database into that data warehouse, you have a scheduler that
periodically wakes up to suck in the data.
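Here is a minimal sketch of that scheduler, written as an Airflow DAG since Airflow is what both Ada and WePay landed on. The connection ID, table name, and daily-partition logic are hypothetical simplifications; the real pipeline described below did incremental loads roughly every 15 minutes.

```python
# Hypothetical sketch: a daily Airflow DAG that pulls one partition of rows
# out of an OLTP MySQL replica and hands it to a warehouse load step.
# Connection ID, table name, and partitioning logic are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.mysql.hooks.mysql import MySqlHook


def load_incremental_partition(**context):
    """Copy one day's partition of rows from MySQL into the warehouse."""
    ds = context["ds"]  # logical date of this run, e.g. "2021-02-24"
    mysql = MySqlHook(mysql_conn_id="oltp_replica")  # read from a replica, not the primary
    rows = mysql.get_records(
        "SELECT * FROM payments WHERE DATE(modify_time) = %s",
        parameters=(ds,),
    )
    # A real pipeline would stage these rows and issue a partition load into
    # BigQuery/Redshift; here we only report the chunk size.
    print(f"would load {len(rows)} rows for {ds}")


with DAG(
    dag_id="payments_incremental_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_payments_partition",
        python_callable=load_incremental_partition,
    )
```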
That's where WePay was at about a year after I joined. This architecture is fantastic in terms of tradeoffs. You can get the pipeline up pretty quickly these days - when I did it in
2016, it took a couple of weeks. Our data latency was about 15 minutes, so we did
incremental partition loads, taking little chunks of data and loading them in. We were
running a few hundred tables. This is a nice place to start if you're trying to get
something up and running but, of course, you outgrow it.
The number of Airflow workflows that we had went from a few hundred to a few
thousand. We started running tens or hundreds of thousands of tasks per day, and that
became an operational issue because of the probability that some of those are not going
to work. We also discovered - and this is not intuitive for people who haven't run
complex data pipelines - that the incremental or batch-based approach requires
imposing dependencies or requirements on the schemas of the data that you're loading.
We had issues with create_time and modify_time and ORMs doing things in different ways
and it got a little complicated for us.
DBAs were impacting our workload; they could do something that hurt the replica that
we're reading off of and cause latency issues, which in turn could cause us to miss data.
Hard deletes weren't propagating - and this is a big problem if you have people who
delete data from your database. Removing a row or a table or whatever can cause
problems with batch loads because you just don't know when the data disappears. Also,
MySQL replication latency was affecting our data quality and periodic loads would cause
occasional MySQL timeouts on our workflow.
Stage 2: Realtime
This is where real-time data processing kicks off. This approaches the cusp of the
modern era of real-time data architecture and it deserves a closer look than the first two
stages.
You might be ready for Stage 2 if your load times are taking too long. You've got
pipelines that are no longer stable, whether because workflows are failing or your
RDBMS is having trouble serving the data. You've got complicated workflows and data
latency is becoming a bigger problem: maybe the 15-minute jobs you started with in
2014 are now taking an hour or a day, and the people using them aren't happy about it.
Data engineering is probably your full-time job now.
Your ecosystem might have something like Apache Kafka floating around. Maybe the operations folks have spun it up to do log aggregation and run some operational metrics
over it; maybe some web services are communicating via Kafka to do some queuing or
asynchronous processing.
From a data-pipeline perspective, this is the time to get rid of that batch processor for
ETL purposes and replace it with a streaming platform. That's what WePay did. We
changed our ETL pipeline from Airflow to Debezium and a few other systems, so it
started to look like Figure 6.
The hatched Airflow box now contains five boxes, and we're talking about many
machines so the operational complexity has gone up. In exchange, we get a real-time
pipeline.
Kafka is a write-ahead log that we can send messages to (they get appended to the end of
the log) and we can have consumers reading from various locations in that log. It's a
sequential read and sequential write kind of thing.
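To make the log model concrete, here is a minimal confluent-kafka-python sketch; the broker address and topic name are placeholders. The producer appends to the end of the log, and the consumer is pointed at an explicit position in the log rather than wherever its group last left off.

```python
# Minimal sketch of Kafka's log model with confluent-kafka-python.
# Broker address and topic name are placeholders.
from confluent_kafka import Consumer, Producer, TopicPartition

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("payments.events", key=b"payment-42", value=b'{"state": "captured"}')
producer.flush()  # the message is appended to the end of the log

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "warehouse-loader",
    "enable.auto.commit": False,
})
# Read from an explicit location in the log (partition 0, offset 0 here)
# instead of the latest committed offset.
consumer.assign([TopicPartition("payments.events", 0, 0)])
msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.offset(), msg.value())
consumer.close()
```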
We use it with the upstream connectors. Kafka has a component called Kafka Connect.
We heavily use Debezium, a change-data-capture (CDC) connector that reads data from
MySQL in real time and funnels it in real time into Kafka.
CDC is essentially a way to replicate data from one data source to others. Wikipedia's fancy definition of CDC is "… the identification, capture, and delivery of the changes made to the enterprise data sources." A concrete example is what something like
Debezium will do with a MySQL database. When I insert a row, update that row, and
later delete that row, the CDC feed will give me three different events: an insert, the
update, and the delete. In some cases, it will also provide the before and the after states
of that row. As you can imagine, this can be useful if you're building out a data
warehouse.
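As an illustration of how those events might be handled downstream, here is a sketch that interprets a Debezium-style change envelope. The op codes and before/after fields follow Debezium's documented event format, but the apply_* functions are invented placeholders, and the exact payload shape depends on connector and converter configuration.

```python
# Sketch of interpreting Debezium-style change events consumed from Kafka.
# Debezium tags each event with an "op" code and (depending on configuration)
# the "before" and "after" row images; the apply_* functions are placeholders
# for whatever the downstream system, e.g. the warehouse loader, actually does.
import json


def apply_insert(row):
    print("insert", row)


def apply_update(before, after):
    print("update", before, "->", after)


def apply_delete(row):
    print("delete", row)


def handle_change_event(raw_value: bytes) -> None:
    envelope = json.loads(raw_value)
    payload = envelope.get("payload", envelope)  # schema may or may not be embedded
    op = payload["op"]
    if op in ("c", "r"):      # insert, or a row read during an initial snapshot
        apply_insert(payload["after"])
    elif op == "u":           # update: both images are available
        apply_update(payload["before"], payload["after"])
    elif op == "d":           # delete: only the old image remains
        apply_delete(payload["before"])
```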
Debezium can use a bunch of sources. We use MySQL, as I mentioned. One of the things
in that Ada post that caught my eye was the fact that they were using MongoDB - sure
enough, Debezium has a MongoDB connector. We contributed a Cassandra connector to
Debezium a couple of months ago. It's incubating and we're still getting up off the
ground with it ourselves but that's something that we're going to be using heavily in the
near future.
Last but not least in our architecture, we have KCBQ, which stands for Kafka Connect BigQuery
(I do not name things creatively). This connector takes data from Kafka and
loads it into BigQuery. The cool thing about this, though, is that it leverages BigQuery’s
real-time streaming insert API. One of the cool things about BigQuery is that you can use
its RESTful API to post data into the data warehouse in real time and it's visible almost
immediately. That gives us a latency from our production database to our data
warehouse of a couple of seconds.
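For a sense of what the streaming-insert side of that looks like, here is a sketch using the google-cloud-bigquery client; the project, dataset, table, and row contents are placeholders. KCBQ itself is a Kafka Connect plugin, so treat this only as an illustration of the underlying API.

```python
# Sketch: pushing rows into BigQuery through the streaming insert API.
# The table ID and row contents are placeholders; successfully inserted rows
# become queryable almost immediately.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.warehouse.payments"

rows = [
    {"payment_id": 42, "state": "captured", "modify_time": "2021-02-24T12:00:00Z"},
]
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"streaming insert failed: {errors}")
```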
This pattern opens up a lot of use cases. It lets you do real-time metrics and business
intelligence off of your data warehouse. It also allows you to debug, which is not
immediately obvious - if your engineers need to see the state of their database in
production right now, being able to go to the data warehouse to expose that to them so
that they can figure out what's going on with their system with essentially a real-time
view is pretty handy.
You can also do some fancy monitoring with it. You can impose assertions about what
the shape of the data in the database should look like so that you can be satisfied that the
data warehouse and the underlying web service itself are healthy.
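A toy version of such an assertion might look like the following; the table, columns, and thresholds are invented for illustration.

```python
# Toy freshness/shape assertions run against the warehouse on a schedule.
# Table, column names, and thresholds are invented for illustration.
from google.cloud import bigquery

client = bigquery.Client()

checks = {
    "data is fresh": """
        SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(modify_time), MINUTE) < 30
        FROM `my-project.warehouse.payments`
    """,
    "no null payment ids": """
        SELECT COUNT(*) = 0
        FROM `my-project.warehouse.payments`
        WHERE payment_id IS NULL
    """,
}

for name, sql in checks.items():
    passed = next(iter(client.query(sql).result()))[0]
    if not passed:
        print(f"ALERT: assertion failed: {name}")  # page someone / open a ticket
```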
Figure 8 shows some of the inevitable problems we encountered in this migration. Not all of our connectors were on this pipeline, so we found ourselves between the new cool
stuff and the older painful stuff.
Datastore is a Google Cloud system that we were using; that was still Airflow-based.
Cassandra didn't have a connector and neither did Bigtable, which is a Google Cloud
equivalent of HBase. We had BigQuery but BigQuery needed more than just our primary
OLTP data; it needed logging and metrics. We had Elasticsearch and this fancy graph
database (which we're going to be open-sourcing soon) that also needed data.
The ecosystem was looking more complicated. We're no longer talking about this little
monolithic database but about something like Figure 9, which comes from Confluent
and is pretty accurate.
You have to figure out how to manage some of this operational pain. One of the first
things you can do is to start integration so that you have fewer systems to deal with. We
used Kafka for that.
Stage 3: Integration
If you think back 20 years to enterprise-service-bus architectures, that's really all data integration is. The only difference is that streaming platforms like Kafka along with the evolution in stream processing have made this viable.
You might be ready for data integration if you've got a lot of microservices. You have a
diverse set of databases as Figure 8 depicts. You've got some specialized, derived data
systems; I mentioned a graph database but you may have special caches or a real-time
OLAP system. You've got a team of data engineers now, people who are responsible for
managing this complex workload. Hopefully, you have a happy, mature SRE
organization that's more than willing to take on all these connectors for you.
Figure 10 shows what data integration looks like. We still have the base data pipeline
that we've had so far. We've got a service with a database, we've got our streaming
platform, and we've got our data warehouse, but now we also have web services, maybe a
NoSQL thing, or a NewSQL thing. We've got a graph database and search system
plugged in.
Figure 11 depicts where WePay was at the beginning of 2019. Things were becoming
more complicated. Debezium connects not only to MySQL but to Cassandra as well, with
the connector that we'd been working on. At the bottom is Kafka Connect Waltz (KCW).
Waltz is a ledger that we built in house that's Kafka-ish in some ways and more like a
database in other ways, but it services our ledger use cases and needs. We are a
payment-processing system so we care a lot about data transactionality and multi-region
availability and so we use a quorum-based write-ahead log to handle serializable
transactions. On the downstream side, we've got a bunch of stuff going on.
We were incurring a lot of pain and have many boxes on our diagram. This is getting
more and more complicated. The reason we took on this complexity has to do with
Metcalfe's law. I'm going to paraphrase the definition and probably corrupt it: it
essentially states that the value of a network increases as you add nodes and connections
to it. Metcalfe's law was initially intended to apply to communication devices, like
adding more peripherals to an Ethernet network.
So, we're getting to a network effect in our data ecosystem. In a post in early 2019, I thought through the implications of Kafka as an escape hatch. You add more systems to the Kafka bus, all of which are able to load their data in and expose it to other systems and slurp up the data in Kafka, and you leverage this network effect in your data ecosystem.
We found this to be a powerful architecture because the data becomes portable. I'm not
saying it'll let you avoid vendor lock-in but it will at least ameliorate some of those
concerns. Porting data is usually the harder part to deal with when you're moving
between systems. The idea is that it becomes theoretically possible, if you're on Splunk
for example, to plug in Elasticsearch alongside it to test it out - and the cost to do so is
certainly lower.
Data portability also helps with multi-cloud strategy. If you need to run multiple clouds
because you need high availability or you want to pick cloud vendors to save money, you
can use Kafka and the Kafka bus to move the data around.
The problems in Stage 3 look a little different. When WePay bought into this integration
architecture, we found ourselves still spending a lot of time on fairly manual tasks like
those in Figure 12.
In short, we were spending a lot of time administering the systems around the streaming
platform - the connectors, the upstream databases, the downstream data warehouses -
and our ticket load looked like Figure 13.
Figure 13: Ticket load at WePay in Stage 3
Fans of JIRA might recognize Figure 13. It is a screenshot of our support load in JIRA in
2019. It starts relatively low then it skyrockets and it never fully recovered, although
there's a nice trend late in the year that relates to the next step of our evolution.
Stage 4: Automation
We started investing in automation. This is something you've got to do when your
system gets this big. I think most people would say we should have been automating all
along.
You might be ready for Stage 4 if your SREs can't keep up, you're spending a lot of time
on manual toil, and you don't have time for the fun stuff.
Figure 14: Stage 4 adds two new layers to the data ecosystem
Figure 14 shows the two new layers that appear in Stage 4. The first is the automation of
operations, and this won’t surprise most people. It's the DevOps stuff that has been
going on for a long time. The second layer, data-management automation, is not quite as
obvious.
Let's first cover automation for operations. Google's Site Reliability Engineering handbook defines toil as manual, repeatable, automatable stuff. It's usually interrupt-driven: you're getting Slack messages or tickets or people are showing up at your desk asking you to do things. That is not what you want to be doing. The Google book says, "If a human operator needs to touch your system during normal operations, you have a bug."
But the "normal operations" of data engineering were what we were spending our time
on. Anytime you're managing a pipeline, you're going to be adding new topics, adding
new data sets, setting up views, and granting access. This stuff needs to get automated.
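As a Python illustration of the same idea (the topic names and configs are invented, and the tools discussed next do this declaratively), here is a sketch that creates Kafka topics idempotently from a desired-state list instead of one ticket at a time.

```python
# Sketch: declaratively creating Kafka topics from a desired-state list.
# Topic names, partition counts, and the broker address are invented.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

desired_topics = [
    NewTopic("db.payments.payments", num_partitions=12, replication_factor=3),
    NewTopic("db.payments.refunds", num_partitions=6, replication_factor=3),
]

# Only create what doesn't exist yet, so the script can run on every change.
existing = set(admin.list_topics(timeout=10).topics)
to_create = [t for t in desired_topics if t.topic not in existing]

if to_create:
    for topic, future in admin.create_topics(to_create).items():
        future.result()  # raises if creation failed
        print(f"created {topic}")
```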
Great news! There's a bunch of solutions for this: Terraform, Ansible, and so on. We at
WePay use Terraform and Ansible but you can substitute any similar product.
Figure 15: Some systemd_log thing in Terraform that logs some stuff when you're using
compaction (which is an exciting policy to use with your systemd_logs)
You can use it to manage your topics. Figures 15 and 16 show some Terraform
automations. Not terribly surprising.
Yes, we should have been doing this, but we kind of were doing this already. We had
Terraform, we had Ansible for a long time - we had a bunch of operational tooling. We
were fancy and on the cloud. We had a bunch of scripts to manage BigQuery and
automate a lot of our toil like creating views in BigQuery, creating data sets, and so on.
So why did we have such a high ticket load?
The answer is that we were spending a lot of time on data management. We were
answering questions like "Who's going to get access to this data once I load it?",
"Security, is it okay to persist this data indefinitely or do we need to have a three-year
truncation policy?", and "Is this data even allowed in the system?" As a payment
processor, WePay deals with sensitive information and our people need to follow
geography and security policies and other stuff like that.
We have a fairly robust compliance arm that's part of JPMorgan Chase. Because we deal with credit cards, we have PCI audits and we deal with credit-card data. Regulation is
here and we really need to think about this. Europe has GDPR. California has CCPA. PCI
applies to credit-card data. HIPAA for health. SOX applies if you're a public company.
New York has SHIELD. This is going to become more and more of a theme, so get used
to it. We have to get better at automating this stuff or else our lives as data engineers are
going to be spent chasing people to make sure this stuff is compliant.
I want to discuss what that might look like. As I get into the futuristic stuff, I get more
vague or hand-wavy, but I'm trying to keep it as concrete as I can.
First thing you want to do for automated data management is probably to set up a data
catalog. You probably want it centralized, i.e., you want to have one with all the
metadata. The data catalog will have the locations of your data, what schemas that data
has, who owns the data, and lineage, which is essentially the source and path of the data.
The lineage for my initial example is that it came from MySQL, it went to Kafka, and
then it got loaded into BigQuery - that whole pipeline. Lineage can even track encryption
or versioning, so you know what things are encrypted and what things are versioned as
the schemas evolved.
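To make that concrete, here is a toy model of what a single catalog entry might carry; real catalogs such as Amundsen, DataHub, or Atlas have far richer models, so this is purely illustrative.

```python
# Toy model of a data-catalog entry: where the data lives, what it looks like,
# who owns it, and where it came from. Real catalogs (Amundsen, DataHub, Atlas)
# are far richer; the names and values here are invented.
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineageHop:
    system: str          # e.g. "mysql", "kafka", "bigquery"
    identifier: str      # table, topic, or dataset name in that system
    encrypted: bool = False


@dataclass
class CatalogEntry:
    location: str                      # e.g. "bigquery://my-project.warehouse.payments"
    schema: dict                       # field name -> type
    schema_version: int
    owners: List[str]                  # emails or team names
    lineage: List[LineageHop] = field(default_factory=list)


payments = CatalogEntry(
    location="bigquery://my-project.warehouse.payments",
    schema={"payment_id": "INT64", "state": "STRING", "modify_time": "TIMESTAMP"},
    schema_version=3,
    owners=["payments-team@example.com"],
    lineage=[
        LineageHop("mysql", "payments.payments"),
        LineageHop("kafka", "db.payments.payments"),
        LineageHop("bigquery", "warehouse.payments"),
    ],
)
```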
There's a bunch of activity in this lineage area. Amundsen is a data catalog from Lyft.
You have Apache Atlas. LinkedIn open-sourced DataHub as a patch in 2020. WeWork
has a system called Marquez. Google has a product called Data Catalog. I know I'm
missing more.
These things generally do a lot, more than one thing, but I want to show a concrete
example. I yanked Figure 17 from the Amundsen blog. It has fake data, the schema, the
field types, the data types, everything. At the right, it has who owns the data - and notice
that Add button there.
It tells us what source code generated the data — in this case, it's Airflow, as indicated by
that little pinwheel — and some lineage. It even has a little preview. It's a pretty nice UI.
Underneath it, of course, is a repository that actually houses all this information. That's
really useful because you need to get all your systems to be talking to this data catalog.
That Add button in the Owned By section is important. You don't as a data engineer
want to be entering that data yourself. You do not want to return to the land of manual
data stewards and data management. Instead, you want to be hooking up all these
systems to your data catalog so that they're automatically reporting stuff about the
schema, about the evolution of the schema, about the ownership when the data is loaded
from one to the next.
Figure 18: Your data ecosystem needs to talk to your data catalog
First off, you need your systems like Airflow and BigQuery, your data warehouses and
stuff, to talk to the data catalog. I think there's quite a bit of movement there.
You then need your data-pipeline streaming platforms to talk to the data catalog. I
haven't seen as much yet for that. There may be stuff coming out that will integrate
better, but right now I think that's something you’ve got to do on your own.
I don't think we've done a really good job of bridging the gap on the service side. You
want your service stuff in the data catalog as well: things like gRPC protobufs, JSON
schemas, and even the DBs of those databases.
Once you know where all your data is, the next step is to configure access to it. If you
haven't automated this, you're probably going to Security, Compliance, or whoever the
policymaker is and asking if this individual can see this data whenever they make access
requests - and that's not where you want to be. You want to be able to automate the
access-request management so that you can be as hands off with it as possible.
This is kind of an alphabet soup with role-based access control (RBAC), identity and access management (IAM), and access-control lists (ACLs). Access control is just a bunch of fancy words for a bunch of different features for managing groups, user access, and so
on. You need three things to do this: you need your systems to support it, you need to
provide tooling to policymakers so they can configure the policies appropriately, and you
need to automate the management of the policies once the policymakers have defined
them.
There has been a fair amount of work done to support this aspect. Airflow has RBAC,
which was a patch WePay submitted. Airflow has taken this seriously and has added a
lot more, like DAG-level access control. Kafka has had ACLs for quite a while.
You can use tools to automate this stuff. We want to automate adding a new user to the
system and configuring their access. We want to automate the configuration of access
controls when a new piece of data is added to the system. We want to automate service-
account access as new web services come online.
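As one concrete example of that kind of automation, here is a sketch that grants a user read access to a BigQuery dataset from code, following the pattern in Google's client library; the project, dataset, and email address are placeholders. The same idea applies to Kafka ACLs or Airflow roles.

```python
# Sketch: granting a user read access to a BigQuery dataset from code, so an
# approved access request can be applied without a human running commands.
# Project, dataset, and email address are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.warehouse")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="new.analyst@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```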
There's occasionally a need to grant someone temporary access to something. You don't want to have to set a calendar reminder to revoke the access for this user in three weeks.
You want that to be automated. The same goes for unused access. You want to know
when users aren't using all the permissions that they're granted so that you can strip
those unused permissions to limit the vulnerability of the space.
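One way to make temporary grants self-expiring is to record an expiry alongside every grant and let a scheduled job do the revoking. The sketch below is hypothetical: grant_access and revoke_access stand in for whichever system actually holds the permission, and the in-memory list stands in for a real store.

```python
# Hypothetical sketch: temporary access that revokes itself. grant_access and
# revoke_access stand in for whichever system really holds the permission
# (BigQuery, Kafka ACLs, Airflow roles); `grants` stands in for a real store.
from datetime import datetime, timedelta, timezone

grants = []


def grant_access(principal: str, resource: str) -> None:
    print(f"granting {principal} access to {resource}")


def revoke_access(principal: str, resource: str) -> None:
    print(f"revoking {principal} access to {resource}")


def grant_temporary(principal: str, resource: str, ttl: timedelta) -> None:
    grant_access(principal, resource)
    grants.append({
        "principal": principal,
        "resource": resource,
        "expires_at": datetime.now(timezone.utc) + ttl,
    })


def revoke_expired() -> None:
    """Run periodically from a scheduler instead of setting a calendar reminder."""
    now = datetime.now(timezone.utc)
    for grant in [g for g in grants if g["expires_at"] <= now]:
        revoke_access(grant["principal"], grant["resource"])
        grants.remove(grant)


grant_temporary("oncall.engineer@example.com", "warehouse.payments", timedelta(weeks=3))
revoke_expired()  # nothing expired yet; the scheduled run in three weeks would revoke it
```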
Now that your data catalog tells you where all the data is and you have policies set up,
you need to detect violations. I mostly want to discuss data loss prevention (DLP) but
there's also auditing, which is keeping track of logs and making sure that the activities
and systems are conforming to the required policies.
I'm going to talk about Google Cloud Platform because I use it and I have some
experience with its data-loss solution. There's a corresponding AWS product called
Macie. There's also an open-source project called Apache Ranger, with a bit of an
enforcement and monitoring mechanism built into it; that's more focused on the
Hadoop ecosystem. What all these things have in common is that you can use them to
detect the presence of sensitive data where it shouldn't be.
Figure 20 is an example. A piece of submitted text contains a phone number, and the system sends a result that says it is "very likely" that it has detected an infoType of phone
number. You can use this stuff to monitor your policies. For example, you can run DLP
checks on a data set that is supposed to be clean - i.e., not have any sensitive information
in it - and if a check finds anything like a phone number, Social Security number, credit
card, or other sensitive information, it can immediately alert you that there's a violation
in place.
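Here is roughly what such a check looks like with the Google Cloud DLP Python client, following its documented inspect-content pattern; the project ID, chosen infoTypes, and sample text are placeholders. Pointed at a dataset that is supposed to be clean, any finding becomes an alert.

```python
# Sketch: asking Cloud DLP whether a piece of text contains sensitive infoTypes.
# The project ID and sample text are placeholders.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
response = dlp.inspect_content(
    request={
        "parent": "projects/my-project",
        "inspect_config": {
            "info_types": [{"name": "PHONE_NUMBER"}, {"name": "CREDIT_CARD_NUMBER"}],
            "include_quote": True,
        },
        "item": {"value": "Call me at (555) 253-0000."},
    }
)
for finding in response.result.findings:
    # e.g. PHONE_NUMBER at likelihood VERY_LIKELY -> raise a policy-violation alert
    print(finding.info_type.name, finding.likelihood)
```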
There’s a little bit of progress here. Users can use the data catalog and find the data that
they need, we have some automation in place, and maybe we're using Terraform to
manage ACLs for Kafka or to manage RBAC in Airflow. But there's still a problem and
that is that data engineering is probably still responsible for managing that configuration
and those deployments. The reason for that is mostly the interface. We're still getting
pull requests, Terraform, DSL, YAML, JSON, Kubernetes ... it's nitty-gritty.
It might be a tall order to ask security teams to make changes to that. Asking your
compliance wing to make changes is an even taller order. Going beyond your compliance
people is basically impossible.
Stage 5: Decentralization
You're probably ready to decentralize your data pipeline and your data warehouses if
you have a fully automated real-time data pipeline but people are still coming to ask you
to load data.
If you have an automated data pipeline and data warehouse, I don't think you need a
single team to manage all this stuff. I think the place where this will first happen, and
we're already seeing this in some ways, is in a decentralization of the data warehouse. I
think we're moving towards a world where people are going to be empowered to spin up
multiple data warehouses and administer and manage their own.
I frame this line of thought based on our migration from monolith to microservices over
the past decade or two. Part of the motivation for that was to break up large, complex
things, to increase agility, to increase efficiency, and to let people move at their own
pace. A lot of those characteristics sound like your data warehouse: it's monolithic, it's
not that agile, you have to ask your data engineering team to do things, and maybe
you're not able to do things at your own pace. I think we're going to want to do the same
thing - go from a monolith to microwarehouses - and we're going to want a more
decentralized approach.
I'm not alone in this thought. Zhamak Dehghani wrote a blog post that is such a great description of what I'm thinking. She discusses the shift from this monolithic view to a more fragmented or decentralized view. She even discusses policy automation and a lot of the same stuff that I'm thinking about.
I think this shift towards decentralization will take place in two phases. Say you have a
set of raw tools - Git, YAML, JSON, etc. - and a beaten-down engineering team that is
getting requests left and right and running scripts all the time. To escape that, the first
step is simply to expose that raw set of tools to your other engineers. They're
comfortable with this stuff, they know Git, they know pull requests, they know YAML
and JSON and all that. You can at least start to expose the automated tooling and
pipelines to those teams so that they can begin to manage their own data warehouses.
An example of this would be a team that does a lot of reporting. They need a data
warehouse that they can manage so you might just give them keys to the castle, and they
can go about it. Maybe there's a business-analytics team that's attached to your sales
organization and they need a data warehouse. They can manage their own as well.
This is not the end goal; the end goal is full decentralization. But for that we need much
more development of the tooling that we're providing, beyond just Git, YAML, and the
RTFM attitude that we sometimes throw around.
We need polished UIs, something that you can give not only to an engineer who’s been
writing code for 10 years but to almost anyone in your organization. If we can get to that
point, I think we will be able to create a fully decentralized warehouse and data pipeline
where Security and Compliance can manage access controls while data engineers
manage the tooling and infrastructure.
This is what Maxime Beauchemin meant by "... data engineers build tools,
infrastructure, frameworks, and services." Everyone else can manage their own data
pipelines and their own data warehouses and data engineers can help them do that.
There’s that key word "help" that I drew attention to at the beginning.