An AWS Data Lake with S3 Explained!
By David Hundley | Towards Data Science
If you’ve had any connection to the data world, you’ve probably heard some memorable,
often quirky phrase about how valuable data is. I’m thinking of phrases like…
“Without big data, you are deaf and blind in the middle of a freeway.”
And truthfully, the hype is merited. (Except I’m not sure I’d agree with that one about
the bacon…) From an analytical perspective, data helps us to make informed
decisions about the next steps we should take in our businesses. This can manifest in
anything from tabular reports to data dashboards to this new thing getting a lot of hype
called machine learning. Where people in "ye olden days" were very much left in the
dark on how to make the best business decisions, we today have a LOT of data resources
at our disposal to help. Using my back-catalog of icons I’ve created for former blog posts
(😃), I’m not exaggerating when I say that all the things in the visual below can produce
valuable data.
So given that all this stuff is creating data, the next logical question is… how do we
make the best use of it?
This is where the concept of a data lake comes to mind. Put simply, a data lake is a
unified space to place all of your data — both structured and unstructured — to
build analytical solutions from. And because I’m a picture guy, here’s a simple picture
that illustrates that.
Of course, you want to manage this data lake to be sure it doesn’t become a data
dumping ground. Data governance is super important, so you’ll want to be sure you
manage things like metadata, data quality, and more with the stuff that you put into this
data lake. That’s sort of out of the scope for this post, but I would be doing you a
disservice if I didn’t at least mention it!
For this post in particular, I want to focus on what it means to build a data lake within
Amazon Web Services (AWS). With cloud solutions being all the rage these days, it
makes sense that people would want to build out their own data lake within AWS. More
specifically, it would make sense that people would want to use AWS’s Simple Storage
Service (S3) as the basis for the data lake.
Here’s the problem… no offense to AWS, but I don’t think they do a great job at
explaining how S3 differs from “old world” concepts. I hold four AWS certifications —
including the Big Data Specialty — and none of the study materials I came across while
preparing for them explained what I'm about to cover here particularly well. What I'm
going to share in this post will likely radically change your thinking
about how to properly design your AWS data lake on S3.
But BEFORE we get into that, let’s talk about those “old world” concepts…
So if you had data in one physical environment that had to be used for analytical
purposes in another physical environment, you probably had to copy that data over to
the new replica environment. Of course, you probably also kept a tie to the source
environment to ensure that the stuff in the replica environment stayed up-to-date.
That little image above represents copying data from one operational source to an
analytical replica. Of course, your operational source data most likely isn’t in one single
environment. It’s likely that you have tens — if not hundreds — of those operational
sources where you gather data. That’s a lot of data movement! But due to literal physical
limitations, that copy has to be done. The data can’t literally be in two places at the same
time, right?
Well, here's where things get different with AWS and their S3 buckets… If you've ever
tried to create a new S3 bucket, you may have noticed that bucket names have to be
globally unique: you can't use a name that any other AWS account anywhere has already
claimed. But do you know why that is the case? Don't fret if you don't! Maybe a simple
picture will help illustrate this…
In a sense, it's not unfair to think of AWS in general as already being one GIANT data
lake! The reason you can't create a bucket called "dkhundley" is that there's already one
present in this massive "data lake" we call S3. The physical infrastructure has been
abstracted away from us, so logically speaking, it’s like every single company’s data is
one big happy family.
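To make that concrete, here's a tiny sketch using Python's boto3 library. The bucket name and region here are just placeholders, but the idea is that trying to create a bucket whose name anybody else in the world has already claimed fails with a "BucketAlreadyExists" error.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")

try:
    # Bucket names are shared across ALL AWS accounts, not just your own.
    s3.create_bucket(Bucket="dkhundley")
except ClientError as error:
    # If anyone, anywhere has already claimed this name, S3 refuses to create it.
    if error.response["Error"]["Code"] in ("BucketAlreadyExists", "BucketAlreadyOwnedByYou"):
        print("That bucket name is already taken somewhere in S3!")
    else:
        raise
```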
Now, don’t let me scare you! By this logic, you might jump to the natural conclusion that
the data in your S3 buckets can be readily accessed by somebody else’s company and
their respective AWS account. This thankfully is NOT true. AWS has been very
intentional about putting the proper security around everything in AWS, including S3
buckets, so you can only access S3 buckets if you have the right credentials to do so.
Here’s the real kicker and why S3 is so different than on-premise infrastructure (this is
VERY important): you don’t necessarily have to be within the same account that
produced the data in the S3 bucket to have access to that data. For example, if you
set me up with the right credentials, I can see the contents of your company’s S3
bucket(s) right from my own personal AWS account, NO PHYSICAL COPYING
REQUIRED. That’s right, folks. This is a HUGE shift in mindset from how we do things in
the on-premise world. Where we focus our time on isolating data with physical
infrastructure, cloud computing shifts our attention to isolating data using security
policies.
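To make that a little more tangible, here's a rough sketch of what security-based sharing can look like with Python's boto3 library. The bucket name and account ID below are made-up placeholders; the point is simply that the bucket owner attaches a policy granting read access to a principal in a completely different AWS account, and no data ever gets copied.

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "your-company-data-lake"        # hypothetical bucket name
CONSUMER_ACCOUNT = "111122223333"        # hypothetical "other" AWS account ID

# Grant another account permission to list the bucket and read its objects.
# Isolation comes from this policy, not from physical separation.
cross_account_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountList",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{CONSUMER_ACCOUNT}:root"},
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            "Sid": "AllowCrossAccountGetObject",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{CONSUMER_ACCOUNT}:root"},
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(cross_account_policy))
```

(The consuming account still needs IAM permissions on its own side, but the takeaway stands: access is a policy decision, not a copy job.)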
Given that AWS adopts a pay-as-you-go model, you want to design things in such a way
that maximizes performance and minimizes costs. Considering that both storage and
data movement have their associated costs, replicating data from one S3 bucket to
another is both cost-prohibitive AND inefficient from a performance perspective.
Remember, in the context of S3, AWS accounts DO NOT physically isolate resources. Do
not make the mistake of thinking you have to copy data from one S3 bucket to another
just because you might not share the same account as others in your company! There
will still be specific use cases where
you do want to move data between S3 buckets, but if your analytical data is already good
to go in one S3 bucket, physically copying it to another “data lake account” S3 bucket is
probably not needed.
Test vs. Production Data: When you create a new IT solution that makes changes to
data, it’s natural to want to protect your production-level data from being negatively
impacted by that new solution. In most on-premise infrastructures, that means
physically isolating the environments between test and production. How you isolate
test vs. production data in S3 needs to be considered, and it can be done in a number
of ways. The safest and easiest way is to wholly isolate test and production data into
their own respective buckets, and you can manage organization using either bucket
naming standards or AWS tags. (Or both!) But if you don't want to separate everything
into different buckets,
there are ways to isolate certain things within each bucket. That’s going to take more
work on your end, but if cost management is a big factor to you, it might be worth it.
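As a rough illustration of the "separate buckets plus naming standards and tags" approach, here's a small boto3 sketch. All the bucket names and tag values are hypothetical.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Hypothetical naming standard: <company>-<domain>-<environment>
buckets = {
    "acme-sales-data-test": "test",
    "acme-sales-data-prod": "prod",
}

for bucket_name, environment in buckets.items():
    # One bucket per environment keeps test workloads away from production data.
    s3.create_bucket(Bucket=bucket_name)

    # Tags make it easy to filter, report on, and apply policies per environment.
    s3.put_bucket_tagging(
        Bucket=bucket_name,
        Tagging={"TagSet": [{"Key": "Environment", "Value": environment}]},
    )
```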
Sensitive Data Protection: This is a lot like the test vs. production isolation that we
just discussed in the point above. The easiest thing again is to isolate sensitive data
into its own bucket and really lock that down with lots of security measures, but
again, it is possible to lock down sensitive data even when it sits alongside non-sensitive
data in the same bucket. I probably wouldn't want to mess with the hassle of that, but you do
you, friends.
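If you do go the dedicated-bucket route, a couple of those "lots of security measures" might look something like the sketch below: blocking all public access and defaulting every object to KMS encryption. The bucket and key names are placeholders, and this is nowhere near an exhaustive list of controls.

```python
import boto3

s3 = boto3.client("s3")

SENSITIVE_BUCKET = "acme-sensitive-data"          # hypothetical bucket
KMS_KEY_ID = "alias/acme-sensitive-data-key"      # hypothetical KMS key

# Make absolutely sure nothing in this bucket can be exposed publicly.
s3.put_public_access_block(
    Bucket=SENSITIVE_BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Encrypt every object with a customer-managed KMS key by default.
s3.put_bucket_encryption(
    Bucket=SENSITIVE_BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ID,
                }
            }
        ]
    },
)
```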
Data Lake vs. Data Warehouse: Let’s be clear here… a data lake is NOT
synonymous with a data warehouse. A data warehouse generally contains only
structured or semi-structured data, whereas a data lake contains the whole shebang:
structured, semi-structured, and unstructured. Data lakes and data warehouses often
coexist, with the warehouse built on top of the lake. In terms of AWS, the most common
implementation of this is using S3 as the data lake and
Redshift as the data warehouse. Of course, there’s more than one way to skin a cat in
AWS, so don’t think you’re only limited to Redshift for your warehousing needs.
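Just to make the "warehouse on top of the lake" idea a bit more tangible, here's a hedged sketch of loading a Redshift table straight from S3 using the Redshift Data API via boto3. Every identifier below (cluster, database, table, bucket, and IAM role) is a made-up placeholder.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical COPY statement: Redshift reads the data right out of the S3 data lake.
copy_sql = """
    COPY analytics.daily_sales
    FROM 's3://acme-sales-data-prod/daily_sales/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/acme-redshift-s3-read'
    FORMAT AS PARQUET;
"""

redshift_data.execute_statement(
    ClusterIdentifier="acme-warehouse",   # hypothetical Redshift cluster
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
```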
Data Management & Governance: I already brushed past this once in this post, but
I think it’s worth bringing up again. A data lake can become a data dump VERY
quickly without proper data management and governance. When you design your
data lake, AWS does offer services like AWS Glue to help you manage things like a
Data Catalog, but it still puts a lot on you to figure that out for yourself. If you
really want extra help in this space, there are also many third party vendors that will
provide a lot of oomph here. (Oomph is a technical term. 😂) Depending on your
company’s needs, it might be worth that extra investment to bring in a third party
vendor to help you organize your data lake. (I’m not overly familiar with it, but AWS
does also offer a service called Lake Formation that may also be worth looking into.)
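For a taste of what "AWS Glue helping you manage a Data Catalog" can look like, here's a minimal sketch that points a Glue crawler at a lake prefix so it can infer schemas into the catalog. The crawler name, IAM role, database, and S3 path are all hypothetical.

```python
import boto3

glue = boto3.client("glue")

# A crawler scans the S3 prefix, infers schemas, and registers tables in the Data Catalog.
glue.create_crawler(
    Name="acme-sales-crawler",                              # hypothetical crawler name
    Role="arn:aws:iam::111122223333:role/acme-glue-role",   # hypothetical IAM role
    DatabaseName="acme_data_lake",                          # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://acme-sales-data-prod/daily_sales/"}]},
)

# Kick off the first crawl; afterwards the tables show up in the Glue Data Catalog.
glue.start_crawler(Name="acme-sales-crawler")
```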
Lake Consumption: Things can get a little bit tricky when you want to build
analytical solutions on top of your data lake. Whereas AWS accounts don’t
necessarily matter when putting data into a data lake, they do matter more for your
consumption solutions. Multiple accounts can draw from the same data lake, but
you have to ensure that they all have the proper security credentials to access those
underlying S3 buckets. And chances are, you don’t want to give blanket access to
everybody for every S3 bucket in your data lake. Again, this is where data
management and governance is extremely important, so it again may be worth the
investment to leverage those same third party governance tools to help divvy out
security credentials appropriately.
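One common pattern on the consumption side is for each consuming account to assume a narrowly scoped role that only grants access to the buckets (or prefixes) it actually needs. Here's a rough boto3 sketch of that idea; the role ARN, bucket, and prefix are placeholders.

```python
import boto3

sts = boto3.client("sts")

# Assume a role in the data lake account that only allows reading a specific prefix.
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/acme-lake-sales-readonly",  # hypothetical role
    RoleSessionName="sales-dashboard-session",
)

credentials = assumed["Credentials"]

# Use the temporary credentials to read from the data lake bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

response = s3.list_objects_v2(Bucket="acme-sales-data-prod", Prefix="daily_sales/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```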
Alrighty, that wraps up this post! This was a pretty foreign concept to me until fairly
recently, so don’t beat yourself up if you didn’t fully grasp this even if you have an AWS
certification. If you enjoyed this post, you might also appreciate some of my other posts,
including last week’s post on my five tips for getting started in AWS. Thanks for reading!