- [Raf] You may already know that the cloud is more flexible, scalable, secure, distributed, and resilient. But I want to take a more data-focused approach to why cloud computing is relevant for data analytics. In this section, I will explain why the cloud is the best place to perform data analytics nowadays, and why it has proven so solid for operating big data workloads. So, let's get started. Before we start talking about the cloud, allow me to go back in time, maybe a decade, and tell you a brief story. After that trip back in time, it will be natural for you to understand why everybody loves doing data analytics in the cloud. Ready for the journey? Get your beverage of choice, and buckle up! (cup hitting the floor) (whirring sound)

Years ago, the most common approach for companies to get compute infrastructure, big data included, was to buy servers and install them in data centers. This is usually called colocation, or colo. The thing is, servers used for data operations are not cheap, because they need lots of storage, consume lots of electricity, and require careful maintenance to guarantee data durability. Hence, entire dedicated infrastructure teams. And trust me, I've been one of those infrastructure analysts working with data centers. It is expensive and overwhelming. In that scenario, only big companies were able to work with big data, and consequently, data analytics was not popular. It was very common for those servers to have a RAID storage controller that replicated data across the disks, increasing the cost and the maintenance burden even more. In the early 2000s, big data operations were closely tied to the underlying hardware, such as mainframes and server clusters. Although this was extremely profitable for the ones selling the hardware, it was expensive and inflexible for the consumers.

Then, something fantastic started to happen. And the name of this fantastic thing is Apache Hadoop. Essentially, what Hadoop does is replace all that fancy hardware with software installed on plain operating systems. Yeah, that's right. With the help of Hadoop and its computing frameworks, data could be distributed and replicated across multiple servers as a distributed system, eliminating the need for expensive data-replication hardware just to start working with big data. All you needed was efficient network equipment, and the data would be synchronized over the network to the other servers. By embracing failures instead of trying to avoid them, Hadoop helped reduce hardware complexity. And when you reduce hardware complexity, you reduce cost. And by reducing cost, you start to democratize big data, because smaller companies could start leveraging it as well. Welcome to the big data boom. I brought up Hadoop because it is the most popular open-source big data ecosystem. There are others, but what I wanted to highlight here is the concept, not specific frameworks or vendors.

The thing is, by stripping the hardware back to a basic level and moving the big data concepts, such as data replication, into software, we can start thinking about running big data operations on any provider capable of offering virtual machines with storage and a network card attached. We can start thinking about using the cloud to build entire data lakes, data warehouses, and data analytics solutions. Since then, cloud computing has emerged as an attractive alternative, because that is exactly what it offers: you can get virtual machines, install the software that handles data replication, distributed file systems, and entire big data ecosystems, and be happy without having to spend lots of money on hardware.

The advantage is that the cloud does not stop there. Many cloud providers, such as Amazon Web Services, noticed that customers were spinning up virtual machines just to install big data tools and frameworks. Based on that, Amazon started to create offerings with everything already installed, configured, and ready to use. That's why you have AWS services such as Amazon EMR, Amazon S3, Amazon RDS, Amazon Athena, and many others. Those are what we call managed services, and all of them operate in the data scope. In a later lesson, I will talk more about some of the services we will need to build our basic data analytics solution.

Another big advantage of running data analytics in the cloud is the ability to stop paying for infrastructure resources when you don't need them anymore. This is very common in data analytics because, due to the nature of big data operations, you may only need to run reports once in a while. You can easily do that in the cloud by spinning up servers or services, using them, getting the report you need, saving it, and then turning everything off. In addition, you can temporarily spin up more servers to speed up your jobs and turn them off when you're done.

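For example, using the boto3 SDK, a transient Amazon EMR cluster that runs a single Spark job and then shuts itself down might look roughly like this. Treat it as a minimal sketch: the bucket, the script path, the instance types, and the default EMR roles are placeholders you would adapt to your own account.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Transient cluster: it starts, runs one Spark step, and terminates itself,
    # so you only pay while the report job is actually running.
    response = emr.run_job_flow(
        Name="once-in-a-while-report",
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # When there are no more steps to run, shut the cluster down automatically.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[
            {
                "Name": "generate-report",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # Hypothetical script and bucket; the results land in S3,
                    # so they survive after the cluster is gone.
                    "Args": ["spark-submit", "s3://my-analytics-bucket/jobs/report.py"],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

    print("Started transient cluster:", response["JobFlowId"])
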
And since you mostly pay for the time and resources you actually use, 10 servers running for 1 hour tends to cost about the same as one server running for 10 hours.

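To make that concrete, here is a quick back-of-the-envelope check, assuming a hypothetical on-demand rate per server-hour:

    # Hypothetical on-demand price per server-hour, in USD.
    hourly_rate = 0.20

    scale_out = 10 * 1 * hourly_rate  # 10 servers running for 1 hour
    scale_up = 1 * 10 * hourly_rate   # 1 server running for 10 hours

    # Both come to 2.00 USD, but the scaled-out run finishes roughly 10x sooner.
    print(scale_out, scale_up)
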
Basically, with the cloud, you get access to hardware without having to carry all the burden of running data center operations. It is like the best of both worlds. Stay with me to learn more about the AWS services I will use to build my descriptive data analysis solution using Amazon S3, CloudTrail, Amazon Athena, and QuickSight.
