AWS Data Analytics Fundamentals
Big data is an industry term whose meaning has shifted in recent years. Big data solutions are
often part of data analysis solutions.
Business challenge
Imagine an organization that is growing rapidly.
Data is generated in many ways. The big question is where to put it all and how to use
it to create value or gain a competitive advantage.
The challenges identified in many data analysis solutions can be summarized by five
key challenges: volume, velocity, variety, veracity, and value.
The fourth V is veracity, which refers to the trustworthiness of your data. Have you ever heard the
saying, “My word is my bond”? It’s supposed to instill trust, to let you know that the person
saying it is honorable and will do what they say they will. That’s veracity. To have
trustworthy data, you have to know the provenance of your data.
The fifth V is value, which is the bottom line. The whole point of this effort is getting value from data.
That includes creating reports and dashboards that inform critical business decisions. It also includes
highlighting areas for improving the business. And it includes making it easier to find and
communicate critical details about business operations.
Due to the increasing volume, velocity, variety, veracity, and value of data, some data
management challenges cannot be solved with traditional database and processing
solutions. That's where data analysis solutions come in.
Streaming data is a business data source that is gaining popularity. This data source is
less structured than others. It may require special software to collect the data and specific
processing applications to correctly aggregate and analyze it in near real time.
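As a minimal sketch of what collecting streaming data can look like, the following uses the AWS SDK for Python (Boto3) to put one record onto an Amazon Kinesis data stream. The stream name and record fields are hypothetical.

```python
import json

import boto3

# Hypothetical stream and record for illustration only.
kinesis = boto3.client("kinesis")

record = {"device_id": "sensor-42", "temperature": 21.7}

# Each record needs a partition key, which determines which shard it lands on.
kinesis.put_record(
    StreamName="example-clickstream",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["device_id"],
)
```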
Public datasets are another source of data for businesses. These include census data, health
data, population data, and many other datasets that help businesses understand the data they
are collecting on their customers. This data may need to be transformed so that it contains
only what the business needs.
Throughout this course, we will cover the services that AWS offers for each component of a
data analysis solution.
It is vital to spot trends, find correlations, and run a more efficient and profitable business.
It's time to put your data to work.
Scenario 1
My business has a set of 15 JSON data files, each about 2.5 GB in size. They are placed on a
file server once an hour and must be ingested as soon as they arrive in this location. This data
must be combined with all transactions from the financial dashboard for the same period, then
compared to the recommendations from the marketing engine. All data is fully cleansed. The
results for this time period must be made available to decision makers, in the form of financial
dashboards, by 10 minutes after the hour. Based on the scenario, this problem involves volume
(15 files of 2.5 GB each, every hour) and velocity (results are due 10 minutes after the hour).
Scenario 2
My business compiles data generated by hundreds of corporations. This data is delivered to
us in very large files, transactional updates, and even data streams. The data must be cleansed
and prepared to ensure that rogue inputs do not skew the results. Knowing the data source for
each record is vital to the work we do. A large portion of the data gathered is irrelevant to our
analysis, so this data must be eliminated. The final requirement is that all data must be
combined and loaded into our data warehouse, where it will be analyzed.
This problem involves volume, variety, and veracity.
- Volume: The data is delivered in very large files, transactional updates, and even data streams.
- Variety: The business will need to combine the data from all three sources into a single data warehouse.
- Veracity: The data is known to be suspect. It must be cleansed and prepared to ensure that rogue
inputs do not skew the results, and knowing the data source for each record is vital to the work we do.
Structured data is organized and stored in the form of values that are
grouped into rows and columns of a table.
Semistructured data is often stored in a series of key-value pairs that
are grouped into elements within a file.
Unstructured data is not structured in a consistent way. Some of it may
have a structure similar to semistructured data, while other data may
contain only metadata.
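For example, a semistructured record might look like the following JSON document, parsed here with Python's standard json module. The field names are illustrative only.

```python
import json

# A semistructured record: key-value pairs grouped into elements, with no
# fixed schema enforced across records.
raw = '{"customer": {"id": 101, "name": "Ana"}, "tags": ["new", "priority"]}'

record = json.loads(raw)
print(record["customer"]["name"])  # Ana
print(record["tags"])              # ['new', 'priority']
```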
Many internet articles tout the huge amount of information sitting within unstructured
data. New applications are being released that can catalog this untapped resource and
provide incredible insights into it.
But what is unstructured data? It is in every file we store, every picture we take, and
every email we send.
Datasets are getting bigger and more diverse every single day.
Modern data management platforms must capture data from diverse sources at
speed and scale. Data needs to be pulled together in manageable, central
repositories—breaking down traditional silos. The benefits of collection and
analysis of all business data must outweigh the costs.
Customer need
Imagine a business that has implemented Amazon QuickSight as a data visualization tool.
When this tool relies on data stored on premises, latency can be introduced into processing.
This latency can become a problem for users. Another common concern is a user's ability to pull
together the correct data sets to perform the necessary analytics.
- Store anything
- Secure object storage
- Natively online, HTTP access
- Unlimited scalability
- 99.999999999% durability
Amazon S3 is object storage built to store and retrieve any amount of data from anywhere.
Amazon S3 concepts
To get the most out of Amazon S3, you need to understand a few simple concepts.
First, Amazon S3 stores data as objects within buckets.
An object is composed of a file and any metadata that describes that file. To store an
object in Amazon S3, you upload the file you want to store into a bucket. When you
upload a file, you can set permissions on the object and add any metadata.
Buckets are logical containers for objects. You can have one or more buckets in your
account and can control access for each bucket individually. You control who can
create, delete, and list objects in the bucket. You can also view access logs for the
bucket and its objects and choose the geographical region where Amazon S3 will store
the bucket and its contents.
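As a sketch of these concepts using the AWS SDK for Python (Boto3), the following creates a bucket in a chosen Region and then uploads a file as an object with user-defined metadata. The bucket name, Region, file, and metadata keys are hypothetical, and bucket names must be globally unique.

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Choose the geographical Region where Amazon S3 will store the bucket.
s3.create_bucket(
    Bucket="example-analytics-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Upload a file as an object, attaching user-defined metadata alongside it.
with open("sales-2019-01.json", "rb") as f:
    s3.put_object(
        Bucket="example-analytics-bucket",
        Key="raw/sales-2019-01.json",
        Body=f,
        Metadata={"department": "finance", "source": "pos-system"},
    )
```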
Object metadata
For each object stored in a bucket, Amazon S3 maintains a set of system metadata.
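For instance, a HEAD request (here via Boto3's head_object, continuing the hypothetical bucket and key from the previous example) returns that system metadata along with any user-defined metadata:

```python
import boto3

s3 = boto3.client("s3")

# A HEAD request returns the object's metadata without the object itself.
response = s3.head_object(
    Bucket="example-analytics-bucket",
    Key="raw/sales-2019-01.json",
)

print(response["ContentLength"])  # system metadata: object size in bytes
print(response["LastModified"])   # system metadata: last-modified timestamp
print(response["Metadata"])       # user-defined metadata set at upload time
```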
Data analysis solutions on Amazon S3
There are numerous advantages of using Amazon S3 as the storage platform for your
data analysis solution.
With Amazon S3, you can cost-effectively store all data types in their native formats. You
can then launch as many or as few virtual servers as needed using Amazon Elastic Compute
Cloud (Amazon EC2) and use AWS analytics tools to process your data. You can optimize
your EC2 instances to provide the correct ratios of CPU, memory, and bandwidth for best
performance.
Decoupling your processing and storage provides a significant number of benefits, including
the ability to process and analyze the same data with a variety of tools.
Amazon S3 makes it easy to build a multi-tenant environment, where many users can bring
their own data analytics tools to a common set of data. This improves both cost and data
governance over traditional solutions, which require multiple copies of data to be distributed
across multiple processing platforms.
Although this may require an additional step to load your data into the right tool, using
Amazon S3 as your central data store provides even more benefits over traditional storage
options.
Combine Amazon S3 with other AWS services to query and process data. Amazon S3 also
integrates with AWS Lambda serverless computing to run code without provisioning or
managing servers. Amazon Athena can query Amazon S3 directly using the Structured Query
Language (SQL), without the need for data to be ingested into a relational database.
With all of these capabilities, you only pay for the actual amounts of data you process or the
compute time you consume.
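As a brief sketch of this pay-per-query model using Boto3, the following submits a SQL query to Amazon Athena against data in Amazon S3. The database, table, and output bucket names are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Submit standard SQL against files in Amazon S3; query results are written
# to the S3 output location below.
athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={
        "OutputLocation": "s3://example-analytics-bucket/athena-results/"
    },
)
```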
Standardized Application Programming Interfaces (APIs)
Representational State Transfer (REST) APIs are programming interfaces commonly used to
interact with files in Amazon S3. Amazon S3's RESTful APIs are simple, easy to use, and
supported by most major third-party independent software vendors (ISVs), including leading
Apache Hadoop and analytics tool vendors. This allows customers to bring the tools
they are most comfortable with and knowledgeable about to help them perform analytics on
data in Amazon S3.
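For example, one simple way to see the REST interface at work is to generate a presigned URL with Boto3 and fetch the object with an ordinary HTTPS GET, here using the third-party requests library. The bucket and key are the hypothetical ones from earlier.

```python
import boto3
import requests  # third-party HTTP client, installed separately

s3 = boto3.client("s3")

# Generate a time-limited, signed HTTPS URL for the object, then fetch it
# with a plain REST GET request; no AWS SDK is needed on the consuming side.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-analytics-bucket", "Key": "raw/sales-2019-01.json"},
    ExpiresIn=3600,  # URL validity in seconds
)

response = requests.get(url)
print(response.status_code)
```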
Topic 2: Introduction to data lakes
Storing business content has always been a point of contention, and often frustration, within
businesses of all types.
Should content be stored in folders? Should prefixes and suffixes be used to identify
file versions? Should content be divided by department or specialty? The list goes on
and on.
The issue stems from the fact that many companies start to implement document or
file management systems with the best of intentions but don't have the foresight or
infrastructure in place to maintain the initial data organization.
Out of the dire need for organizing the ever-increasing volume of data, data lakes were
born.
Business challenge
Businesses grow over time. As they do, a natural result is that important files and data get
scattered across the enterprise. It is very common to find employees who have no idea where
data can be found and—even worse—how to analyze it when it is in different locations.
Data lakes promise the ability to store all data for a business in a single repository.
You can leverage data lakes to store large volumes of data instead of persisting that
data in data warehouses. Data lakes, such as those built in Amazon S3, are generally
less expensive than specialized big data storage solutions. That way, you only pay for
the specialized solutions when using them for processing and analytics and not for
long-term storage. Your extract, transform, and load (ETL) and analytics processes can
still access this data.
- Single source of truth: Be careful not to let your data lake become a swamp. Enforce
proper organization and structure for all data entering the lake.
- Store any type of data, regardless of structure: Be careful to ensure that data within
the data lake is relevant and does not go unused. Train users on how to access the data,
and set retention policies to ensure the data stays refreshed.
Traditional data storage and analytic tools can no longer provide the agility and flexibility
required to deliver relevant business insights. That’s why many organizations are shifting to a
data lake architecture.
Imagine a business that has thousands of files stored in Amazon S3. The business needs a
solution for automating common data preparation tasks and organizing the data in a secure
repository.
AWS Lake Formation (currently in preview)
AWS Lake Formation makes it easy to set up a secure data lake in days. A data lake is a
centralized, curated, and secured repository that stores all your data, both in its original form
and when prepared for analysis. A data lake enables you to break down data silos and
combine different types of analytics to gain insights and guide better business decisions.
AWS Lake Formation makes it easy to ingest, clean, catalog, transform, and secure
your data and make it available for analysis and machine learning. Lake Formation
gives you a central console where you can discover data sources, set up transformation
jobs to move data to an Amazon S3 data lake, remove duplicates and match records,
catalog data for access by analytic tools, configure data access and security policies,
and audit and control access from AWS analytic and machine learning services.
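As one small, hedged example of the access-control piece, Boto3 exposes a Lake Formation client; the sketch below grants a hypothetical analyst role SELECT permission on a cataloged table. The role ARN, account ID, database, and table names are placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant a hypothetical analyst role SELECT access to one cataloged table.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"
    },
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "transactions"}},
    Permissions=["SELECT"],
)
```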
As the volume of data has increased, so have the options for storing data. Traditional
storage methods such as data warehouses are still very popular and relevant. However,
data lakes have become more popular recently. These new options can confuse
businesses that are trying to be financially wise and technically relevant.
So which is better: data warehouses or data lakes? Neither and both. They are
different solutions that can be used together to maintain existing data warehouses
while taking full advantage of the benefits of data lakes.
Business challenge
Businesses are left asking the question, "Why?" Why should we spend a bunch of time and
money implementing a data lake when we have invested so much into a data warehouse? It is
important to remember that a data lake augments, but does not replace, a data warehouse.
Data warehouses
A data warehouse is a central repository of structured data
from many data sources. This data
is transformed, aggregated, and prepared for business
reporting and analysis.
A data warehouse is a central repository of information coming from one or more data
sources. Data flows into a data warehouse from transactional systems, relational
databases, and other sources. These data sources can include structured,
semistructured, and unstructured data. These data sources are transformed into
structured data before they are stored in the data warehouse.
Data is stored within the data warehouse using a schema. A schema defines how data
is stored within tables, columns, and rows. The schema enforces constraints on the
data to ensure integrity of the data. The transformation process often involves the
steps required to make the source data conform to the schema. Following the first
successful ingestion of data into the data warehouse, the process of ingesting and
transforming the data can continue at a regular cadence.
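A minimal Python sketch of this transformation step, assuming a hypothetical three-column schema of (customer_id, customer_name, amount), might look like the following.

```python
# Reshape semistructured source records to conform to a fixed table schema.
def to_row(source_record: dict) -> tuple:
    """Map one source record onto the (customer_id, customer_name, amount) schema."""
    return (
        int(source_record["customer"]["id"]),     # enforce integer key
        str(source_record["customer"]["name"]),   # enforce text column
        float(source_record.get("amount", 0.0)),  # enforce numeric constraint
    )

rows = [to_row(r) for r in [
    {"customer": {"id": 101, "name": "Ana"}, "amount": "19.99"},
    {"customer": {"id": 102, "name": "Ben"}},  # missing amount defaults to 0.0
]]
print(rows)  # [(101, 'Ana', 19.99), (102, 'Ben', 0.0)]
```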
Business analysts, data scientists, and decision makers access the data through
business intelligence (BI) tools, SQL clients, and other analytics
applications. Businesses use reports, dashboards, and analytics tools to extract insights
from their data, monitor business performance, and support decision making. These
reports, dashboards, and analytics tools are powered by data warehouses, which store
data efficiently to minimize I/O and deliver query results at blazing speeds to
hundreds or thousands of users concurrently.