Big Data PDF
determine its essential features. Data analysis is the process of compiling, processing,
and analyzing data so that you can use it to make decisions.
Analytics is the systematic analysis of data. Data analytics is the specific analytical
process being applied.
Effective data analysis solutions require both storage and the ability to analyze data in
near real time, with low latency,
while yielding high-value returns.
Streaming data is a source of business data that is gaining popularity. This data source
is less structured. It may require special software to collect the data and specific
processing applications to correctly aggregate and analyze it in near real-time.
Public data sets are another source of data for businesses. These include census data,
health data, population data, and many other datasets that help businesses understand
the data they are collecting on their customers. This data may need to be transformed
so that it will contain only what the business needs.
It is vital to spot trends, make correlations, and run more efficient and profitable
businesses. It's time to put your data to work.
When businesses have more data than they are able to
process and analyze, they have a volume problem.
● Structured data is organized and stored in the form of values that are
grouped into rows and columns of a table.
● Semistructured data is often stored in a series of key-value pairs that are
grouped into elements within a file.
● Unstructured data is not structured in a consistent way. Some unstructured data may have a structure similar to semistructured data, while other unstructured data may contain only metadata.
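To make the distinction concrete, here is a small sketch with invented values showing the same kind of customer information in each of the three forms:

```python
import json

# Structured: a row whose values line up with fixed table columns.
columns = ["customer_id", "name", "city"]
row = (1001, "Ana", "Lisbon")

# Semistructured: key-value pairs grouped into an element; fields can vary per record.
semistructured = json.dumps(
    {"customer_id": 1001, "name": "Ana", "preferences": {"newsletter": True}}
)

# Unstructured: free-form content with, at most, some descriptive metadata.
unstructured = {
    "metadata": {"type": "support_email"},
    "body": "Hi, my doorbell stopped streaming video last night...",
}

print(dict(zip(columns, row)))
print(json.loads(semistructured))
print(unstructured["metadata"])
```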
Amazon S3 concepts
To get the most out of Amazon S3, you need to understand a few simple concepts.
First, Amazon S3 stores data as objects within buckets. An object is composed of a
file and any metadata that describes that file. To store an object in Amazon S3, you
upload the file you want to store into a bucket. When you upload a file, you can set
permissions on the object and add any metadata. Buckets are logical containers for
objects. You can have one or more buckets in your account and can control access for
each bucket individually. You control who can create, delete, and list objects in the
bucket. You can also view access logs for the bucket and its objects and choose the
geographical region where Amazon S3 will store the bucket and its contents.
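As a brief illustration of these concepts, here is a sketch using the AWS SDK for Python (boto3); the bucket name, file name, and metadata are hypothetical, and it assumes credentials and a region are already configured:

```python
import boto3

s3 = boto3.client("s3")  # assumes credentials and a default region are configured

# Upload a file as an object into a bucket, attaching descriptive metadata.
s3.upload_file(
    Filename="sales-2023.csv",          # hypothetical local file
    Bucket="example-analytics-bucket",  # hypothetical bucket name
    Key="raw/sales-2023.csv",
    ExtraArgs={"Metadata": {"department": "sales", "source": "pos-system"}},
)

# List the objects stored under a prefix in the same bucket.
response = s3.list_objects_v2(Bucket="example-analytics-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```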
Although this may require an additional step to load your data into the right tool, using
Amazon S3 as your central data store provides even more benefits over traditional
storage options.
With all of these capabilities, you only pay for the actual amount of data you process or the compute time you consume.
DATA LAKE
A data lake is an architectural concept that helps you manage multiple data types from multiple sources, both structured and unstructured, through a single set of tools.
Let’s break that down. A data lake takes Amazon S3 buckets and organizes them by
categorizing the data inside the buckets. It doesn’t matter how the data got there or
what kind it is. You can store both structured and unstructured data effectively in an
Amazon S3 data lake. AWS offers a set of tools to manage the entire data lake without
treating each bucket as a separate, unassociated object.
Many businesses end up grouping data together into numerous storage locations called
silos. These silos are rarely managed and maintained by the same team, which can be
problematic. Inconsistencies in the way data was written, collected, aggregated, or
filtered can cause problems when it is compared or combined for processing and
analysis.
For example, one team may use the address field to store both the street number and
street name, while another team might use separate fields for street number and street
name. When these datasets are combined, there is now an inconsistency in the way the
address is stored, and it will make analysis very difficult.
But by using data lakes, you can break down data silos and bring data into a single,
central repository that is managed by a single team. That gives you a single, consistent
source of truth.
Because data can be stored in its raw format, you don’t need to convert it, aggregate it,
or filter it before you store it. Instead, you can leave that pre-processing to the system
that processes it, rather than the system that stores it.
In other words, you don’t have to transform the data to make it usable. You keep the
data in its original form, however it got there and however it was written. When you’re talking about exabytes of data, you can’t afford to pre-process that data into every conceivable form in which it might later need to be presented.
Let’s talk about having a single source of truth. When we talk about truth in relation to
data, we mean the trustworthiness of the data. Is it what it should be? Has it been
altered? Can we validate the chain of custody? When creating a single source of truth,
we’re creating a dataset, in this case the data lake, which can be used for all processing
and analytics. The bonus is that we know it to be consistent and reliable. It’s
trustworthy.
DATA WAREHOUSE (AMAZON REDSHIFT):
Structured data storage is performed through databases. We are going to focus on one
specific type of database, a data warehouse, which is one of the most common
analytical database solutions. Data warehouses are used as a central system for storing
analytical data from multiple sources. Say a company has lots of different databases,
because different departments use them to track different things. That’s fine, but that
makes it difficult to get a good idea of what’s happening across all departments.
A data warehouse is a central repository of structured data from many data sources.
This data is transformed, aggregated, and prepared before it is loaded into the data
warehouse. The data within the data warehouse is then used for business reporting and
analysis.
Data warehouses are databases that store transactional data in a format that
accommodates large, complex queries. Warehouses have been the backbone of
business intelligence for decades. From small businesses with a relatively small dataset
to enormous businesses with exabytes of data, you can use data warehousing to make
sense of all kinds of data.
Before we continue talking about what Hadoop is, let’s define what Hadoop is not. It is
not a database or a replacement for existing data systems. It is not a single application.
Instead, it’s a framework of tools that help you both store and process data.
For now, let’s focus on the storage part of this framework. Hadoop supports rapid data
transfers, which means you can speed up the processing time for complex queries.
Whether you use Hadoop on-premises or Amazon EMR, you will use the same tools,
with one major exception: Amazon EMR uses its own file system. And that means you
can use your Amazon S3 data lake as the data store. So there’s no need to copy data
into the cluster, as you would with Hadoop on-premises.
Data warehouse versus data lake
● Data quality: A data warehouse contains highly curated data that serves as the central version of the truth, while a data lake can contain any data, which may or may not be curated (for example, raw data).
Apache Hadoop
Hadoop uses a distributed processing architecture, in which a task is mapped to a
cluster of commodity servers for processing. Each piece of work distributed to the
cluster servers can be run or re-run on any of the servers. The cluster servers frequently
use the Hadoop Distributed File System (HDFS) to store data locally for processing.
The results of the computation performed by those servers are then reduced to a single
output set. One node, designated as the master node, controls the distribution of tasks
and can automatically handle server failures.
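To make the map and reduce phases concrete, here is a tiny single-machine sketch of the same pattern that Hadoop distributes across a cluster; the log lines and the two-chunk split are invented:

```python
from collections import Counter
from functools import reduce

log_lines = [
    "ERROR disk full on node-3",
    "INFO backup completed",
    "ERROR disk full on node-7",
]

# Map: each "server" turns its chunk of the input into intermediate key-value pairs,
# here counting log levels within its own chunk.
mapped = [
    Counter(line.split()[0] for line in chunk)
    for chunk in (log_lines[:2], log_lines[2:])
]

# Reduce: the intermediate results are combined into a single output set.
totals = reduce(lambda a, b: a + b, mapped)
print(totals)  # Counter({'ERROR': 2, 'INFO': 1})
```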
When businesses need rapid insights from the data they
are collecting, but the systems in place simply cannot
meet the need, there's a velocity problem.
Generally speaking, there are two types of processing: batch processing and stream
processing. These require different architectures and enable different levels of analysis.
Batch processing means processing content in batches. You would use batch
processing when you have a lot of data to process, and you need to process it at certain
intervals—for example, on a schedule or when you reach a certain volume of data. This
kind of processing is performed on datasets like server logs, financial data, fraud
reports, and clickstream summaries.
Stream processing means processing data in a stream—in other words, processing
data that’s generated continuously, in small datasets (measured in kilobytes). You
would use stream processing when you need real-time feedback or continuous insights.
This kind of processing is performed on datasets like IoT sensor data, e-commerce
purchases, in-game player activity, clickstreams, or information from social networks.
Many organizations use both types of processing, stream and batch, on the exact same
dataset. Stream processing is used to get initial insights and real-time feedback, while
batch processing is used to get deep insights from complex analytics. For example,
consider credit card transactions. Have you ever received a text just moments after swiping your card? That is a streaming fraud-prevention alert: a stream process that happens in near real time.
Another process, running regularly on the same data, is the credit card company analyzing the day’s customer fraud-prevention alert trends. Same data, two completely different business needs being met, two different velocities.
Data processing and the challenges associated with it can often be defined by the
velocity at which the data must be collected and processed. Batch processing comes in
two forms: scheduled and periodic.
Data acceleration
Another key characteristic of data velocity is data acceleration, which is the rate at which large collections of data can be ingested, processed, and analyzed. Data
acceleration is not constant. It comes in bursts. Take Twitter as an example. Hashtags
can become hugely popular and appear hundreds of times in just seconds, or slow
down to one tag an hour. That's data acceleration in action. Your system must be able
to efficiently handle the peak of hundreds of tags a second and the lows of one tag an
hour.
Batch:
Batch processing is the execution of a series of programs, or jobs, on one or more
computers without manual intervention. Data is collected into batches asynchronously.
The batch is sent to a processing system when specific conditions are met, such as a
specified time of day. The results of the processing job are then sent to a storage
location that can be queried later as needed.
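Here is a minimal sketch of that idea in Python, with invented records: data accumulates in a batch, and the processing job runs only when a condition (in this case, batch size) is met.

```python
BATCH_SIZE = 3          # hypothetical trigger condition; could also be a time of day
batch, results = [], []

def process_batch(records):
    """Stand-in for the processing job; here it simply totals order amounts."""
    return sum(r["amount"] for r in records)

for record in [{"amount": 10}, {"amount": 25}, {"amount": 5}, {"amount": 40}]:
    batch.append(record)              # data is collected into the batch asynchronously
    if len(batch) >= BATCH_SIZE:      # condition met: send the batch for processing
        results.append(process_batch(batch))
        batch.clear()

print(results)  # the output would normally be written to storage that can be queried later
```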
Remember, this is the service that will process the incoming stream. Parallel
consumption of data is one of the greatest benefits where velocity is concerned. Parallel
consumption lets multiple consumers work simultaneously on the same data to build
parallel applications on top of the data and manage their own time frames.
AMAZON KINESIS:
Amazon Kinesis offers four capabilities: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics, and Amazon Kinesis Video Streams. Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
In this architecture, sensor data is collected as a stream by Kinesis Data Firehose, which is configured to send the data to Kinesis Data Analytics for processing. Kinesis Data Analytics filters the stream for the relevant records and sends them into a second Kinesis Data Firehose process, which places the results into an Amazon S3 bucket at the serving layer; the non-relevant records are simply discarded. Now that the relevant data is in Amazon S3, you can use Amazon Athena to query those records and Amazon QuickSight to produce insightful dashboards and reports.
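As a rough sketch of how a producer might feed the front of such a pipeline with boto3, where the delivery stream name and the sensor reading are hypothetical:

```python
import json
import boto3

firehose = boto3.client("firehose")  # assumes credentials and a region are configured

# A producer sends one sensor reading into the delivery stream; Kinesis Data Firehose
# handles buffering and delivery to the configured destination (for example, Amazon S3).
reading = {"sensor_id": "temp-42", "celsius": 21.7, "ts": "2023-05-01T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="sensor-ingest-stream",  # hypothetical stream name
    Record={"Data": (json.dumps(reading) + "\n").encode("utf-8")},
)
```

The filtering, the second delivery stream, and the S3 destination described above would be configured on the AWS side rather than in this producer code.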
Non-relational databases
Non-relational databases are built to store semistructured and unstructured data in a
way that provides for rapid collection and retrieval. There are several broad categories
of non-relational databases, and data is stored in each to meet specific requirements.
Document stores are a type of non-relational database that store semistructured and
unstructured data in the form of files. These files come in various formats, including JSON, BSON, and XML. The files can be navigated using numerous languages, including
Python and Node.js.
Logically, files contain data stored as a series of elements. Each element is an instance
of a person, place, thing, or event. For instance, the document store may hold a series
of log files from a set of servers. These log files can each contain the specifics for that
system without concern for what the log files in other systems contain.
Strengths:
● Flexibility
● No need to plan for a specific type of data when creating one
● Easy to scale
Weaknesses:
● Sacrifice ACID compliance for flexibility
● Cannot query across files
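As a small illustration of the document model described above, two log documents from different servers can legitimately carry different fields (the values here are invented):

```python
import json

documents = [
    {"host": "web-01", "level": "ERROR", "message": "timeout", "request_id": "abc123"},
    {"host": "db-02", "level": "WARN", "message": "slow query", "duration_ms": 950},
]

# Each element describes one event; there is no shared schema to plan in advance.
for doc in documents:
    print(json.dumps(doc))
```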
Key-value databases are a type of non-relational database that store unstructured data
in the form of key-value pairs.
Logically, data is stored in a single table. Within the table, the values are associated
with a specific key. The values are stored in the form of blob objects and do not require
a predefined schema. The values can be of nearly any type.
Strengths:
● Very flexible
● Able to handle a wide variety of data types
● Keys are linked directly to their values with no need for indexing or complex
join operations
● Content of a key can easily be copied to other systems without
reprogramming the data
Weaknesses:
● Impossible to query values because they are stored as a single blob
● Updating or editing the content of a value is quite difficult
● Not all objects are easily modeled as key-value pairs
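Here is a brief sketch of the key-value model using Amazon DynamoDB through boto3; the table, its partition key, and the stored values are hypothetical, and credentials are assumed to be configured:

```python
import boto3

dynamodb = boto3.resource("dynamodb")   # assumes credentials and a region are configured
table = dynamodb.Table("SessionStore")  # hypothetical table whose partition key is "session_id"

# The key is linked directly to its value; no schema beyond the key is required.
table.put_item(
    Item={"session_id": "u-1001", "cart": ["doorbell", "cat-food"], "ttl": 1735689600}
)

# Retrieval is a direct lookup by key, with no joins or secondary structures needed.
item = table.get_item(Key={"session_id": "u-1001"}).get("Item")
print(item)
```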
GRAPH DATABASES:
Graph databases are purpose-built to store any type of data: structured, semistructured,
or unstructured. The purpose for organization in a graph database is to navigate
relationships. Data within the database is queried using specific languages associated
with the software tool you have implemented.
Logically, data is stored as a node, and edges store information on the relationships
between nodes. An edge always has a start node, end node, type, and direction, and an
edge can describe parent-child relationships, actions, ownership, and the like. There is
no limit to the number and kind of relationships a node can have.
Strengths:
● Allow simple, fast retrieval of complex hierarchical structures
● Great for real-time big data mining
● Can rapidly identify common data points between nodes
● Great for making relevant recommendations and allowing for rapid querying
of those relationships
Weaknesses:
● Cannot adequately store transactional data
● Analysts must learn new languages to query the data
● Performing analytics on the data may not be as efficient as with other
database types.
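As a toy illustration of the node-and-edge model in plain Python (the people, products, and relationships are invented); a real graph database would expose the same traversal through its own query language:

```python
# Nodes hold entities; edges hold relationships, each with a start, an end, a type, and a direction.
nodes = {"ana": {"kind": "person"}, "bo": {"kind": "person"}, "doorbell": {"kind": "product"}}
edges = [
    {"start": "ana", "end": "doorbell", "type": "PURCHASED"},
    {"start": "bo", "end": "ana", "type": "FOLLOWS"},
]

# "What did the people Bo follows purchase?" -- a simple relationship traversal.
followed = {e["end"] for e in edges if e["start"] == "bo" and e["type"] == "FOLLOWS"}
purchases = [e["end"] for e in edges if e["start"] in followed and e["type"] == "PURCHASED"]
print(purchases)  # ['doorbell']
```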
Data veracity is the degree to which data is accurate, precise, and trusted.
Domain integrity is the process of ensuring that the data being entered into a field matches the data type defined for that field.
Entity integrity is the process of ensuring that the values stored within a field match the constraints defined for that field.
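As a small, invented illustration of how those two checks might look in application code:

```python
def satisfies_domain_integrity(value, expected_type):
    """Domain integrity: the value entered into the field matches the field's data type."""
    return isinstance(value, expected_type)

def satisfies_entity_integrity(value, constraint):
    """Entity integrity (as defined above): the stored value meets the field's constraint."""
    return constraint(value)

print(satisfies_domain_integrity("2024-01-31", str))                  # True: date stored as text
print(satisfies_entity_integrity(150, lambda age: 0 <= age <= 120))   # False: violates the age constraint
```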
BASE consistency is most concerned with the rapid availability of data. BASE consistency is commonly implemented for NoSQL databases, in distributed systems, and on unstructured data stores. To ensure availability, changes to data are made available immediately on the instance where the change was made. However, it may take time for that change to be replicated across all instances. Eventually, the change will be fully consistent across all instances. BASE is an acronym for Basically Available, Soft state, Eventually consistent. It is a method for maintaining consistency and integrity in a structured or semistructured database.
BASE compliance
BASE supports data integrity in non-relational databases. Non-relational databases such as DynamoDB still use transactions for processing requests. These databases favor availability over strict consistency of the data. To ensure the data is highly available, changes to data are made available immediately on the instance where the change was made. However, it may take time for that change to be replicated across the fleet of instances. The aim is that the change will eventually be fully consistent across the fleet.
The BA in BASE stands for basically available. This lets one instance of the database
receive a new record—or update an existing record—and make that change available
immediately for that instance. As this change is replicated across all the other instances,
the other instances will eventually become consistent.
In an ACID system, the change would not become available until all instances were
consistent. That's the key. In a BASE system, complete consistency is traded for
immediate availability. Eventually, you get complete consistency and availability in both
consistency models. The difference is in which one comes first.
The S in BASE stands for soft state. In a BASE system, there are allowances for partial
consistency across distributed instances. For this reason, BASE systems are
considered to be in a soft state, also known as a changeable state. To contrast this, in
an ACID system, the database is considered to be in a hard state because users can
only access data that is fully consistent.
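Here is a toy simulation of that trade-off, with invented replicas and values: the write is visible immediately on the instance that accepted it and reaches the other instances a moment later.

```python
import threading
import time

replicas = [{"balance": 100}, {"balance": 100}, {"balance": 100}]

def write(value):
    replicas[0]["balance"] = value        # basically available: the accepting instance updates now

    def replicate():
        time.sleep(0.1)                   # soft state: the other replicas lag briefly
        for replica in replicas[1:]:
            replica["balance"] = value

    threading.Thread(target=replicate).start()

write(250)
print([r["balance"] for r in replicas])   # likely [250, 100, 100] right after the write
time.sleep(0.2)
print([r["balance"] for r in replicas])   # eventually consistent: [250, 250, 250]
```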
Amazon EMR is a more hands-on approach to creating your data pipeline. This
service provides a robust data collection and processing platform. Using this
service requires you to have strong technical knowledge and know-how on your
team. The upside of this is that you can create a more customized pipeline to fit
your business needs. Additionally, your infrastructure costs may be lower than
running the same workload on AWS Glue.
AWS Glue is a serverless, managed ETL tool that provides a much more
streamlined experience than Amazon EMR. This makes the service great for
simple ETL tasks, but you will not have as much flexibility as with Amazon EMR.
You can also use AWS Glue as a metastore for your final transformed data by
using the AWS Glue Data Catalog. This catalog is a drop-in replacement for a
Hive metastore.
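As a hedged sketch of reading that catalog from code with boto3, where the database name is hypothetical:

```python
import boto3

glue = boto3.client("glue")  # assumes credentials and a region are configured

# Browse the AWS Glue Data Catalog much as you would browse a Hive metastore:
# each table entry records its schema and where the underlying data lives (often Amazon S3).
for table in glue.get_tables(DatabaseName="sales_lake")["TableList"]:  # hypothetical database name
    location = table.get("StorageDescriptor", {}).get("Location")
    print(table["Name"], location)
```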
There are five types of analysis: descriptive, diagnostic, predictive, prescriptive, and
cognitive. Let’s begin with descriptive analysis, often called data mining. This form is the
oldest and most common. This method involves aggregating or comparing historic
values to answer the question “What happened?” or “What is happening?” For instance,
what were doorbell sales last month? Or what has been the highest grossing cat food of
the fourth quarter so far? This form of analysis provides insights and requires a large
amount of human judgment to turn those insights into actionable information.
Michelle: The next form of analysis is diagnostic. This method involves comparing
historic data, often gathered in descriptive analysis, to other data sets to answer the
question “Why did it happen?” With this information, you can answer questions such as
why did our social media impressions dip last month? Or what is the likely cause of the
increase in customer complaints?
As with descriptive analysis, this form of analysis provides insights and requires human
judgment to turn those insights into actionable information.
Blaine: The next form of analysis is predictive. This method involves predicting what
might happen in the future based on what happened in the past. For instance, what are
sales likely to be in 2020 based on our current rate of growth? What is the total number
of new subscriptions we can expect based on last year’s trajectory? This form of
analysis provides insights called predictions. These predictions also require human
judgment to evaluate the validity, ensuring they are realistic.
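As a tiny worked illustration of that kind of projection, with invented sales figures, a straight-line trend fitted to past years can be extended one year forward; a person still judges whether the result is realistic:

```python
# Fit a straight line to historical sales and extend it one year ahead.
years = [2016, 2017, 2018, 2019]
sales = [120, 150, 185, 210]  # invented yearly sales, in thousands

n = len(years)
mean_x, mean_y = sum(years) / n, sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales)) / sum(
    (x - mean_x) ** 2 for x in years
)
intercept = mean_y - slope * mean_x

print(round(slope * 2020 + intercept))  # projected 2020 sales, in thousands
```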
Michelle: The next form of analysis is prescriptive. This is where the action really heats
up. This method involves looking at historic data and predictions to answer the question
“What should be done?” Now, this form of analysis deviates from the three previous in
that it requires applications to incorporate rules and constraints to make intelligent
recommendations. The greatest advantage of this form of analysis is that it can be
automated. Applications that implement prescriptive analysis can make
recommendations, or decisions, and take action based on those recommendations.
Machine learning makes this method of analysis possible. Machine learning models
perform the analysis. These models become more accurate through a process called
training. During training, the application runs the data through the rules and constraints
several times. This refines the model’s ability to make accurate recommendations. For
example, Amazon.com recommends products based upon a customer’s purchase
history.
The only time human involvement is required for prescriptive analysis is when you’re
building and training the model. Once the model is in production, it can run independently of human interaction. This form of analysis provides insights and recommendations.
Blaine: The final form of analysis is cognitive. This form of analysis uses a type of
artificial intelligence called deep learning. Deep learning, along with prescriptive
analysis, is used to make decisions and take action based upon visual, auditory, or
even natural language inputs. Deep learning mimics human judgment by combining
existing data and patterns to draw conclusions. With each analysis, the results feed
back into a knowledge database to inform future decisions. This creates a self-learning
feedback loop. Think of Amazon Alexa—the more questions you ask Alexa, the more
the system learns from you.