Big Data PDF
determine its essential features. Data analysis is the process of compiling, processing,
and analyzing data so that you can use it to make decisions.
Analytics is the systematic analysis of data. Data analytics is the specific analytical
process being applied.
Effective data analysis solutions require both storage and the ability to analyze data in
near real time, with low latency,
while yielding high-value returns.
Streaming data is a source of business data that is gaining popularity. This data source
is less structured. It may require special software to collect the data and specific
processing applications to correctly aggregate and analyze it in near real-time.
Public data sets are another source of data for businesses. These include census data,
health data, population data, and many other datasets that help businesses understand
the data they are collecting on their customers. This data may need to be transformed
so that it will contain only what the business needs.
It is vital to spot trends, make correlations, and run more efficient and profitable
businesses. It's time to put your data to work.
When businesses have more data than they are able to
process and analyze, they have a volume problem.
● Structured data is organized and stored in the form of values that are
grouped into rows and columns of a table.
● Semistructured data is often stored in a series of key-value pairs that are
grouped into elements within a file.
● Unstructured data is not structured in a consistent way. Some unstructured data may have a structure similar to semistructured data, while other unstructured data may contain only metadata.
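To make the distinction concrete, here is a small sketch with invented values showing the same kind of customer information in each of the three forms:

```python
import json

# Structured: a row whose values line up with fixed table columns.
columns = ["customer_id", "name", "city"]
row = (1001, "Ana", "Lisbon")

# Semistructured: key-value pairs grouped into an element; fields can vary per record.
semistructured = json.dumps(
    {"customer_id": 1001, "name": "Ana", "preferences": {"newsletter": True}}
)

# Unstructured: free-form content with, at most, some descriptive metadata.
unstructured = {
    "metadata": {"type": "support_email"},
    "body": "Hi, my doorbell stopped streaming video last night...",
}

print(dict(zip(columns, row)))
print(json.loads(semistructured))
print(unstructured["metadata"])
```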
Amazon S3 concepts
To get the most out of Amazon S3, you need to understand a few simple concepts.
First, Amazon S3 stores data as objects within buckets. An object is composed of a
file and any metadata that describes that file. To store an object in Amazon S3, you
upload the file you want to store into a bucket. When you upload a file, you can set
permissions on the object and add any metadata. Buckets are logical containers for
objects. You can have one or more buckets in your account and can control access for
each bucket individually. You control who can create, delete, and list objects in the
bucket. You can also view access logs for the bucket and its objects and choose the
geographical region where Amazon S3 will store the bucket and its contents.
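As a brief illustration of these concepts, here is a sketch using the AWS SDK for Python (boto3); the bucket name, file name, and metadata are hypothetical, and it assumes credentials and a region are already configured:

```python
import boto3

s3 = boto3.client("s3")  # assumes credentials and a default region are configured

# Upload a file as an object into a bucket, attaching descriptive metadata.
s3.upload_file(
    Filename="sales-2023.csv",          # hypothetical local file
    Bucket="example-analytics-bucket",  # hypothetical bucket name
    Key="raw/sales-2023.csv",
    ExtraArgs={"Metadata": {"department": "sales", "source": "pos-system"}},
)

# List the objects stored under a prefix in the same bucket.
response = s3.list_objects_v2(Bucket="example-analytics-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```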
Although this may require an additional step to load your data into the right tool, using
Amazon S3 as your central data store provides even more benefits over traditional
storage options.
With all of these capabilities, you only pay for the actual amount of data you process or the compute time you consume.
DATA LAKE
A data lake is an architectural concept that helps you manage multiple data types from multiple sources, both structured and unstructured, through a single set of tools.
Let’s break that down. A data lake takes Amazon S3 buckets and organizes them by
categorizing the data inside the buckets. It doesn’t matter how the data got there or
what kind it is. You can store both structured and unstructured data effectively in an
Amazon S3 data lake. AWS offers a set of tools to manage the entire data lake without
treating each bucket as a separate, unassociated object.
Many businesses end up grouping data together into numerous storage locations called
silos. These silos are rarely managed and maintained by the same team, which can be
problematic. Inconsistencies in the way data was written, collected, aggregated, or
filtered can cause problems when it is compared or combined for processing and
analysis.
For example, one team may use the address field to store both the street number and
street name, while another team might use separate fields for street number and street
name. When these datasets are combined, there is now an inconsistency in the way the
address is stored, and it will make analysis very difficult.
But by using data lakes, you can break down data silos and bring data into a single,
central repository that is managed by a single team. That gives you a single, consistent
source of truth.
Because data can be stored in its raw format, you don’t need to convert it, aggregate it,
or filter it before you store it. Instead, you can leave that pre-processing to the system
that processes it, rather than the system that stores it.
In other words, you don’t have to transform the data to make it usable. You keep the
data in its original form, however it got there and however it was written. When you’re talking about exabytes of data, you can’t afford to pre-process that data into every conceivable form in which it might later need to be presented.
Let’s talk about having a single source of truth. When we talk about truth in relation to
data, we mean the trustworthiness of the data. Is it what it should be? Has it been
altered? Can we validate the chain of custody? When creating a single source of truth,
we’re creating a dataset, in this case the data lake, which can be used for all processing
and analytics. The bonus is that we know it to be consistent and reliable. It’s
trustworthy.
DATA WAREHOUSE (AMAZON REDSHIFT):
Structured data storage is performed through databases. We are going to focus on one
specific type of database, a data warehouse, which is one of the most common
analytical database solutions. Data warehouses are used as a central system for storing
analytical data from multiple sources. Say a company has lots of different databases,
because different departments use them to track different things. That’s fine, but that
makes it difficult to get a good idea of what’s happening across all departments.
A data warehouse is a central repository of structured data from many data sources.
This data is transformed, aggregated, and prepared before it is loaded into the data
warehouse. The data within the data warehouse is then used for business reporting and
analysis.
Data warehouses are databases that store transactional data in a format that
accommodates large, complex queries. Warehouses have been the backbone of
business intelligence for decades. From small businesses with a relatively small dataset
to enormous businesses with exabytes of data, you can use data warehousing to make
sense of all kinds of data.
Before we continue talking about what Hadoop is, let’s define what Hadoop is not. It is
not a database or a replacement for existing data systems. It is not a single application.
Instead, it’s a framework of tools that help you both store and process data.
For now, let’s focus on the storage part of this framework. Hadoop supports rapid data
transfers, which means you can speed up the processing time for complex queries.
Whether you use Hadoop on-premises or Amazon EMR, you will use the same tools,
with one major exception: Amazon EMR uses its own file system. And that means you
can use your Amazon S3 data lake as the data store. So there’s no need to copy data
into the cluster, as you would with Hadoop on-premises.
Data warehouse versus data lake
● Data quality: A data warehouse contains highly curated data that serves as the central version of the truth, while a data lake can contain any data, which may or may not be curated (for example, raw data).
Apache Hadoop
Hadoop uses a distributed processing architecture, in which a task is mapped to a
cluster of commodity servers for processing. Each piece of work distributed to the
cluster servers can be run or re-run on any of the servers. The cluster servers frequently
use the Hadoop Distributed File System (HDFS) to store data locally for processing.
The results of the computation performed by those servers are then reduced to a single
output set. One node, designated as the master node, controls the distribution of tasks
and can automatically handle server failures.
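To make the map and reduce phases concrete, here is a tiny single-machine sketch of the same pattern that Hadoop distributes across a cluster; the log lines and the two-chunk split are invented:

```python
from collections import Counter
from functools import reduce

log_lines = [
    "ERROR disk full on node-3",
    "INFO backup completed",
    "ERROR disk full on node-7",
]

# Map: each "server" turns its chunk of the input into intermediate key-value pairs,
# here counting log levels within its own chunk.
mapped = [
    Counter(line.split()[0] for line in chunk)
    for chunk in (log_lines[:2], log_lines[2:])
]

# Reduce: the intermediate results are combined into a single output set.
totals = reduce(lambda a, b: a + b, mapped)
print(totals)  # Counter({'ERROR': 2, 'INFO': 1})
```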
When businesses need rapid insights from the data they
are collecting, but the systems in place simply cannot
meet the need, there's a velocity problem.
Generally speaking, there are two types of processing: batch processing and stream
processing. These require different architectures and enable different levels of analysis.
Batch processing means processing content in batches. You would use batch
processing when you have a lot of data to process, and you need to process it at certain
intervals—for example, on a schedule or when you reach a certain volume of data. This
kind of processing is performed on datasets like server logs, financial data, fraud
reports, and clickstream summaries.
Stream processing means processing data in a stream—in other words, processing
data that’s generated continuously, in small datasets (measured in kilobytes). You
would use stream processing when you need real-time feedback or continuous insights.
This kind of processing is performed on datasets like IoT sensor data, e-commerce
purchases, in-game player activity, clickstreams, or information from social networks.
Many organizations use both types of processing, stream and batch, on the exact same
dataset. Stream processing is used to get initial insights and real-time feedback, while
batch processing is used to get deep insights from complex analytics. For example,
consider credit card transactions. Have you ever received a text just moments after swiping your card? That is a streaming fraud-prevention alert: a stream process that happens in near real time.
Another process, running regularly on the same data, is the credit card company analyzing the day’s customer fraud-prevention alert trends. Same data, two completely different business needs being met, two different velocities.
Data processing and the challenges associated with it can often be defined by the
velocity at which the data must be collected and processed. Batch processing comes in
two forms: scheduled and periodic.
Data acceleration
Another key characteristic of data velocity is data acceleration, which is the rate at which large collections of data can be ingested, processed, and analyzed. Data
acceleration is not constant. It comes in bursts. Take Twitter as an example. Hashtags
can become hugely popular and appear hundreds of times in just seconds, or slow
down to one tag an hour. That's data acceleration in action. Your system must be able
to efficiently handle the peak of hundreds of tags a second and the lows of one tag an
hour.
Batch:
Batch processing is the execution of a series of programs, or jobs, on one or more
computers without manual intervention. Data is collected into batches asynchronously.
The batch is sent to a processing system when specific conditions are met, such as a
specified time of day. The results of the processing job are then sent to a storage
location that can be queried later as needed.
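Here is a minimal sketch of that idea in Python, with invented records: data accumulates in a batch, and the processing job runs only when a condition (in this case, batch size) is met.

```python
BATCH_SIZE = 3          # hypothetical trigger condition; could also be a time of day
batch, results = [], []

def process_batch(records):
    """Stand-in for the processing job; here it simply totals order amounts."""
    return sum(r["amount"] for r in records)

for record in [{"amount": 10}, {"amount": 25}, {"amount": 5}, {"amount": 40}]:
    batch.append(record)              # data is collected into the batch asynchronously
    if len(batch) >= BATCH_SIZE:      # condition met: send the batch for processing
        results.append(process_batch(batch))
        batch.clear()

print(results)  # the output would normally be written to storage that can be queried later
```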
Remember, this is the service that will process the incoming stream. Parallel
consumption of data is one of the greatest benefits where velocity is concerned. Parallel
consumption lets multiple consumers work simultaneously on the same data to build
parallel applications on top of the data and manage their own time frames.
AMAZON KINESIS:
Amazon Kinesis offers four capabilities: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics, and Amazon Kinesis Video Streams. Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
In this architecture, sensor data is collected as a stream by Kinesis Data Firehose, which is configured to send the data to Kinesis Data Analytics for processing. Kinesis Data Analytics filters the stream for the relevant records and sends them into a second Kinesis Data Firehose process, which places the results into an Amazon S3 bucket at the serving layer; the non-relevant records are simply discarded. Now that the relevant data is in Amazon S3, you can use Amazon Athena to query those records and Amazon QuickSight to produce insightful dashboards and reports.
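As a rough sketch of how a producer might feed the front of such a pipeline with boto3, where the delivery stream name and the sensor reading are hypothetical:

```python
import json
import boto3

firehose = boto3.client("firehose")  # assumes credentials and a region are configured

# A producer sends one sensor reading into the delivery stream; Kinesis Data Firehose
# handles buffering and delivery to the configured destination (for example, Amazon S3).
reading = {"sensor_id": "temp-42", "celsius": 21.7, "ts": "2023-05-01T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="sensor-ingest-stream",  # hypothetical stream name
    Record={"Data": (json.dumps(reading) + "\n").encode("utf-8")},
)
```

The filtering, the second delivery stream, and the S3 destination described above would be configured on the AWS side rather than in this producer code.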
Non-relational databases
Non-relational databases are built to store semistructured and unstructured data in a
way that provides for rapid collection and retrieval. There are several broad categories
of non-relational databases, and data is stored in each to meet specific requirements.
Document stores are a type of non-relational database that store semistructured and
unstructured data in the form of files. These files come in various formats, including JSON, BSON, and XML. The files can be navigated using numerous languages, including
Python and Node.js.
Logically, files contain data stored as a series of elements. Each element is an instance
of a person, place, thing, or event. For instance, the document store may hold a series
of log files from a set of servers. These log files can each contain the specifics for that
system without concern for what the log files in other systems contain.
Strengths:
● Flexibility
● No need to plan for a specific type of data when creating one
● Easy to scale
Weaknesses:
● Sacrifice ACID compliance for flexibility
● Cannot query across files
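As a small illustration of the document model described above, two log documents from different servers can legitimately carry different fields (the values here are invented):

```python
import json

documents = [
    {"host": "web-01", "level": "ERROR", "message": "timeout", "request_id": "abc123"},
    {"host": "db-02", "level": "WARN", "message": "slow query", "duration_ms": 950},
]

# Each element describes one event; there is no shared schema to plan in advance.
for doc in documents:
    print(json.dumps(doc))
```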
Key-value databases are a type of non-relational database that store unstructured data
in the form of key-value pairs.
Logically, data is stored in a single table. Within the table, the values are associated
with a specific key. The values are stored in the form of blob objects and do not require
a predefined schema. The values can be of nearly any type.
Strengths:
● Very flexible
● Able to handle a wide variety of data types
● Keys are linked directly to their values with no need for indexing or complex
join operations
● Content of a key can easily be copied to other systems without
reprogramming the data
Weaknesses:
● Impossible to query values because they are stored as a single blob
● Updating or editing the content of a value is quite difficult
● Not all objects are easily modeled as key-value pairs
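Here is a brief sketch of the key-value model using Amazon DynamoDB through boto3; the table, its partition key, and the stored values are hypothetical, and credentials are assumed to be configured:

```python
import boto3

dynamodb = boto3.resource("dynamodb")   # assumes credentials and a region are configured
table = dynamodb.Table("SessionStore")  # hypothetical table whose partition key is "session_id"

# The key is linked directly to its value; no schema beyond the key is required.
table.put_item(
    Item={"session_id": "u-1001", "cart": ["doorbell", "cat-food"], "ttl": 1735689600}
)

# Retrieval is a direct lookup by key, with no joins or secondary structures needed.
item = table.get_item(Key={"session_id": "u-1001"}).get("Item")
print(item)
```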
GRAPH DATABASES:
Graph databases are purpose-built to store any type of data: structured, semistructured,
or unstructured. The purpose for organization in a graph database is to navigate
relationships. Data within the database is queried using specific languages associated
with the software tool you have implemented.
Logically, data is stored as a node, and edges store information on the relationships
between nodes. An edge always has a start node, end node, type, and direction, and an
edge can describe parent-child relationships, actions, ownership, and the like. There is
no limit to the number and kind of relationships a node can have.
Strengths:
● Allow simple, fast retrieval of complex hierarchical structures
● Great for real-time big data mining
● Can rapidly identify common data points between nodes
● Great for making relevant recommendations and allowing for rapid querying
of those relationships
Weaknesses:
● Cannot adequately store transactional data
● Analysts must learn new languages to query the data
● Performing analytics on the data may not be as efficient as with other
database types.
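As a toy illustration of the node-and-edge model in plain Python (the people, products, and relationships are invented); a real graph database would expose the same traversal through its own query language:

```python
# Nodes hold entities; edges hold relationships, each with a start, an end, a type, and a direction.
nodes = {"ana": {"kind": "person"}, "bo": {"kind": "person"}, "doorbell": {"kind": "product"}}
edges = [
    {"start": "ana", "end": "doorbell", "type": "PURCHASED"},
    {"start": "bo", "end": "ana", "type": "FOLLOWS"},
]

# "What did the people Bo follows purchase?" -- a simple relationship traversal.
followed = {e["end"] for e in edges if e["start"] == "bo" and e["type"] == "FOLLOWS"}
purchases = [e["end"] for e in edges if e["start"] in followed and e["type"] == "PURCHASED"]
print(purchases)  # ['doorbell']
```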
Data veracity is the degree to which data is accurate, precise, and trusted.
Domain integrity is the process of ensuring that the data being entered into a field matches the data type defined for that field.
Entity integrity is the process of ensuring that the values stored within a field match the constraints defined for that field.
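As a small, invented illustration of how those two checks might look in application code:

```python
def satisfies_domain_integrity(value, expected_type):
    """Domain integrity: the value entered into the field matches the field's data type."""
    return isinstance(value, expected_type)

def satisfies_entity_integrity(value, constraint):
    """Entity integrity (as defined above): the stored value meets the field's constraint."""
    return constraint(value)

print(satisfies_domain_integrity("2024-01-31", str))                  # True: date stored as text
print(satisfies_entity_integrity(150, lambda age: 0 <= age <= 120))   # False: violates the age constraint
```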
BASE consistency is most concerned with the rapid availability of data. BASE consistency is commonly implemented for NoSQL databases, in distributed systems, and on unstructured data stores. To ensure availability, changes to data are made available immediately on the instance where the change was made. However, it may take time for that change to be replicated across all instances. Eventually, the change will be fully consistent across all instances. BASE is an acronym for Basically Available, Soft state, Eventually consistent. It is a method for maintaining consistency and integrity in a structured or semistructured database.
BASE compliance
BASE supports data integrity in non-relational databases. Non-relational databases such as DynamoDB still use transactions for processing requests. These databases favor availability over strict consistency of the data. To ensure the data is highly available, changes to data are made available immediately on the instance where the change was made. However, it may take time for that change to be replicated across the fleet of instances. The aim is that the change will eventually be fully consistent across the fleet.
The BA in BASE stands for basically available. This lets one instance of the database
receive a new record—or update an existing record—and make that change available
immediately for that instance. As this change is replicated across all the other instances,
the other instances will eventually become consistent.
In an ACID system, the change would not become available until all instances were
consistent. That's the key. In a BASE system, complete consistency is traded for
immediate availability. Eventually, you get complete consistency and availability in both
consistency models. The difference is in which one comes first.
The S in BASE stands for soft state. In a BASE system, there are allowances for partial
consistency across distributed instances. For this reason, BASE systems are
considered to be in a soft state, also known as a changeable state. To contrast this, in
an ACID system, the database is considered to be in a hard state because users can
only access data that is fully consistent.
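Here is a toy simulation of that trade-off, with invented replicas and values: the write is visible immediately on the instance that accepted it and reaches the other instances a moment later.

```python
import threading
import time

replicas = [{"balance": 100}, {"balance": 100}, {"balance": 100}]

def write(value):
    replicas[0]["balance"] = value        # basically available: the accepting instance updates now

    def replicate():
        time.sleep(0.1)                   # soft state: the other replicas lag briefly
        for replica in replicas[1:]:
            replica["balance"] = value

    threading.Thread(target=replicate).start()

write(250)
print([r["balance"] for r in replicas])   # likely [250, 100, 100] right after the write
time.sleep(0.2)
print([r["balance"] for r in replicas])   # eventually consistent: [250, 250, 250]
```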
Amazon EMR is a more hands-on approach to creating your data pipeline. This
service provides a robust data collection and processing platform. Using this
service requires you to have strong technical knowledge and know-how on your
team. The upside of this is that you can create a more customized pipeline to fit
your business needs. Additionally, your infrastructure costs may be lower than
running the same workload on AWS Glue.
AWS Glue is a serverless, managed ETL tool that provides a much more
streamlined experience than Amazon EMR. This makes the service great for
simple ETL tasks, but you will not have as much flexibility as with Amazon EMR.
You can also use AWS Glue as a metastore for your final transformed data by
using the AWS Glue Data Catalog. This catalog is a drop-in replacement for a
Hive metastore.
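As a hedged sketch of reading that catalog from code with boto3, where the database name is hypothetical:

```python
import boto3

glue = boto3.client("glue")  # assumes credentials and a region are configured

# Browse the AWS Glue Data Catalog much as you would browse a Hive metastore:
# each table entry records its schema and where the underlying data lives (often Amazon S3).
for table in glue.get_tables(DatabaseName="sales_lake")["TableList"]:  # hypothetical database name
    location = table.get("StorageDescriptor", {}).get("Location")
    print(table["Name"], location)
```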
There are five types of analysis: descriptive, diagnostic, predictive, prescriptive, and
cognitive. Let’s begin with descriptive analysis, often called data mining. This form is the
oldest and most common. This method involves aggregating or comparing historic
values to answer the question “What happened?” or “What is happening?” For instance,
what were doorbell sales last month? Or what has been the highest grossing cat food of
the fourth quarter so far? This form of analysis provides insights and requires a large
amount of human judgment to turn those insights into actionable information.
Michelle: The next form of analysis is diagnostic. This method involves comparing
historic data, often gathered in descriptive analysis, to other data sets to answer the
question “Why did it happen?” With this information, you can answer questions such as
why did our social media impressions dip last month? Or what is the likely cause of the
increase in customer complaints?
As with descriptive analysis, this form of analysis provides insights and requires human
judgment to turn those insights into actionable information.
Blaine: The next form of analysis is predictive. This method involves predicting what
might happen in the future based on what happened in the past. For instance, what are
sales likely to be in 2020 based on our current rate of growth? What is the total number
of new subscriptions we can expect based on last year’s trajectory? This form of
analysis provides insights called predictions. These predictions also require human
judgment to evaluate the validity, ensuring they are realistic.
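As a tiny worked illustration of that kind of projection, with invented sales figures, a straight-line trend fitted to past years can be extended one year forward; a person still judges whether the result is realistic:

```python
# Fit a straight line to historical sales and extend it one year ahead.
years = [2016, 2017, 2018, 2019]
sales = [120, 150, 185, 210]  # invented yearly sales, in thousands

n = len(years)
mean_x, mean_y = sum(years) / n, sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales)) / sum(
    (x - mean_x) ** 2 for x in years
)
intercept = mean_y - slope * mean_x

print(round(slope * 2020 + intercept))  # projected 2020 sales, in thousands
```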
Michelle: The next form of analysis is prescriptive. This is where the action really heats
up. This method involves looking at historic data and predictions to answer the question
“What should be done?” Now, this form of analysis deviates from the three previous in
that it requires applications to incorporate rules and constraints to make intelligent
recommendations. The greatest advantage of this form of analysis is that it can be
automated. Applications that implement prescriptive analysis can make
recommendations, or decisions, and take action based on those recommendations.
Machine learning makes this method of analysis possible. Machine learning models
perform the analysis. These models become more accurate through a process called
training. During training, the application runs the data through the rules and constraints
several times. This refines the model’s ability to make accurate recommendations. For
example, Amazon.com recommends products based upon a customer’s purchase
history.
The only time human involvement is required for prescriptive analysis is when you’re
building and training the model. Once the model is in production, it can run independently of human interaction. This form of analysis provides insights and recommendations.
Blaine: The final form of analysis is cognitive. This form of analysis uses a type of
artificial intelligence called deep learning. Deep learning, along with prescriptive
analysis, is used to make decisions and take action based upon visual, auditory, or
even natural language inputs. Deep learning mimics human judgment by combining
existing data and patterns to draw conclusions. With each analysis, the results feed
back into a knowledge database to inform future decisions. This creates a self-learning
feedback loop. Think of Amazon Alexa—the more questions you ask Alexa, the more
the system learns from you.