
Analysis is a detailed examination of something in order to understand its nature or determine its essential features. Data analysis is the process of compiling, processing, and analyzing data so that you can use it to make decisions.

Analytics is the systematic analysis of data. Data analytics is the specific analytical process being applied.

Effective data analysis solutions require both storage and the ability to analyze data in near real time, with low latency, while yielding high-value returns.

Planning a data analysis solution


Know where your data comes from
The majority of data ingested by data analysis solutions comes from existing
on-premises databases and file stores. This data is often in a state where the required
processing within the solution will be minimal.

Streaming data is a source of business data that is gaining popularity. This data source
is less structured. It may require special software to collect the data and specific
processing applications to correctly aggregate and analyze it in near real-time.

Public data sets are another source of data for businesses. These include census data,
health data, population data, and many other datasets that help businesses understand
the data they are collecting on their customers. This data may need to be transformed
so that it will contain only what the business needs.

Know the options for processing your data


There are many different solutions available for processing your data. There is no
one-size-fits-all approach. You must carefully evaluate your business needs and match
them to the services that will combine to provide you with the required results.

Know what you need to learn from your data


You must be prepared to learn from your data, work with internal teams to optimize
efforts, and be willing to experiment.

It is vital to spot trends, make correlations, and run more efficient and profitable
businesses. It's time to put your data to work.
When businesses have more data than they are able to process and analyze, they have a volume problem.

There are three broad classifications of data source types:

● Structured data is organized and stored in the form of values that are grouped into rows and columns of a table.
● Semistructured data is often stored in a series of key-value pairs that are grouped into elements within a file.
● Unstructured data is not structured in a consistent way. Some data may have structure similar to semistructured data, but other data may contain only metadata.

Amazon S3 concepts
To get the most out of Amazon S3, you need to understand a few simple concepts.
First, Amazon S3 stores data as objects within buckets. An object is composed of a file and any metadata that describes that file. To store an object in Amazon S3, you upload the file you want to store into a bucket. When you upload a file, you can set permissions on the object and add any metadata. Buckets are logical containers for objects. You can have one or more buckets in your account and can control access for each bucket individually. You control who can create, delete, and list objects in the bucket. You can also view access logs for the bucket and its objects and choose the geographical region where Amazon S3 will store the bucket and its contents.
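
To make these concepts concrete, here is a minimal sketch that stores an object with metadata in a bucket, assuming the AWS SDK for Python (boto3); the bucket name, file name, and metadata keys are hypothetical and not taken from this document.

import boto3

s3 = boto3.client("s3")

# Upload a local file into a bucket as an object, attaching descriptive metadata.
s3.upload_file(
    Filename="sales-2023.csv",                        # local file to store
    Bucket="example-analytics-bucket",                # an existing bucket you own
    Key="raw/sales-2023.csv",                         # the object's key inside the bucket
    ExtraArgs={"Metadata": {"department": "sales", "source": "on-premises"}},
)

# List the objects under the same prefix to confirm the upload.
response = s3.list_objects_v2(Bucket="example-analytics-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

Bucket-level access control and the bucket's Region would be configured separately, for example when the bucket is created.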

Data analysis solutions on Amazon S3


Decoupling of storage from compute and data processing

Centralized data architecture


Amazon S3 makes it easy to build a multi-tenant environment, where many users can
bring their own data analytics tools to a common set of data. This improves both cost
and data governance over traditional solutions, which require multiple copies of data to
be distributed across multiple processing platforms.

Although this may require an additional step to load your data into the right tool, using
Amazon S3 as your central data store provides even more benefits over traditional
storage options.

Integration with clusterless and serverless AWS services


Combine Amazon S3 with other AWS services to query and process data.
Amazon S3 also integrates with AWS Lambda serverless computing to run
code without provisioning or managing servers. Amazon Athena can query
Amazon S3 directly using the Structured Query Language (SQL), without the
need for data to be ingested into a relational database.

With all of these capabilities, you only pay for the actual amount of data you process or the compute time you consume.
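
As an illustration of querying Amazon S3 data in place, the following sketch submits a SQL query through Amazon Athena using boto3; the database, table, and results location are hypothetical, and it assumes the table has already been defined over files in S3.

import time
import boto3

athena = boto3.client("athena")

# Start a SQL query that runs directly against files stored in Amazon S3.
query = athena.start_query_execution(
    QueryString="SELECT product, SUM(amount) AS revenue "
                "FROM sales GROUP BY product",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Athena runs asynchronously, so poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=execution_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])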

Standardized Application Programming Interfaces (APIs)


Representational State Transfer (REST) APIs are programming interfaces commonly
used to interact with files in Amazon S3. Amazon S3's RESTful APIs are simple, easy to
use, and supported by most major third-party independent software vendors (ISVs),
including Apache Hadoop and other leading analytics tool vendors. This allows
customers to bring the tools they are most comfortable with and knowledgeable about to
help them perform analytics on data in Amazon S3.

DATA LAKE
A data lake is an architectural concept that helps you manage multiple data types from multiple sources, both structured and unstructured, through a single set of tools.

Let’s break that down. A data lake takes Amazon S3 buckets and organizes them by
categorizing the data inside the buckets. It doesn’t matter how the data got there or
what kind it is. You can store both structured and unstructured data effectively in an
Amazon S3 data lake. AWS offers a set of tools to manage the entire data lake without treating each bucket as a separate, unassociated collection of objects.
Many businesses end up grouping data together into numerous storage locations called
silos. These silos are rarely managed and maintained by the same team, which can be
problematic. Inconsistencies in the way data was written, collected, aggregated, or
filtered can cause problems when it is compared or combined for processing and
analysis.
For example, one team may use the address field to store both the street number and
street name, while another team might use separate fields for street number and street
name. When these datasets are combined, there is now an inconsistency in the way the
address is stored, and it will make analysis very difficult.
But by using data lakes, you can break down data silos and bring data into a single,
central repository that is managed by a single team. That gives you a single, consistent
source of truth.
Because data can be stored in its raw format, you don’t need to convert it, aggregate it,
or filter it before you store it. Instead, you can leave that pre-processing to the system
that processes it, rather than the system that stores it.
In other words, you don’t have to transform the data to make it usable. You keep the
data in its original form, however it got there, however it was written. When you’re
talking exabytes of data, you can’t afford to pre-process this data in every conceivable
way it may need to be presented in a useful state.

Let’s talk about having a single source of truth. When we talk about truth in relation to
data, we mean the trustworthiness of the data. Is it what it should be? Has it been
altered? Can we validate the chain of custody? When creating a single source of truth,
we’re creating a dataset, in this case the data lake, which can be used for all processing
and analytics. The bonus is that we know it to be consistent and reliable. It’s
trustworthy.

A data lake is a centralized repository that allows you to store structured, semistructured, and unstructured data at any scale.

Benefits of a data lake on AWS

● Are a cost-effective data storage solution. You can durably store a nearly unlimited amount of data using Amazon S3.
● Implement industry-leading security and compliance. AWS uses stringent data security, compliance, privacy, and protection mechanisms.
● Allow you to take advantage of many different data collection and ingestion tools to ingest data into your data lake. These services include Amazon Kinesis for streaming data and AWS Snowball appliances for large volumes of on-premises data.
● Help you to categorize and manage your data simply and efficiently. Use AWS Glue to understand the data within your data lake, prepare it, and load it reliably into data stores. Once AWS Glue catalogs your data, it is immediately searchable, can be queried, and is available for ETL processing (see the sketch after this list).
● Help you turn data into meaningful insights. Harness the power of purpose-built analytic services for a wide range of use cases, such as interactive analysis, data processing using Apache Spark and Apache Hadoop, data warehousing, real-time analytics, operational analytics, dashboards, and visualizations.
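
As a sketch of the cataloging step mentioned above, the following example asks AWS Glue to crawl a prefix in an S3 data lake and register the tables it discovers, assuming boto3; the crawler name, IAM role ARN, database name, and S3 path are hypothetical.

import boto3

glue = boto3.client("glue")

# Define a crawler that scans part of the data lake and infers table schemas.
glue.create_crawler(
    Name="raw-sales-crawler",                                  # hypothetical name
    Role="arn:aws:iam::123456789012:role/ExampleGlueRole",     # hypothetical IAM role
    DatabaseName="data_lake_catalog",                          # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://example-analytics-bucket/raw/"}]},
)

# Run the crawler; once it finishes, the discovered tables are searchable,
# can be queried (for example, with Amazon Athena), and are available for ETL.
glue.start_crawler(Name="raw-sales-crawler")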

DATA WAREHOUSE (Amazon Redshift):
Structured data storage is performed through databases. We are going to focus on one specific type of database, a data warehouse, which is one of the most common analytical database solutions. Data warehouses are used as a central system for storing analytical data from multiple sources. Say a company has lots of different databases, because different departments use them to track different things. That's fine, but it makes it difficult to get a good idea of what's happening across all departments.
A data warehouse is a central repository of structured data from many data sources.
This data is transformed, aggregated, and prepared before it is loaded into the data
warehouse. The data within the data warehouse is then used for business reporting and
analysis.

Data warehouses are databases that store transactional data in a format that
accommodates large, complex queries. Warehouses have been the backbone of
business intelligence for decades. From small businesses with a relatively small dataset
to enormous businesses with exabytes of data, you can use data warehousing to make
sense of all kinds of data.
Before we continue talking about what ​Hadoop​ is, let’s define what Hadoop is not. It is
not a database or a replacement for existing data systems. It is not a single application.
Instead, it’s a framework of tools that help you both store and process data.
For now, let’s focus on the storage part of this framework. Hadoop supports rapid data transfers, which means you can speed up the processing time for complex queries.
Whether you use Hadoop on-premises or Amazon EMR, you will use the same tools,
with one major exception: Amazon EMR uses its own file system. And that means you
can use your Amazon S3 data lake as the data store. So there’s no need to copy data
into the cluster, as you would with Hadoop on-premises.

A data warehouse is a central repository of structured data from many data sources. This data is transformed, aggregated, and prepared for business reporting and analysis.
A subset of data from a data warehouse is called a data mart. Data marts only focus on one subject or functional area. A warehouse might contain all relevant sources for an enterprise, but a data mart might store only a single department’s sources. Because data marts are generally a copy of data already contained in a data warehouse, they are often fast and simple to implement.

Data warehouse pros:
● Fast data retrieval
● Curated data sets
● Centralized storage
● Better business intelligence

Data warehouse cons:
● Costly to implement
● Maintenance can be challenging
● Security concerns
● Hard to scale to meet demand
Characteristics of a data warehouse versus a data lake:

● Data. Data warehouse: relational data from transactional systems, operational databases, and line-of-business applications. Data lake: non-relational and relational data from IoT devices, websites, mobile apps, social media, and corporate applications.
● Schema. Data warehouse: designed prior to implementation (schema-on-write). Data lake: written at the time of analysis (schema-on-read).
● Price/performance. Data warehouse: fastest query results using higher-cost storage. Data lake: query results getting faster using low-cost storage.
● Data quality. Data warehouse: highly curated data that serves as the central version of the truth. Data lake: any data, which may or may not be curated (for example, raw data).
● Users. Data warehouse: business analysts. Data lake: data scientists, data developers, and business analysts (using curated data).
● Analytics. Data warehouse: batch reporting, BI, and visualizations. Data lake: machine learning, predictive analytics, data discovery, and profiling.

Apache Hadoop
Hadoop uses a distributed processing architecture, in which a task is mapped to a cluster of commodity servers for processing. Each piece of work distributed to the cluster servers can be run or re-run on any of the servers. The cluster servers frequently use the Hadoop Distributed File System (HDFS) to store data locally for processing. The results of the computation performed by those servers are then reduced to a single output set. One node, designated as the master node, controls the distribution of tasks and can automatically handle server failures.
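
The map-then-reduce flow described above can be sketched with a classic word-count example written for Hadoop Streaming; this is an illustrative sketch, not part of the source material, and the script names are hypothetical. Each cluster server runs the mapper over its local slice of data, Hadoop sorts the emitted key-value pairs by key, and the reducer combines them into a single output set.

# mapper.py: emit a key-value pair for every word in the local input slice.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py: Hadoop delivers the mapped pairs sorted by key, so equal keys
# arrive together and can be summed into one output record per word.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

The same pair of scripts could be submitted to an on-premises cluster or to Amazon EMR with the hadoop-streaming JAR; with EMR, the input and output paths can point at Amazon S3 instead of HDFS.
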
When businesses need rapid insights from the data they are collecting, but the systems in place simply cannot meet the need, there's a velocity problem.

Generally speaking, there are two types of processing: batch processing and stream
processing. These require different architectures and enable different levels of analysis.
Batch processing means processing content in batches. You would use batch
processing when you have a lot of data to process, and you need to process it at certain
intervals—for example, on a schedule or when you reach a certain volume of data. This
kind of processing is performed on datasets like server logs, financial data, fraud
reports, and clickstream summaries.
Stream processing means processing data in a stream—in other words, processing
data that’s generated continuously, in small datasets (measured in kilobytes). You
would use stream processing when you need real-time feedback or continuous insights.
This kind of processing is performed on datasets like IoT sensor data, e-commerce
purchases, in-game player activity, clickstreams, or information from social networks.
Many organizations use both types of processing, stream and batch, on the exact same
dataset. Stream processing is used to get initial insights and real-time feedback, while
batch processing is used to get deep insights from complex analytics. Consider credit card transactions, for example. Have you ever received a text just moments after swiping your card? That is a streaming fraud-prevention alert, a stream process that happens in near real time.
Another process that runs regularly on the same data is the credit card company analyzing the day's fraud-prevention alert trends. Same data, two completely different business needs being met, two different velocities.
Data processing and the challenges associated with it can often be defined by the
velocity at which the data must be collected and processed. Batch processing comes in
two forms: scheduled and periodic.

Scheduled batch processing is a very large volume of data processed on a regular schedule, such as hourly, daily, or weekly. It is generally the same amount of data each time, so the workloads are predictable. Periodic batch processing occurs at random times, on demand. These workloads often run once a certain amount of data has been collected. This can make workloads unpredictable and hard to plan around.

Data acceleration
Another key characteristic of velocity is data acceleration, which means the rate
at which large collections of data can be ingested, processed, and analyzed. Data
acceleration is not constant. It comes in bursts. Take Twitter as an example. Hashtags
can become hugely popular and appear hundreds of times in just seconds, or slow
down to one tag an hour. That's data acceleration in action. Your system must be able
to efficiently handle the peak of hundreds of tags a second and the lows of one tag an
hour.

BATCH DATA PROCESSING:
Batch processing is the execution of a series of programs, or jobs, on one or more computers without manual intervention. Data is collected into batches asynchronously.
The batch is sent to a processing system when specific conditions are met, such as a
specified time of day. The results of the processing job are then sent to a storage
location that can be queried later as needed.

STREAM DATA PROCESSING:
Stream processing is the collection and processing of a constant stream of data. In batch processing solutions, the batch size and velocity are relatively stable. But with stream processing solutions, the amount of data being transferred and the size of the data packets are not always consistent.
Consider the benefits of streaming data. The first benefit is all about control. With streaming data solutions, the default is to decouple the collection system, called the producer, from the processing system, called the consumer. The streaming solution provides a persistent buffer for your incoming data. The data can be processed, and you can send the data at your own rate depending on your needs. Second, each of the stream producers can write their data to the same endpoint, allowing multiple streams of disparate data to be combined into a single stream for processing. For example, in an IoT solution, a million devices can write their unique data into the same endpoint easily. Third is the ability to preserve the ordering of data. It can be vital to preserve the sequence of events within a stream. For example, the producer sends data in the order 1, 2, 3, 4, and the consumer receives the data in the same order 1, 2, 3, 4.
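
A minimal producer sketch, assuming boto3 and an existing Amazon Kinesis data stream (the stream name, device ID, and payload are hypothetical). Records written with the same partition key land on the same shard, which is what lets the consumer read them back in the 1, 2, 3, 4 order described above.

import json
import boto3

kinesis = boto3.client("kinesis")

# Producers write to the same endpoint: the named stream.
for sequence in [1, 2, 3, 4]:
    kinesis.put_record(
        StreamName="example-sensor-stream",
        Data=json.dumps({"device_id": "sensor-42", "event": sequence}).encode("utf-8"),
        PartitionKey="sensor-42",   # same key keeps these records in order on one shard
    )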

Remember, the consumer is the service that will process the incoming stream. Parallel consumption of data is one of the greatest benefits where velocity is concerned. Parallel consumption lets multiple consumers work simultaneously on the same data to build parallel applications on top of the data and manage their own time frames.

AMAZON KINESIS:
Amazon Kinesis has four capabilities: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics, and Amazon Kinesis Video Streams. Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
In an example architecture, sensor data is collected in the form of a stream by Amazon Kinesis Data Firehose. This service is configured to send the data to Amazon Kinesis Data Analytics, which filters the stream for relevant records and sends them into another Kinesis Data Firehose process, which places the results into an Amazon S3 bucket at the serving layer.
In parallel, a second process selects relevant records from the original Kinesis Data Firehose and places them in an Amazon S3 bucket at the serving layer. All of the non-relevant records in this stream are simply discarded. Now that the relevant data is in Amazon S3, you can use Amazon Athena to query those records and produce insightful dashboards and reports with Amazon QuickSight.
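
For the collection step in this architecture, a device-side producer might write readings into the Kinesis Data Firehose delivery stream like this (a boto3 sketch; the delivery stream name and record fields are hypothetical):

import json
import boto3

firehose = boto3.client("firehose")

# Send one sensor reading into the delivery stream; Firehose buffers the
# records and delivers them to the configured destination for processing.
firehose.put_record(
    DeliveryStreamName="example-sensor-firehose",
    Record={"Data": (json.dumps({"device_id": "sensor-42", "temp_c": 21.7}) + "\n").encode("utf-8")},
)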

1. Structured: Structured data is stored in a tabular format, often within a database management system (DBMS). This data is organized based on a relational data model, which defines and standardizes data elements and their relation to one another. Data is stored in rows, with each row representing a single instance of a thing (for example, a customer). These rows are well understood due to the table schema, which explains what each field in the table represents. This makes structured data easy to query. The downside to structured data is its lack of flexibility. Let’s say that you have decided you want to track the age of your customers. You must reconfigure the schema to allow for this new data, and you must account for all records that don’t have a value for this new field. It is not impossible, but it can be a very time-consuming process. Examples of structured data applications include Amazon RDS, Amazon Aurora, MySQL, MariaDB, PostgreSQL, Microsoft SQL Server, and Oracle.
2. Semi: Semistructured data is stored in the form of elements within a file. This data is organized based on elements and the attributes that define them. It doesn't conform to data models or schemas. Semistructured data is considered to have a self-describing structure. Each element is a single instance of a thing, such as a conversation. The attributes within an element define the characteristics of that conversation. Each conversation element can track different attributes. This makes semistructured data quite flexible and able to scale to meet the changing demands of a business much more rapidly than structured data. The trade-off is with analytics. It can be more difficult to analyze semistructured data when analysts cannot predict which attributes will be present in any given data set. Examples of semistructured data stores include CSV, XML, JSON, Amazon DynamoDB, Amazon Neptune, and Amazon ElastiCache.
3. Unstructured: Unstructured data is stored in the form of files. This data doesn't conform to a predefined data model and isn't organized in a predefined manner. Unstructured data can be text-heavy, photographs, audio recordings, or even videos. Unstructured data is full of irrelevant information, which means the files need to be preprocessed to perform meaningful analysis. This can be done in many ways. For example, services can add tags to the data based on rules defined for the types of files. The data can also be cataloged to make it available to query services. Examples of unstructured data include emails, photos, videos, clickstream data, Amazon S3, and Amazon Redshift Spectrum (a short illustration of the three forms follows this list).
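
As a short, hypothetical illustration (not from the source), here is the same kind of customer information expressed in each of the three forms:

# Structured: a row that must match a fixed table schema,
# e.g. CREATE TABLE customers (id INT, name VARCHAR(50), city VARCHAR(50))
structured_row = (42, "Jane Doe", "Pune")

# Semistructured: a self-describing element; each element may carry
# different attributes without any schema change.
semistructured_record = {
    "id": 42,
    "name": "Jane Doe",
    "preferences": {"newsletter": True},   # present here, absent in other records
}

# Unstructured: free-form text (or a photo, audio, or video file) with no
# predefined model; it needs preprocessing (tagging, cataloging) before analysis.
unstructured_note = "Customer called about a late delivery and asked for a refund."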

Structured data is hot, immediately ready to be analyzed.

Semistructured data is lukewarm: some data will be ready to go and other data may need to be cleansed or preprocessed. Unstructured data is the frozen ocean, full of exactly what you need but separated by all kinds of stuff you don’t need.

Non-relational databases
Non-relational databases are built to store semistructured and unstructured data in a
way that provides for rapid collection and retrieval. There are several broad categories
of non-relational databases, and data is stored in each to meet specific requirements.

Document stores​ are a type of non-relational database that store semistructured and
unstructured data in the form of files. These files range in form but include JSON,
BSON, and XML. The files can be navigated using numerous languages including
Python and Node.js.

Logically, files contain data stored as a series of elements. Each element is an instance
of a person, place, thing, or event. For instance, the document store may hold a series
of log files from a set of servers. These log files can each contain the specifics for that
system without concern for what the log files in other systems contain.

Strengths:
● Flexibility
● No need to plan for a specific type of data when creating one
● Easy to scale

Weaknesses:
● Sacrifice ACID compliance for flexibility
● Cannot query across files

Key-value databases are a type of non-relational database that store unstructured data
in the form of key-value pairs.

Logically, data is stored in a single table. Within the table, the values are associated
with a specific key. The values are stored in the form of blob objects and do not require
a predefined schema. The values can be of nearly any type.

Strengths:
● Very flexible
● Able to handle a wide variety of data types
● Keys are linked directly to their values with no need for indexing or complex
join operations
● Content of a key can easily be copied to other systems without
reprogramming the data

Weaknesses:
● Impossible to query values because they are stored as a single blob
● Updating or editing the content of a value is quite difficult
● Not all objects are easily modeled as key-value pairs
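
A minimal key-value sketch, assuming boto3 and an existing Amazon DynamoDB table named example-sessions with a partition key session_id (all names are hypothetical):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("example-sessions")

# Store a value under a key; the value needs no predefined schema.
table.put_item(Item={"session_id": "abc-123", "user": "jdoe", "cart_items": 3})

# Retrieve the value directly by its key, with no indexes or joins involved.
response = table.get_item(Key={"session_id": "abc-123"})
print(response.get("Item"))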

GRAPH DATABASES:

Graph databases are purpose-built to store any type of data: structured, semistructured,
or unstructured. The purpose for organization in a graph database is to navigate
relationships. Data within the database is queried using specific languages associated
with the software tool you have implemented.
Logically, data is stored as a node, and edges store information on the relationships
between nodes. An edge always has a start node, end node, type, and direction, and an
edge can describe parent-child relationships, actions, ownership, and the like. There is
no limit to the number and kind of relationships a node can have.

Strengths:
● Allow simple, fast retrieval of complex hierarchical structures
● Great for real-time big data mining
● Can rapidly identify common data points between nodes
● Great for making relevant recommendations and allowing for rapid querying
of those relationships

Weaknesses:
● Cannot adequately store transactional data
● Analysts must learn new languages to query the data
● Performing analytics on the data may not be as efficient as with other
database types.
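
To illustrate the node-and-edge model, here is a tiny, hypothetical sketch in plain Python; a real graph database such as Amazon Neptune would instead be queried with a graph language like Gremlin or SPARQL.

# Each edge has a start node, an end node, and a type (direction is start -> end).
edges = [
    ("alice", "bob", "FOLLOWS"),
    ("bob", "carol", "FOLLOWS"),
    ("alice", "acme-guitar", "PURCHASED"),
    ("carol", "acme-guitar", "PURCHASED"),
]

# Recommendation by navigating relationships: what have the people
# that "bob" follows purchased?
follows = {end for start, end, kind in edges if start == "bob" and kind == "FOLLOWS"}
recommended = {end for start, end, kind in edges if start in follows and kind == "PURCHASED"}
print(recommended)   # {'acme-guitar'}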

VERACITY: When you have data that is ungoverned, coming from numerous, dissimilar systems, and you cannot curate the data in meaningful ways, you know you have a veracity problem.
Curation is the action or process of selecting, organizing, and looking after the items in a collection.

Data integrity is the maintenance and assurance of the accuracy and consistency of data over its entire lifecycle.

Data veracity is the degree to which data is accurate, precise, and trusted.

Data cleansing is the process of detecting and correcting corruptions within data.

Referential integrity is the process of ensuring that the constraints of table relationships are enforced.

Domain integrity is the process of ensuring that the data being entered into a field matches the data type defined for that field.

Entity integrity is the process of ensuring that the values stored within a field match the constraints defined for that field.
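
A small, hypothetical sketch of data cleansing that applies domain and entity integrity checks (the field names and rules are invented for illustration):

# Records arriving from numerous, dissimilar systems.
records = [
    {"customer_id": "C-001", "age": "34"},
    {"customer_id": "C-002", "age": "unknown"},   # violates domain integrity (not a number)
    {"customer_id": "", "age": "29"},             # violates entity integrity (empty key field)
]

def is_valid(record):
    # Entity integrity: the value stored in the key field must meet its constraint.
    if not record["customer_id"]:
        return False
    # Domain integrity: the age field must match the data type defined for it.
    return record["age"].isdigit()

cleansed = [r for r in records if is_valid(r)]
print(cleansed)   # only the first record survives cleansing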

BASE consistency is most concerned with the rapid availability of data. BASE consistency is commonly implemented for NoSQL databases, in distributed systems, and on unstructured data stores. To ensure availability, changes to data are made available immediately on the instance where the change was made. However, it may take time for that change to be replicated across all instances. Eventually, the change will be fully consistent across all instances. BASE is an acronym for Basically Available, Soft state, Eventually consistent. It is a method for maintaining consistency and integrity in a structured or semistructured database.

BASE compliance
BASE supports data integrity in non-relational databases, which are sometimes called NoSQL databases. Non-relational databases like Amazon DynamoDB still use transactions for processing requests. In these databases, the primary concern is availability of the data over consistency of the data. To ensure the data is highly available, changes to data are made available immediately on the instance where the change was made. However, it may take time for that change to be replicated across the fleet of instances. The aim is that the change will eventually be fully consistent across the fleet.

The BA in BASE stands for basically available. This lets one instance of the database
receive a new record—or update an existing record—and make that change available
immediately for that instance. As this change is replicated across all the other instances,
the other instances will eventually become consistent.
In an ACID system, the change would not become available until all instances were
consistent. That's the key. In a BASE system, complete consistency is traded for
immediate availability. Eventually, you get complete consistency and availability in both
consistency models. The difference is in which one comes first.

The S in BASE stands for soft state. In a BASE system, there are allowances for partial
consistency across distributed instances. For this reason, BASE systems are
considered to be in a soft state, also known as a changeable state. To contrast this, in
an ACID system, the database is considered to be in a hard state because users can
only access data that is fully consistent.

Transforming your data – comparing Amazon EMR and AWS Glue
When it comes to performing the data transformation component of ETL, there
are two options within AWS: Amazon EMR and AWS Glue. These two services
provide similar results but require different amounts of knowledge and time
investment.

Amazon EMR​ is a more hands-on approach to creating your data pipeline. This
service provides a robust data collection and processing platform. Using this
service requires you to have strong technical knowledge and know-how on your
team. The upside of this is that you can create a more customized pipeline to fit
your business needs. Additionally, your infrastructure costs may be lower than
running the same workload on AWS Glue.

AWS Glue ​is a serverless, managed ETL tool that provides a much more
streamlined experience than Amazon EMR. This makes the service great for
simple ETL tasks, but you will not have as much flexibility as with Amazon EMR.
You can also use AWS Glue as a metastore for your final transformed data by
using the AWS Glue Data Catalog. This catalog is a drop-in replacement for a
Hive metastore.
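
For a sense of what an AWS Glue job looks like, here is a minimal sketch of a Glue ETL script, assuming a table already registered in the AWS Glue Data Catalog; the database, table, column, and S3 path names are hypothetical.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table that has already been cataloged (for example, by a Glue crawler).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="data_lake_catalog",
    table_name="raw_sales",
)

# A simple transformation: keep only the columns needed downstream.
trimmed = raw.select_fields(["order_id", "order_date", "amount"])

# Write the transformed data back to the data lake in a columnar format.
glue_context.write_dynamic_frame.from_options(
    frame=trimmed,
    connection_type="s3",
    connection_options={"path": "s3://example-analytics-bucket/curated/sales/"},
    format="parquet",
)

The same transformation could be expressed as a Spark job on Amazon EMR, trading the managed, serverless experience for more control over the cluster.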

When you have massive volumes of data used to support a few golden insights, you may be missing the value of your data.

What is data analytics? Data in the absence of meaning is meaningless. Words in a language you don't understand are equally meaningless. It is only when meaning is supplied that the data or words can be understood.
Data analytics comes in two classifications: information analytics and operational analytics. Information analytics is the process of analyzing information to find the value contained within it. It is a broad classification of data analytics that can cover topics from the financial accounting for a business to analyzing the number of entries and exits in a secured building. The second form of analytics is operational analytics. It is quite similar to information analytics; however, it focuses on the digital operations of an organization.

There are five types of analysis: descriptive, diagnostic, predictive, prescriptive, and
cognitive. Let’s begin with descriptive analysis, often called data mining. This form is the
oldest and most common. This method involves aggregating or comparing historic
values to answer the question “What happened?” or “What is happening?” For instance,
what were doorbell sales last month? Or what has been the highest grossing cat food of
the fourth quarter so far? This form of analysis provides insights and requires a large
amount of human judgment to turn those insights into actionable information.
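
As a small, hypothetical sketch of descriptive analysis (the figures are invented), aggregating historic values with pandas answers the "what happened?" question:

import pandas as pd

# Invented historic sales records.
sales = pd.DataFrame({
    "month":   ["2024-05", "2024-05", "2024-06", "2024-06"],
    "product": ["doorbell", "cat food", "doorbell", "cat food"],
    "revenue": [1200, 800, 1500, 950],
})

# "What were doorbell sales last month?" Aggregate and compare historic values.
monthly_totals = sales.groupby(["month", "product"])["revenue"].sum()
print(monthly_totals)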

Michelle​: The next form of analysis is diagnostic. This method involves comparing
historic data, often gathered in descriptive analysis, to other data sets to answer the
question “Why did it happen?” With this information, you can answer questions such as
why did our social media impressions dip last month? Or what is the likely cause of the
increase in customer complaints?
As with descriptive analysis, this form of analysis provides insights and requires human
judgment to turn those insights into actionable information.

Blaine​: The next form of analysis is predictive. This method involves predicting what
might happen in the future based on what happened in the past. For instance, what are
sales likely to be in 2020 based on our current rate of growth? What is the total number
of new subscriptions we can expect based on last year’s trajectory? This form of
analysis provides insights called predictions. These predictions also require human
judgment to evaluate the validity, ensuring they are realistic.

Michelle​: The next form of analysis is prescriptive. This is where the action really heats
up. This method involves looking at historic data and predictions to answer the question
“What should be done?” Now, this form of analysis deviates from the three previous in
that it requires applications to incorporate rules and constraints to make intelligent
recommendations. The greatest advantage of this form of analysis is that it can be
automated. Applications that implement prescriptive analysis can make
recommendations, or decisions, and take action based on those recommendations.

Machine learning makes this method of analysis possible. Machine learning models
perform the analysis. These models become more accurate through a process called
training. During training, the application runs the data through the rules and constraints
several times. This refines the model’s ability to make accurate recommendations. For
example, Amazon.com recommends products based upon a customer’s purchase
history.
The only time human involvement is required for prescriptive analysis is when you’re building and training the model. Once the model is in production, it can run independently of human interaction. This form of analysis provides insights and recommendations.

Blaine​: The final form of analysis is cognitive. This form of analysis uses a type of
artificial intelligence called deep learning. Deep learning, along with prescriptive
analysis, is used to make decisions and take action based upon visual, auditory, or
even natural language inputs. Deep learning mimics human judgment by combining
existing data and patterns to draw conclusions. With each analysis, the results feed
back into a knowledge database to inform future decisions. This creates a self-learning
feedback loop. Think of Amazon Alexa—the more questions you ask Alexa, the more
the system learns from you.

As with prescriptive analysis, human involvement is focused on building and training models. Once the model goes into production, it can run with little human involvement. However, the model thrives off human-based inputs. This form of analysis provides insights and decisions; it can even suggest and perform actions.
