Azure - Azure Data Workloads

This course provides an overview of Azure data workloads and services. It covers different types of data (structured, unstructured, semi-structured), databases (relational like SQL Server and non-relational), and workloads (transactional and analytical). The document discusses each topic in detail through examples to explain the concepts and differences between each type of data, database, and workload to help users choose the appropriate Azure services for building data solutions.


Course Overview

This course is also the first one in the learning path towards the Azure Data Fundamentals
certification by Microsoft. Some of the major topics that we will cover include which types of
data there are and how to recognize them, what relational and non-relational databases are
and when they are used in the field, what transactional and analytical workloads are and why
it is important to know the difference, what the difference is between batch data and
streaming data when performing data analytics, and, of course, for all of these, which
services in Azure you can choose from when looking to build a data solution.

By the end of this course, you'll know the basics of working with data and have the
foundation needed to choose which Azure services you want to learn more about to start
building a data solution. I hope you will join me on this journey to become an Azure data
engineer with this course, Getting Started with Azure Data Workloads, at Pluralsight.
Types of Data, Databases, and Workloads
Introduction
Hello there, and welcome to this course. My name is Henry Been, and it is my pleasure to
be your teacher for this course on Getting Started with Azure Data Workloads. In this first
module, we will be talking about different types of data, databases, and data workloads.

First, we will discuss three different types of data that we can distinguish between:
structured, unstructured, and semi-structured data. Next up are the different types of
databases. Here we will talk about relational and non-relational databases and what the
difference is between the two of them.

Finally, we will discuss data workloads. There are two different types of workloads that we
can differentiate between: transactional and analytical workloads. Now that you know
what this module is all about, let's start with the different types of data.
Types of Data

Let's focus on the different types of data. We usually differentiate between three types
of data. First, we have structured data. When we talk about structured data, we mean data
that can be rendered in the shape of a table. Each row in the table is called a record. The
columns in the table describe the name and data type of the fields of all records in the
table, and each record has to adhere to that description. This description is called a schema.

Next to structured data, there is unstructured data. Unstructured data is the data that has
no notion of fields, labels, or columns, or any other type of structure at all. And thirdly, in
between these two there is semi-structured data.

Semi-structured data is not necessarily tabular in nature, but it has some observable
structure. While that structure is in the data, it is not predefined or enforced using a
schema. Now all this might sound a bit abstract, so let's take a look at some examples of
these types of data. Here you see two tables with students and grades. These tables neatly
fit into the category of structured data.
Typical examples of where you would find structured data like this are CRM and ERP
systems, or, basically, any other type of administrative system that you can think of.

Every new concept or data structure added to the database will get its own table. In these
tables, records are stored that hold the actual information.

Every row holds precisely one record, and each piece of information, each cell, is called a
field. All these records and their fields adhere to a schema.

A possible schema for the Students table would state that the field ID should always hold an
integer number, that the fields FirstName and LastName should always hold a text string,
and that no field can be left out. In structured data sets, each table has a primary key.

A primary key is a field or a set of fields that can be used to uniquely identify each record.
In both these examples, the ID field is the primary key. The use of such a separate ID field
is fairly common.

There can also be a foreign key in the table. Foreign keys are references to primary keys in
other tables and are used to establish a relation between tables. In this example, the
StudentId in the Grades table is a foreign key that points back to the ID field of the
Students table.
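As a rough sketch, this is what such a schema with a primary key and a foreign key could look like in SQL. The table and column names follow the example; the data types and the Grade column itself are assumptions for illustration.

    -- Students: each record has to adhere to this schema.
    CREATE TABLE Students (
        Id        INT          NOT NULL PRIMARY KEY,  -- primary key: uniquely identifies each student
        FirstName NVARCHAR(50) NOT NULL,
        LastName  NVARCHAR(50) NOT NULL
    );

    -- Grades: StudentId is a foreign key pointing back to Students.Id.
    CREATE TABLE Grades (
        Id        INT           NOT NULL PRIMARY KEY,
        StudentId INT           NOT NULL REFERENCES Students (Id),
        Grade     DECIMAL(3, 1) NOT NULL              -- assumed column for the grade itself
    );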
The primary key and the foreign key are defined in the schema for a table.

Let's now explore some examples of unstructured data. Remember, unstructured data is all the data
that has no predefined structure. Typical examples that you would encounter are videos,
images, and audio files. While there can definitely be information in these types of files
(take this course video, for example), it is much harder to interpret that information using
a computer system.

For this reason, unstructured data files are often analyzed using specific tooling to generate
more structured information out of them. For example, a video file like this can be processed to
generate a transcript. Then the transcript is analyzed and divided into 1-minute sections,
and for each section, the subject is determined and stored. Another example is the analysis
of images to identify faces and objects in that picture.

The results of these analyses are structured or semi-structured data that can be processed
more easily going forward. And finally, there is semi-structured data.

A first typical example of semi-structured data is the log file. Log files often follow a format
that starts with some sort of timestamp, followed by more information. Sometimes that
information is just a sentence, but the remainder of a line can also follow some kind of
structure.

Another typical example is data export/import formats. These can be used to transfer data
from one system to another, or to download information and then load it into spreadsheet
software, like Microsoft Excel. One such format is called comma-separated values,
or CSV for short. Another data format is XML, or Extensible Markup
Language. Don't worry too much about the full names.

Everyone refers to them as just CSV or XML. In formats like these, you can observe a clear
structure, and that structure can also be interpreted using computerized systems. But that
structure is not formalized in a schema, so it is not available up front; you only discover
the structure when reading the data.
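The slides with the actual examples are not reproduced in this transcript, but hypothetical snippets of the three kinds of semi-structured data mentioned above (a log file, a CSV export, and an XML document) might look roughly like this; all field names and values are made up for illustration:

    2023-04-01 09:23:11 Application started
    2023-04-01 09:23:14 Student 1 logged in

    Id,FirstName,LastName
    1,Anna,Jansen
    2,Peter,de Vries

    <Students>
      <Student>
        <Id>1</Id>
        <FirstName>Anna</FirstName>
        <LastName>Jansen</LastName>
        <Grades>
          <Grade>8.5</Grade>
          <Grade>7.0</Grade>
        </Grades>
      </Student>
      <Student>
        <Id>2</Id>
        <FirstName>Peter</FirstName>
        <LastName>de Vries</LastName>
        <BirthDate>2001-04-12</BirthDate>
      </Student>
    </Students>

Note how the second student carries a BirthDate field and the first one does not; without a schema, nothing prevents that.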

Semi-structured data is not necessarily tabular in nature. An XML document like the one
above can be rendered as a table, a table of students with student records, but this is not
always the case. Now assume that all the grades a student has received were nested into
that XML; rendering it into a single table would no longer be possible. One of the benefits of
structuring data like this is that it is easy to change the shape of the data over time. Because
there is no schema, there is no reason a new field, like, for example, birthdate, could not be
added to one or more of the records in the XML. Now that you have an understanding of the
different types of data, let's move on and take a look at the different types of databases.

Relational and Non-relational Databases


Depending on the work you need to do, you might use a different database system for each
job. While each database system has its strong and weak points, there are some basic
differences that can help you decide between classes of databases. In general, there is a
distinction between relational and non-relational databases, and then the distinction
between different types of non-relational databases.

Relational databases have been around since the beginning of databases, and up to a few
years ago were the most used type of database. Relational databases are mostly used for
storing tables of structured data. All of the internals of the relational database are optimized
for working with this type of data.

Interacting with the database is mostly done using a language called SQL, often pronounced
sequel, which stands for Structured Query Language. SQL is used for writing data to a table,
as well as for reading data out of a table. SQL is a declarative language, which means that
you do not write instructions on how to execute a query, but you describe the intended
result. The database is then responsible for translating that description into a series of
operations to execute.

While SQL is a standard, every database system has its own dialect of SQL to expose
specific features for that database system. All relational databases provide means for
describing the schema for tables, fields, field types, and relations. The schema for a table is
specified at creation using SQL.

Table and schema creation is one of the areas where differences in SQL dialects are clearly
visible. As relational databases are optimized for working with structured data, they enforce
a specified schema when writing data to a table. Whenever you try to write a record that
does not contain all the mandatory fields, has undefined fields, wrong data types, or any
other type of error, the operation will fail. There are tens, if not hundreds of different
relational database systems, but let's take a look at a few of the more well-known systems.

First up is Microsoft SQL Server. This is a commercial product for which you have to buy
licenses for professional usage. It is a high-performance system, and one of its strong
points is its integration with Microsoft Active Directory. This means that when you already
have Active Directory (AD) running, you can use AD accounts to connect to the server.

There is also a Platform as a Service version of this database in Azure called Azure SQL
DB. MySQL is an open-source and free database system. Its strong point is that it is
relatively easy to install, manage, and learn, and this has made it very popular.

PostgreSQL is another open-source and free database system. Compared to MySQL, it
has more features, but it's also more complex and not as easy to learn. All right, that's
enough slides for now.

Let's take a look at a small demo. In this demo, you will see how to use Azure SQL DB to
create and query a database table for structured data, just for you to get a bit of a taste of
what interacting with a database looks like.

For the purpose of this demo, I have already created a database and connected to it, so we
can get directly to work. A typical workflow for working with any database has two steps.
First you type in a command, and then you get the result. So here I am pasting the
command to create a table called Students with a schema that specifies three columns:
Id, FirstName, and LastName. Now hitting the Run button executes this command, and it
creates the Students table.

We can verify this by refreshing and browsing the list of tables on the left. Now let's type
another command to insert the first student in our table. After executing this command,
let's give one final command to list all students currently in the table.
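The exact statements are not shown in this transcript, but the three demo commands would look roughly like the following T-SQL; the column types and the sample values are assumptions:

    -- Create the Students table with the three columns from the schema.
    CREATE TABLE Students (
        Id        INT          NOT NULL PRIMARY KEY,
        FirstName NVARCHAR(50) NOT NULL,
        LastName  NVARCHAR(50) NOT NULL
    );

    -- Insert the first student.
    INSERT INTO Students (Id, FirstName, LastName)
    VALUES (1, 'Henry', 'Been');

    -- List all students currently in the table.
    SELECT Id, FirstName, LastName
    FROM Students;

Because the schema is enforced on write, an insert that leaves out a mandatory column, such as INSERT INTO Students (Id, FirstName) VALUES (2, 'Anna');, would simply fail.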

You have just seen how to interact with an actual SQL DB. Other courses on Pluralsight will
teach you in depth how to do this on your own, but hopefully this short demo helps you
better understand the theory in this course.

Now let's switch gears and turn our attention to non-relational databases. Non-relational
databases do not store their data in tables, but in collections or containers. In these
collections, they can store arbitrary snippets of data, like an XML or JSON snippet.

Snippets with different shapes or even about different subjects can be mixed within a single
container, as non-relational databases do not enforce any kind of schema.

This model with more freedom has sparked the creation of different types of non-relational
databases, each focusing on specific problem areas. There are four main types of
non-relational databases.
Document databases are used for storing small documents, often formatted as XML or
JSON. Wide-column stores are used for storing data in tables with rows and columns, but
without a schema.

This way, each record can have a different shape. Key-value stores are used for storing any
type of value under a specific locator key.

Key-value stores are incredibly performant and scale very well.

Finally, graph databases can be used for modeling specific problems and for performant
querying of relationships between entities.

Now let's take a look at some of the better known databases that implement these models.
Redis is a high-performance in-memory database that implements the key-value model. It is
often used as a caching system. Cassandra is an open-source and free to use wide-column
store. Cassandra has an architecture that is highly distributed, which makes it resistant
against hardware failure and suitable for running on low-end cheaper hardware.

Thirdly, there is Cosmos DB, a cloud-native database for Azure. It is a document store, graph
database, or a key-value store all in one. It also features built-in support for distributing
data around the world. As this course focuses on Azure data workloads, let's also take a
quick look at how Cosmos DB can be used for storing and querying documents.
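To give a flavor of what that looks like: with Cosmos DB's SQL API, documents are stored as JSON in a container, and a query over a hypothetical container of student documents might look roughly like this:

    SELECT c.id, c.firstName, c.lastName
    FROM c
    WHERE c.lastName = 'Jansen'

Here c is an alias for the documents in the container, and the query returns every document whose lastName field matches, regardless of what other fields each document happens to have.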

Transactional vs. Analytical Workloads


When working with data and databases, there are two types of workloads that you can
distinguish between. First are transactional workloads. These are the kind of workloads that
support the primary process. Systems like a CRM perform a high volume of reads and writes
to support their users getting information out of the system and storing updates. There are
also analytical workloads.

These workloads support users in getting insights out of data. Queries on these systems
are much less frequent, but each query can touch a much larger portion of the database.
Now let's take a look at both workload types in more detail.

Transactional workloads are about record keeping. CRMs, banking, or grade keeping
systems are all systems of record. This means that throughout a single day, many updates
to the system come in, and many users connect to read their portion of the data. As these
systems should answer queries with a level of authority, it is important that these systems
respond fast, often within milliseconds, and that the answers are always correct and
complete.

A workload of this type is called Online Transactional Processing, or OLTP. The word
transactional refers to one of the basic components of such a system, the transaction. A
transaction is an unsplittable set of updates to the database that are bound together and
should be executed as a single whole.
A good example is the transfer of money from one bank account to another. This
transaction consists of two parts: a withdrawal on one account and a deposit on another
account.
And as you can imagine, it is crucial that both parts are executed, or that neither of them
is.
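As a minimal sketch in T-SQL, such a transfer could be wrapped in a transaction like this. The Accounts and Transfers tables, their columns, and the account IDs are assumptions, along the lines of the two-table setup used in the walkthrough below.

    BEGIN TRANSACTION;

        -- Record the transfer itself.
        INSERT INTO Transfers (FromAccountId, ToAccountId, Amount)
        VALUES (121, 122, 100.00);

        -- The withdrawal on one account and the deposit on the other.
        UPDATE Accounts SET Balance = Balance - 100.00 WHERE Id = 121;
        UPDATE Accounts SET Balance = Balance + 100.00 WHERE Id = 122;

    -- Either all of these changes become final together...
    COMMIT TRANSACTION;
    -- ...or, if anything goes wrong along the way, none of them do:
    -- ROLLBACK TRANSACTION;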

To support this type of behavior for transactional workloads, databases adhere to the
so-called ACID properties. These properties describe which guarantees the database should
provide to make transactions like this possible. But before we explore these ACID properties, let's
take a quick look at some examples of transactions.

Let's assume that we are starting a bank and have two relational tables in our system, one
with accounts and one with transfers from one account to another.

Now let's see what happens when two transactions come in. First, we start transaction 1 by
inserting a transfer record in the second table. Next, the balance on the first account
mentioned is updated. Now, while this happens, a second transaction starts for which we
also add a new transfer record and update the first account mentioned.

Now we add a second update to the accounts table for the first transaction, updating the
account with the ID 122 as well. Finally, the last update for the second transaction comes
in, and we update the account with ID 126.

As both transactions completed all operations without any errors, we commit both
transactions, and committing a transaction means that it becomes final. In this quick
walkthrough, you have seen how transactions work when there are no complications. But
you can imagine that there can be complications in the real world: loss of power, a system
failure, or something less catastrophic, like two transactions that want to mutate the same
record. The ACID properties describe how a database should handle a situation like this. The
first property is atomicity and states that all transactions should be atomic, or unsplittable.
Either a transaction is committed and all changes are saved, or a transaction is aborted and
no changes are saved. Consistency states that the database should be in a consistent state
before and after each transaction.

Consistent means that all schema and relational requirements are met when a transaction
starts or commits. The isolation property states that all transactions should be executed in
such a way that it appears as if they were not running in parallel, or in other words, one
transaction should not see the intermediate results of another yet uncommitted transaction.

Finally, durability states that once a transaction has committed, the database system should
guarantee that the changes will persist, even if the database system crashes and has to
recover in the future. Of these properties, atomicity and isolation can be difficult to
understand, especially when you consider that transactional workloads can support
thousands of transactions per second.
So let's run through two scenarios that show how this works in more detail, starting with
atomicity.

Continuing with our banking example, let's assume another transfer comes in. We add the
record to the Transfers table and update the balance for the first account mentioned. But
hey, that would put the balance below 0, and that's not allowed. So let's mark it as a
possible problem that needs to be corrected before this transaction can commit. Now we
update Account 2 again, and this is where we would normally commit. But instead of
committing, we have to abort, as one of the accounts is below 0, which is not allowed, and
the database should be in a consistent state after the transaction. Aborting the transaction
means that all updates will be undone. This is atomicity at work.

Let's now take a look at isolation. Again, we have a first transfer coming in, and we update
the first account as requested. The second transfer comes in, but now the bank account we
want to transfer from has already been updated by another transaction. This means that
Transaction 2 cannot continue and has to wait for Transaction 1 to either commit or abort
before it can continue. From here on, you can see Transaction 1 going through the same
steps as before until it commits. Only after this commit can the second transaction continue
and update first Account 1 and then Account 2, until it can finally commit. This is how
isolation works. It guarantees that no matter how many transactions run at the same time,
they will never interfere with each other.
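Under the default isolation behavior of a relational database like SQL Server, this shows up as the second transaction simply waiting on the locked row. A minimal sketch, using the same assumed Accounts table as before, with statements issued from two separate sessions:

    -- Session 1 (Transaction 1)
    BEGIN TRANSACTION;
    UPDATE Accounts SET Balance = Balance - 100.00 WHERE Id = 121;

    -- Session 2 (Transaction 2), started while Transaction 1 is still open
    BEGIN TRANSACTION;
    UPDATE Accounts SET Balance = Balance - 50.00 WHERE Id = 121;
    -- ...this statement waits until Transaction 1 commits or aborts.

    -- Session 1
    COMMIT TRANSACTION;   -- only now does the update in Session 2 go through

    -- Session 2
    UPDATE Accounts SET Balance = Balance + 50.00 WHERE Id = 122;
    COMMIT TRANSACTION;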
After this discussion of transactional workloads, let's talk a bit about analytical workloads.
Analytical workloads are focused on situations where you have a lower volume of reads, but
often over a much larger amount of data. There are two approaches to this type of workload:
either using batch-based processes, or using streaming data. When you use batch processes
and query the data after processing, this is called Online Analytical Processing, or OLAP. This
differs in approach from streaming data, where the data under investigation is not stored at
all, but only analyzed as it comes in, calculating the results needed. The second module of
this course is all about analytical processing using either batch processes or streaming data.
But first, let's summarize what you have learned so far.
Summary
Let's recap what you have learned in this first module. When working with data, we
generally recognize three types of data. Structured data is all the data that is organized into
tables and follows a predefined schema. Unstructured data is the data that does not have a
structure or internal organization, like a video or a photograph. Finally, semi-structured
data is all the data that has an observable structure, but not a predefined schema.

Structured data and semi-structured data can be stored in databases. Relational databases
are for storing data that follows a predefined schema and is organized into tables, like
structured data. Non-relational databases are better suited for storing semi-structured data.

Finally, you learned that there are two types of data workloads that we can distinguish
between: transactional workloads to support primary processes and analytical workloads
that are optimized for answering business intelligence questions. In the next module, we
will explore the topic of analytical workloads in detail, so see you there.

Batch Data and Streaming Data


Introduction

Hello, and welcome back to the course, Getting Started with Azure Data Workloads. My
name is Henry Been, and it is my pleasure to be your teacher in this second module, Batch
Data and Streaming Data. In this module, we will talk about two different approaches to
working with data. First up is a discussion of both batch data and streaming data to build an
understanding of how each approach works. Next, we will take a look at a company called
Globoticket. Globoticket sells tickets for concerts and other activities. At Globoticket, they
have an analytical question that they want answered. First, we will investigate how we can
answer that question using a batching data approach and explore which Azure services can
be used to do so. After this, we will do the same, but now using a streaming data approach.
All right, let's get started.

Batch Data vs. Streaming Data

In this clip, we will take a closer look at batch and streaming data and how they differ. Both
are a type of analytical processing, a process where you want to answer business
intelligence questions using data that is being generated by a transactional workload.

When you do this by moving a large amount of data from one database to another, and
querying for answers in the second database, that is called batch processing. Streaming
data works based on the continuous export of new or changed data from your transactional
system into a query engine that calculates the answer to your queries on the fly. Now let's
zoom in on both approaches to see how they work in more detail, starting with a typical
batch approach.

Let's assume there is a transactional workload that supports the primary process delivering
value to your customers, something like a web shop maybe. Such a workload is called an
OLTP workload, which stands for online transactional processing. But next to this workload,
you also want to run analytical queries to answer questions about how your customers are
using your applications, things that your marketing or sales department might be interested
in. A workload like this is called online analytical processing, or OLAP.

And as these databases can get very, very large, they are often called data warehouses.
Now separating these two workloads out is done for two reasons. First, OLTP and OLAP
often use different schemas for storing the same data. This allows for optimizing the
schema for the type of queries. Secondly, separating the databases ensures that analytical
queries cannot negatively impact the transactional workload by consuming too many
resources.

However, this leaves us with the problem of getting the data from one database into the
other. Luckily, you are not the only one facing this challenge. There are numerous tools
available to facilitate a process like this, and they are called ETL tools, which stands for
extract, transform, and load.

In these tools, you can define a process that extracts the data from one or more databases,
transforms the data from the source model into a model that is more suitable for analytical
querying, and then, finally, loads it into another database.
Such a pipeline often runs on a schedule, for example, every night. Now let's go back to
streaming data. Again, we start with an OLTP workload, but this time there is no OLAP
workload. Instead, the OLTP workload is set up to constantly export all the updates to a
streaming data engine. This engine differs from a database in the sense that it does not
store the data but stores the queries instead. As new data comes in, the data flows to each
and every query, which uses it to calculate or update intermediate results.

Once all query answers are updated, the data is removed from the engine. Storing the
queries instead of the data, rather than the other way around, is the fundamental difference
between a streaming data engine and a database.

Now with this knowledge, let's compare both batch data and streaming data. When you
work with batch data, a new batch is prepared and loaded on a schedule. The consequence
of such a schedule is that the data is not always up to date.

All the data produced by the ETL pipeline is stored in an OLAP warehouse. After loading the
data here, you can query the data in a warehouse as often and with any query you want.
Batch data is very good for combining many data sources into a single data warehouse,
allowing for holistic querying over large data sets.

In contrast, streaming data has the advantage that it provides near real-time answers
to your queries. A streaming data engine stores only the queries and the results, and not the
data that produced those results. This means that a streaming data engine can process a
massive amount of data. The other consequence is that the queries have to be predefined,
which means that you cannot execute new or different queries on data that has already
gone through the engine. Finally, combining many datasets into a single query can be
more difficult when working with streaming data.

Going over this list, streaming data may sound superior to batch data. Still, batch data is a
very valid choice in many cases, not least because it is often more cost efficient to build,
execute, and maintain than a streaming data setup. Processing data in batches can take
advantage of many characteristics of databases. Also, batch data approaches may align
better with existing skills, as they use more traditional database structures.

Now to wrap this clip up, let me point out one important thing. It is not true that batch data
or streaming data has a one-to-one relationship with any specific type of data or database.
You can perfectly mix and match all the types of data and databases that we discussed in
the previous module and use them in both batch and streaming processes.
Case Study: Globoticket
To practice with batch data and streaming data architectures, let's explore a simple use
case. I'd like to introduce you to Globoticket, a ticketing company for concerts, plays, and
other activities. Globoticket runs a web shop where you can buy tickets to all the events
they handle the sales for. All the data for this web shop is stored in a single database
that contains all the data they need for processing orders. Part of the schema for the
database is shown here.

As you might expect, they have a table with information about their customers. In this
case, they record their full name and their birthdate. Of course, they also record all the
orders being placed. A lot of information is stored in the Orders table, but we are only
interested in the total order amount.

Now while Globoticket is doing well, they want to keep expanding their market share, and,
thus, they are interested in some demographic background of their customers. In particular,
they want to investigate if there are particular age groups that bring in more revenue than
others. To answer this question, they are asking for a list with the total order amount for
each age, something like the sketch below. If we can get that information out of their
transactional database, they can use that information to further their business.
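As a sketch of the kind of query that would produce this list (the Customers and Orders table and column names are assumptions based on the schema described above, and the age calculation is deliberately simplified):

    SELECT
        DATEDIFF(YEAR, c.BirthDate, GETDATE()) AS Age,              -- simplified age calculation
        SUM(o.TotalAmount)                     AS TotalOrderAmount
    FROM Customers AS c
    JOIN Orders    AS o ON o.CustomerId = c.Id
    GROUP BY DATEDIFF(YEAR, c.BirthDate, GETDATE())
    ORDER BY Age;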
So let's explore how we can build a solution for this problem using a batch data approach in the next clip.
Sales Reports Using Batch Data
Now let's take a look at how we can answer the question posed by Globoticket to better
understand who their current customers are. In this clip, we are going to answer that
question using a batch data approach. The advantage of using batch data is that it can be
done in such a way that the existing web shop, database, and other processes do not have to
be changed in any way. Also, using batch data, we can realize a fairly quick and efficient
implementation.

Now to build a batch data architecture in Azure, let's start by picking up the diagram from
earlier. From this diagram, I have removed the conceptual naming for all the elements, so we
can now map all the elements needed to existing Globoticket components or to new Azure
services. Let's start on the left, where we have the Globoticket website and the Azure SQL DB
that together run an OLTP workload. If you are looking for an ETL tool in Azure, Azure Data
Factory is one of the options available, and in this case the one that suits best. Azure Data
Factory has built-in support for running on a schedule, so that's covered as well. We can use
another Azure SQL DB or Azure Synapse, depending on the size of our OLAP workload. Azure
Synapse is the data warehousing variant of Azure SQL DB and used to be called Azure SQL
Data Warehouse in the past. Finally, we can use Power BI to run queries against the OLAP
database and visualize the results. Power BI is a Microsoft offering for analyzing and
visualizing data. It comes along with the higher-tier Office 365 licenses.

Now combining all these tools may sound like a lot of work, but as they're all PaaS, Platform
as a Service, or in the case of Power BI, SaaS, Software as a Service, it will take an
experienced engineer only a few hours to set this up. The bulk of the work will be in Azure
Data Factory, so let's take a deeper look at what you would do there. The first thing you will
do is create what is called a data flow, where you describe a series of activities that have to
be performed to get the data from left to right. In this example, you would use the following
four steps. First, you query the information you need from the source databases and tables
into Azure Data Factory. Next, you combine the data from the different sources, which is
called joining. Once you have combined all the data, you select the columns you need to be
passed on to the output. And finally, you write the results to the database on the right.

Here you see a screenshot of what this would look like in the real Azure Data Factory view.
Here you see a data flow where each activity is represented by a square that symbolizes one
operation. See how we built the flow we discussed from left to right: read data, combine
data, select columns, and finally write data. Expanding a flow like this or creating a new one
is actually fairly straightforward. For example, clicking the square at the bottom left allows
you to add more data sources, and clicking the smaller plus signs after each existing activity
allows you to chain another activity after it. If you want to learn more about building data
flows or Data Factory in general, you can take a look at some of the courses at Pluralsight,
for example, the course Building Your First Data Pipeline in Azure Data Factory, by Emilio Melo.

Now switching back to our architecture, the other new part in this diagram is Power BI. Here
you will do two things: first, connect to the database holding the data for analytics, your DB
or data warehouse, and once Power BI is connected, build your visualizations. In Power BI,
that would look like this. First you select the type of visualization you want, and then you
drop the available columns in the table onto either the x-axis or the values axis, and this
will give you the graph you see right here. To learn more about Power BI and how to create
visualizations like this, you can take a look at the course Building Your First Power BI Report,
by Stacia Varga. And this completes our exploration of building a batch data analytics flow.
So let's now do the same using a streaming data approach.
Sales Reports Using Streaming Data
Let's now switch gears again and see how we can also answer Globoticket's business
intelligence question using streaming data. Building an analytical database and an ETL
pipeline for Globoticket was so successful that not long after the batch data implementation,
a new request from management arrives. While they enjoyed the results, they now want to
receive updates whenever a new order is placed, and with this new requirement, we are
forced to re-create our solution, now using streaming data.

Again, I have taken the architectural diagram from the start of this module and removed all
the conceptual names, so that we can put actual Globoticket components and new Azure
offerings in there to build a new solution. Again, we start by putting in the Globoticket
website and the supporting Azure SQL DB on the left. Now for streaming data, we cannot use
Azure Data Factory. Instead, we have to use Azure Stream Analytics. Stream Analytics can
receive streams of data coming in, for example from our OLTP workload, and use them to
update result sets in near real time. However, the question remains, how does the new data
flow from our database into Stream Analytics?

It seems we are in a bit of a pickle here, as there is a mismatch between the services. Azure
SQL DB cannot stream updates into another system, and Stream Analytics is not capable of
polling an Azure SQL DB database for updates. This means that we're going to have to change
the architecture of the original website and its database to allow for a stream of data being
generated out of that solution. Now there are multiple approaches to such a problem, but in
this case, we're going to introduce a messaging system between the website and the
database. The messaging system will receive all the updates from the website and then
forward them to the database. An example of such a messaging system is Azure Event Hubs,
which is a high-performing messaging system that can store and forward messages between
one or more producers and one or more consumers.

With this system in place, we are no longer going to write directly to the database, but
instead send a message to the event hub, and the event hub is capable of sending the same
order not to one, but to two consumers. The first consumer is going to be a small handler
function that picks the message up and stores it in the database, just to make sure that the
original functionality is still there. The other consumer is going to be Stream Analytics.

Now once we have our new order records flowing into Stream Analytics, another issue pops
up. Remember, we have to combine each order record with the record of the customer that
placed that order to get the birthdate, and thus the age, of the customer. As Stream
Analytics does not have all the data, but only the data that flows in, it has to get the
customer data from somewhere else. It actually has to go back to our website database and
read the customer record from there. This is called a lookup on reference data. Reference
data is all the data that is not stored in the stream that is being processed, but other data
that is queried whenever a new record comes in.
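A Stream Analytics job is defined by a SQL-like query. A rough sketch of what the Globoticket query might look like, with hypothetical input, output, and reference data names, is:

    SELECT
        DATEDIFF(year, c.BirthDate, System.Timestamp()) AS Age,    -- simplified age calculation
        SUM(o.TotalAmount)                              AS TotalOrderAmount
    INTO [powerbi-output]                 -- output alias, for example a Power BI dataset
    FROM [orders-input] AS o              -- stream of new order messages, for example from Event Hubs
    JOIN [customers-reference] AS c       -- reference data lookup on the customer
        ON o.CustomerId = c.Id
    GROUP BY
        DATEDIFF(year, c.BirthDate, System.Timestamp()),
        TumblingWindow(minute, 1)         -- emit updated results every minute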
Once this final communication line is also in place, Stream Analytics can finally produce the
results we wanted to generate, and these results can again be consumed in, for example,
Power BI. Unfortunately, it is not possible to take a meaningful screenshot from Stream
Analytics and walk through it with you. If you want to learn more about working with
Stream Analytics, I highly recommend taking a look at another course in the Pluralsight
library, for example, the course Understanding Stream Analytics, by Alan Smith. And with
that, we have also built a streaming analytics approach for answering Globoticket's question
and have completed the walkthrough of a complete sample architecture for both batch and
streaming data, so let's meet in the next clip to wrap up this module.
Summary
To wrap this module up, let's recap what you have learned. You have learned about batch
data, an approach where you build ETL processes to extract data from one or more databases
running an OLTP workload and move it into a data warehouse for running an OLAP workload.
Batch data is often the easiest way to get started with analytical queries. It also supports
efficient pre-processing and model alteration from within the ETL process. This also makes it
suitable for combining data from multiple sources.

You have also learned about streaming data, another approach to analytics. Streaming data
engines do not store the data; instead, they store the queries. Data is forwarded to the
engine for processing, after which it is discarded. Streaming data approaches are useful when
it is necessary to provide near real-time results. Streaming data can also be the best way
forward when data velocity is so high that it is no longer feasible to store all the data.

And with the completion of this module, you have also completed this course. So thank you
for your attention, and I hope you will venture on into the world of data.
