Microsoft Azure Data Fundamentals
Over the last few decades, the amount of data generated by systems, applications, and
devices has increased significantly. Data is everywhere, in a multitude of structures and
formats.
Data is now easier to collect and cheaper to store, making it accessible to nearly every
business. Data solutions include software technologies and platforms that can help
facilitate the collection, analysis, and storage of valuable information. Every business
would like to grow their revenues and make larger profits. In this competitive market,
data is a valuable asset. When analyzed properly, data provides a wealth of useful
information and informs critical business decisions.
The capability to capture, store, and analyze data is a core requirement for every
organization in the world. In this module, you'll learn about options for representing and
storing data, and about typical data workloads. By completing this module, you'll build
the foundation for learning about the techniques and services used to work with data.
Learning objectives
In this module you will learn how to:
Structured data
Structured data is data that adheres to a fixed schema, so all of the data has the same
fields or properties. Most commonly, the schema for structured data entities is tabular -
in other words, the data is represented in one or more tables that consist of rows to
represent each instance of a data entity, and columns to represent attributes of the
entity. For example, the following image shows tabular data representations
for Customer and Product entities.
Structured data is often stored in a database in which multiple tables can reference one
another by using key values in a relational model, which we'll explore in more depth
later.
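For example, a minimal SQL sketch of this idea (the table and values are purely illustrative, not taken from the image) defines a Product entity as a table whose columns are attributes and whose rows are individual instances:
SQL
-- Illustrative only: each column is an attribute of the Product entity,
-- and each inserted row is one instance of that entity.
CREATE TABLE Product
(
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(50),
    ListPrice   DECIMAL(10,2)
);

INSERT INTO Product (ProductID, ProductName, ListPrice)
VALUES (1, 'Widget', 2.99),
       (2, 'Gadget', 3.49);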
Semi-structured data
Semi-structured data is information that has some structure, but which allows for some
variation between entity instances. For example, while most customers may have an
email address, some might have multiple email addresses, and some might have none
at all.
One common format for semi-structured data is JavaScript Object Notation (JSON). The
example below shows a pair of JSON documents that represent customer information.
Each customer document includes address and contact information, but the specific
fields vary between customers.
JSON
// Customer 1
{
  "firstName": "Joe",
  "lastName": "Jones",
  "address": {
    "streetAddress": "1 Main St.",
    "city": "New York",
    "state": "NY",
    "postalCode": "10099"
  },
  "contact": [
    {
      "type": "home",
      "number": "555 123-1234"
    },
    {
      "type": "email",
      "address": "[email protected]"
    }
  ]
}

// Customer 2
{
  "firstName": "Samir",
  "lastName": "Nadoy",
  "address": {
    "streetAddress": "123 Elm Pl.",
    "unit": "500",
    "city": "Seattle",
    "state": "WA",
    "postalCode": "98999"
  },
  "contact": [
    {
      "type": "email",
      "address": "[email protected]"
    }
  ]
}
Note
JSON is just one of many ways in which semi-structured data can be represented. The
point here is not to provide a detailed examination of JSON syntax, but rather to
illustrate the flexible nature of semi-structured data representations.
Unstructured data
Not all data is structured or even semi-structured. For example, documents, images,
audio and video data, and binary files might not have a specific structure. This kind of
data is referred to as unstructured data.
Data stores
Organizations typically store data in structured, semi-structured, or unstructured format
to record details of entities (for example, customers and products), specific events (such
as sales transactions), or other information in documents, images, and other formats.
The stored data can then be retrieved for analysis and reporting later.
There are two broad categories of data store in common use:
• File stores
• Databases
The ability to store data in files is a core element of any computing system. Files can be
stored in local file systems on the hard disk of your personal computer, and on
removable media such as USB drives; but in most organizations, important data files are
stored centrally in some kind of shared file storage system. Increasingly, that central
storage location is hosted in the cloud, enabling cost-effective, secure, and reliable
storage for large volumes of data.
The specific file format used to store data depends on a number of factors. Some common formats are shown below.
CSV
FirstName,LastName,Email
Joe,Jones,[email protected]
Samir,Nadoy,[email protected]
JSON
{
  "customers": [
    {
      "firstName": "Joe",
      "lastName": "Jones",
      "contact": [
        {
          "type": "home",
          "number": "555 123-1234"
        },
        {
          "type": "email",
          "address": "[email protected]"
        }
      ]
    },
    {
      "firstName": "Samir",
      "lastName": "Nadoy",
      "contact": [
        {
          "type": "email",
          "address": "[email protected]"
        }
      ]
    }
  ]
}
XML
<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555 123-1234"/>
      <Contact type="email" address="[email protected]"/>
    </ContactDetails>
  </Customer>
  <Customer name="Samir" lastName="Nadoy">
    <ContactDetails>
      <Contact type="email" address="[email protected]"/>
    </ContactDetails>
  </Customer>
</Customers>
When working with data like this, data professionals often refer to the data files
as BLOBs (Binary Large Objects).
Some common optimized file formats you might see include Avro, ORC, and Parquet.
Explore databases
A database is a central system in which data can be stored and queried.
In a simplistic sense, the file system on which files are stored is a kind of database; but
when we use the term in a professional data context, we usually mean a dedicated
system for managing data records rather than files.
Relational databases
Relational databases are commonly used to store and query structured data. The data is
stored in tables that represent entities, such as customers, products, or sales orders.
Each instance of an entity is assigned a primary key that uniquely identifies it; and these
keys are used to reference the entity instance in other tables. For example, a customer's
primary key can be referenced in a sales order record to indicate which customer placed
the order. This use of keys to reference data entities enables a relational database to
be normalized, which in part means the elimination of duplicate data values so that, for
example, the details of an individual customer are stored only once, not for each sales
order the customer places. The tables are managed and queried using Structured Query
Language (SQL), which is based on an ANSI standard, so it's similar across multiple
database systems.
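As a hedged sketch of these ideas (the table and column names are illustrative, not taken from a specific product), the following SQL defines a Customer table with a primary key, a SalesOrder table that references it through a foreign key, and a join that reassembles the related data at query time:
SQL
-- Illustrative schema: customer details are stored once and referenced by key.
CREATE TABLE Customer
(
    CustomerID INT PRIMARY KEY,
    Name       VARCHAR(50)
);

CREATE TABLE SalesOrder
(
    OrderID    INT PRIMARY KEY,
    OrderDate  DATE,
    CustomerID INT REFERENCES Customer(CustomerID)  -- foreign key to Customer
);

-- Joining the tables reassembles the related data without duplicating it.
SELECT o.OrderID, o.OrderDate, c.Name
FROM SalesOrder AS o
JOIN Customer AS c ON o.CustomerID = c.CustomerID;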
Non-relational databases
Non-relational databases are data management systems that don’t apply a relational
schema to the data. Non-relational databases are often referred to as NoSQL databases,
even though some support a variant of the SQL language.
Transactional data processing
A transactional data processing system is what most people consider the primary
function of business computing. A transactional system records transactions that
encapsulate specific events that the organization wants to track. A transaction could be
financial, such as the movement of money between accounts in a banking system, or it
might be part of a retail system, tracking payments for goods and services from
customers. Think of a transaction as a small, discrete, unit of work.
OLTP solutions rely on a database system in which data storage is optimized for both
read and write operations in order to support transactional workloads in which data
records are created, retrieved, updated, and deleted (often referred to
as CRUD operations). These operations are applied transactionally, in a way that ensures
the integrity of the data stored in the database. To accomplish this, OLTP systems
enforce transactions that support so-called ACID (Atomicity, Consistency, Isolation, Durability) semantics.
OLTP systems are typically used to support live applications that process business data -
often referred to as line of business (LOB) applications.
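As a minimal sketch of what this means in practice (the Account table and values are hypothetical), an OLTP workload wraps related changes in a transaction so that either all of them are applied or none are:
SQL
-- Hypothetical bank-transfer transaction: both updates succeed or neither does.
BEGIN TRANSACTION;

UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 1;
UPDATE Account SET Balance = Balance + 100 WHERE AccountID = 2;

COMMIT TRANSACTION;
-- If either update fails, the application issues ROLLBACK TRANSACTION instead,
-- leaving the data unchanged.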
Analytical data processing
Analytical data processing typically uses read-only (or read-mostly) systems that store
vast volumes of historical data or business metrics. Analytics can be based on a
snapshot of the data at a given point in time, or a series of snapshots.
The specific details for an analytical processing system can vary between solutions, but a
common architecture for enterprise-scale analytics looks like this:
1. Operational data is extracted, transformed, and loaded (ETL) into a data lake
for analysis.
2. Data is loaded into a schema of tables - typically in a Spark-based data
lakehouse with tabular abstractions over files in the data lake, or a data
warehouse with a fully relational SQL engine.
3. Data in the data warehouse may be aggregated and loaded into an online
analytical processing (OLAP) model, or cube. Aggregated numeric values
(measures) from fact tables are calculated for intersections
of dimensions from dimension tables. For example, sales revenue might be
totaled by date, customer, and product.
4. The data in the data lake, data warehouse, and analytical model can be
queried to produce reports, visualizations, and dashboards.
Data lakes are common in large-scale data analytical processing scenarios, where a large
volume of file-based data must be collected and analyzed.
Data warehouses are an established way to store data in a relational schema that is
optimized for read operations – primarily queries to support reporting and data
visualization. Data Lakehouses are a more recent innovation that combine the flexible
and scalable storage of a data lake with the relational querying semantics of a data
warehouse. The table schema may require some denormalization of data in an OLTP
data source (introducing some duplication to make queries perform faster).
An OLAP model is an aggregated type of data storage that is optimized for analytical
workloads. Data aggregations are across dimensions at different levels, enabling you
to drill up/down to view aggregations at multiple hierarchical levels; for example to find
total sales by region, by city, or for an individual address. Because OLAP data is pre-
aggregated, queries to return the summaries it contains can be run quickly.
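As a hedged illustration (the FactSales and DimCustomer tables are hypothetical), the kind of aggregation an OLAP model pre-computes resembles the following SQL, which totals a measure from a fact table across levels of a dimension:
SQL
-- Aggregating a measure (SalesAmount) across dimension attributes (Region, City).
SELECT d.Region, d.City, SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
JOIN DimCustomer AS d ON f.CustomerKey = d.CustomerKey
GROUP BY d.Region, d.City;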
Different types of user might perform data analytical work at different stages of the
overall architecture. For example:
• Data scientists might work directly with data files in a data lake to explore
and model data.
• Data Analysts might query tables directly in the data warehouse to produce
complex reports and visualizations.
• Business users might consume pre-aggregated data in an analytical model
in the form of reports or dashboards.
Summary
Data is at the core of most software applications and solutions. It can be represented in
many formats, stored in files and databases, and used to record transactions or to
support analysis and reporting.
Introduction
Over the last decade, the amount of data that systems and devices generate has
increased significantly. Because of this increase, new technologies, roles, and
approaches to working with data are affecting data professionals. Data professionals
typically fulfill different roles when managing, using, and controlling data. In this
module, you'll learn about the various roles that organizations often apply to data
professionals, the tasks and responsibilities associated with these roles, and the
Microsoft Azure services used to perform them.
Learning objectives
In this module you will learn how to:
There's a wide variety of roles involved in managing, controlling, and using data. Some
roles are business-oriented, some involve more engineering, some focus on research,
and some are hybrid roles that combine different aspects of data management. Your
organization may define roles differently, or give them different names, but the roles
described in this unit encapsulate the most common division of tasks and
responsibilities.
The three key job roles that deal with data in most organizations are database administrators, data engineers, and data analysts.
The job roles define differentiated tasks and responsibilities. In some organizations, the
same person might perform multiple roles; so in their role as database administrator
they might provision a transactional database, and then in their role as a data engineer
they might create a pipeline to transfer data from the database to a data warehouse for
analysis.
Database Administrator
The database administrator is also responsible for managing the security of the data in
the database, granting privileges over the data, granting or denying access to users as
appropriate.
Data Engineer
They're also responsible for ensuring that data privacy is maintained in the cloud and
across on-premises and cloud data stores. They own the
management and monitoring of data pipelines to ensure that data loads perform as
expected.
Data Analyst
A data analyst processes raw data into meaningful insights based on identified business
requirements.
Note
The roles described here represent the key data-related roles found in most medium to
large organizations. There are additional data-related roles not mentioned here, such
as data scientist and data architect; and there are other technical professionals that work
with data, including application developers and software engineers.
Microsoft Azure is a cloud platform that powers the applications and IT infrastructure for
some of the world's largest organizations. It includes many services to support cloud
solutions, including transactional and analytical data workloads.
Some of the most commonly used cloud services for data are described below.
Note
This topic covers only some of the most commonly used data services for modern
transactional and analytical solutions. Additional services are also available.
Azure SQL
Azure SQL is the collective name for a family of relational database solutions
based on the Microsoft SQL Server database engine. Specific Azure SQL services include:
Database administrators typically provision and manage Azure SQL database systems to
support line of business (LOB) applications that need to store transactional data.
Data engineers may use Azure SQL database systems as sources for data pipelines that
perform extract, transform, and load (ETL) operations to ingest the transactional data
into an analytical system.
Data analysts may query Azure SQL databases directly to create reports, though in large
organizations the data is generally combined with data from other sources in an
analytical data store to support enterprise analytics.
As with Azure SQL database systems, open-source relational databases are managed by
database administrators to support transactional applications, and provide a data source
for data engineers building pipelines for analytical solutions and data analysts creating
reports.
Azure Cosmos DB
Azure Cosmos DB is a global-scale non-relational (NoSQL) database system
that supports multiple application programming interfaces (APIs), enabling you to store
and manage data as JSON documents, key-value pairs, column-families, and graphs.
Azure Storage
Azure Storage is a core Azure service that enables you to store data in:
Data engineers use Azure Storage to host data lakes - blob storage with a hierarchical
namespace that enables files to be organized in folders in a distributed file system.
Azure Data Factory
Azure Data Factory is an Azure service that enables you to define and
schedule data pipelines to transfer and transform data. You can integrate your pipelines
with other Azure services, enabling you to ingest data from cloud data stores, process
the data using cloud-based compute, and persist the results in another data store.
Azure Data Factory is used by data engineers to build extract, transform, and load (ETL)
solutions that populate analytical data stores with data from transactional systems
across the organization.
Azure Synapse Analytics
Data engineers can use Azure Synapse Analytics to create a unified data analytics
solution that combines data ingestion pipelines, data warehouse storage, and data lake
storage through a single service.
Data analysts can use SQL and Spark pools through interactive notebooks to explore
and analyze data, and take advantage of integration with services such as Azure
Machine Learning and Microsoft Power BI to create data models and extract insights
from the data.
Azure Databricks
Data engineers can use existing Databricks and Spark skills to create analytical data
stores in Azure Databricks.
Data analysts can use the native notebook support in Azure Databricks to query and
visualize data in an easy-to-use web-based interface.
Azure HDInsight
Data engineers can use Azure HDInsight to support big data analytics workloads that
depend on multiple open-source technologies.
Azure Data Explorer
Azure Data Explorer is a standalone service that offers the same high-performance
querying of log and telemetry data as the Azure Synapse Data Explorer
runtime in Azure Synapse Analytics.
Data analysts can use Azure Data Explorer to query and analyze data that includes a
timestamp attribute, such as is typically found in log files and Internet-of-things (IoT)
telemetry data.
Microsoft Purview
Data engineers can use Microsoft Purview to enforce data governance across the
enterprise and ensure the integrity of data used to support analytical workloads.
Microsoft Fabric
Knowledge check
Choose the best response for each of the questions below. Then select Check your answers.
Which role is most likely to use Azure Data Factory to define a data pipeline for an
ETL process?
• Database Administrator
• Data Engineer
• Data Analyst
Which services would you use as a SaaS solution for data analytics?
• Azure SQL Database
• Azure Synapse Analytics
• Microsoft Fabric
Summary
Managing and working with data is a specialist skill that requires knowledge of multiple
technologies. Most organizations define job roles for the various tasks and
responsibilities involved in managing data.
Next steps
Now that you’ve learned about professional data roles and the services they use,
consider learning more about data-related workloads on Microsoft Azure by pursuing a
Microsoft certification in Azure Data Fundamentals.
Explore relational database services in Azure
Introduction
Azure supports multiple database services, enabling you to run popular relational
database management systems, such as SQL Server, PostgreSQL, and MySQL, in the
cloud.
Most Azure database services are fully managed, freeing up valuable time you’d
otherwise spend managing your database. Enterprise-grade performance with built-in
high availability means you can scale quickly and reach global distribution without
worrying about costly downtime. Developers can take advantage of industry-leading
innovations such as built-in security with automatic monitoring and threat detection,
and automatic tuning for improved performance. On top of all of these features, you have
guaranteed availability.
In this module, you'll explore the options available for relational database services in
Azure.
Learning objectives
In this module, you'll learn how to:
Azure SQL is a collective term for a family of Microsoft SQL Server based database
services in Azure. Specific Azure SQL services include:
SQL Server compatibility
• SQL Server on Azure VMs: Fully compatible with on-premises physical and virtualized installations. Applications and databases can easily be "lift and shift" migrated without change.
• Azure SQL Managed Instance: Near-100% compatibility with SQL Server. Most on-premises databases can be migrated with minimal code changes by using the Azure Database Migration Service.
• Azure SQL Database: Supports most core database-level capabilities of SQL Server. Some features depended on by an on-premises application may not be available.
Architecture
• SQL Server on Azure VMs: SQL Server instances are installed in a virtual machine. Each instance can support multiple databases.
• Azure SQL Managed Instance: Each managed instance can support multiple databases. Additionally, instance pools can be used to share resources efficiently across smaller instances.
• Azure SQL Database: You can provision a single database in a dedicated, managed (logical) server, or you can use an elastic pool to share resources across multiple databases and take advantage of on-demand scalability.
Availability
• SQL Server on Azure VMs: 99.99%
• Azure SQL Managed Instance: 99.99%
• Azure SQL Database: 99.995%
Management
• SQL Server on Azure VMs: You must manage all aspects of the server, including operating system and SQL Server updates, configuration, backups, and other maintenance tasks.
• Azure SQL Managed Instance: Fully automated updates, backups, and recovery.
• Azure SQL Database: Fully automated updates, backups, and recovery.
Use cases
• SQL Server on Azure VMs: Use this option when you need to migrate or extend an on-premises SQL Server solution and retain full control over all aspects of server and database configuration.
• Azure SQL Managed Instance: Use this option for most cloud migration scenarios, particularly when you need minimal changes to existing applications.
• Azure SQL Database: Use this option for new cloud solutions, or to migrate applications that have minimal instance-level dependencies.
SQL Server on Azure Virtual Machines
SQL Server running on an Azure virtual machine effectively replicates the database running on
real on-premises hardware. Migrating from the system running on-premises to an Azure
virtual machine is no different than moving the databases from one on-premises server to
another.
This approach is suitable for migrations and applications requiring access to operating system
features that might be unsupported at the PaaS level. SQL virtual machines are lift-and-
shift ready for existing applications that require fast migration to the cloud with minimal
changes. You can also use SQL Server on Azure VMs to extend existing on-premises
applications to the cloud in hybrid deployments.
Note
A hybrid deployment is a system where part of the operation runs on-premises, and part in
the cloud. Your database might be part of a larger system that runs on-premises, although
the database elements might be hosted in the cloud.
You can use SQL Server in a virtual machine to develop and test traditional SQL Server
applications. With a virtual machine, you have the full administrative rights over the DBMS
and operating system. It's a perfect choice when an organization already has IT resources
available to maintain the virtual machines.
• Create rapid development and test scenarios when you don't want to buy on-premises
non-production SQL Server hardware.
• Become lift-and-shift ready for existing applications that require fast migration to the
cloud with minimal changes or no changes.
• Scale up the platform on which SQL Server is running, by allocating more memory, CPU
power, and disk space to the virtual machine. You can quickly resize an Azure virtual
machine without the requirement that you reinstall the software that is running on it.
Business benefits
Running SQL Server on virtual machines allows you to meet unique and diverse business
needs through a combination of on-premises and cloud-hosted deployments, while using the
same set of server products, development tools, and expertise across these environments.
It's not always easy for businesses to switch their DBMS to a fully managed service. There may
be specific requirements that must be satisfied in order to migrate to a managed service that
requires making changes to the database and the applications that use it. For this reason,
using virtual machines can offer a solution, but using them doesn't eliminate the need to
administer your DBMS as carefully as you would on-premises.
Azure SQL Managed Instance
Managed instances depend on other Azure services such as Azure Storage for backups, Azure
Event Hubs for telemetry, Microsoft Entra ID for authentication, Azure Key Vault for
Transparent Data Encryption (TDE) and a couple of Azure platform services that provide
security and supportability features. The managed instances make connections to these
services.
All communications are encrypted and signed using certificates. To check the trustworthiness
of communicating parties, managed instances constantly verify these certificates through
certificate revocation lists. If the certificates are revoked, the managed instance closes the
connections to protect the data.
Use cases
Consider Azure SQL Managed Instance if you want to lift-and-shift an on-premises SQL Server
instance and all its databases to the cloud, without incurring the management overhead of
running SQL Server on a virtual machine.
Azure SQL Managed Instance provides features not available in Azure SQL Database
(discussed below). If your system uses features such as linked servers, Service Broker (a
message processing system that can be used to distribute work across servers), or Database
Mail (which enables your database to send email messages to users), then you should use
managed instance. To check compatibility with an existing on-premises system, you can
install Data Migration Assistant (DMA). This tool analyzes your databases on SQL Server and
reports any issues that could block migration to a managed instance.
Business benefits
Azure SQL Managed Instance enables a system administrator to spend less time on
administrative tasks because the service either performs them for you or greatly simplifies
those tasks. Automated tasks include operating system and database management system
software installation and patching, dynamic instance resizing and configuration, backups,
database replication (including system databases), high availability configuration, and
configuration of health and performance monitoring data streams.
Azure SQL Managed Instance has near 100% compatibility with SQL Server Enterprise Edition,
running on-premises.
Azure SQL Managed Instance supports SQL Server Database engine logins and logins
integrated with Microsoft Entra ID. SQL Server Database engine logins include a username
and a password. You must enter your credentials each time you connect to the server.
Microsoft Entra logins use the credentials associated with your current computer sign-in, and
you don't need to provide them each time you connect to the server.
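As a brief, hedged sketch (the login name, password, and user principal are placeholders), the two kinds of identity are created with different T-SQL statements:
SQL
-- SQL Server Database engine login: a username and password stored by the database engine.
CREATE LOGIN app_login WITH PASSWORD = 'ReplaceWithAStrongPassword!';

-- Microsoft Entra authentication: a contained database user mapped to an existing Entra identity.
CREATE USER [samir@contoso.com] FROM EXTERNAL PROVIDER;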
Azure SQL Database
Note
A SQL Database server is a logical construct that acts as a central administrative point for
multiple single or pooled databases, logins, firewall rules, auditing rules, threat detection
policies, and failover groups.
Single Database
This option enables you to quickly set up and run a single SQL Server database. You create
and run a database server in the cloud, and you access your database through this server.
Microsoft manages the server, so all you have to do is configure the database, create your
tables, and populate them with your data. You can scale the database if you need more
storage space, memory, or processing power. By default, resources are pre-allocated, and
you're charged per hour for the resources you've requested. You can also specify
a serverless configuration. In this configuration, Microsoft creates its own server, which might
be shared by databases belonging to other Azure subscribers. Microsoft ensures the privacy
of your database. Your database automatically scales and resources are allocated or
deallocated as required.
Elastic Pool
This option is similar to Single Database, except that by default multiple databases can share
the same resources, such as memory, data storage space, and processing power through
multiple-tenancy. The resources are referred to as a pool. You create the pool, and only your
databases can use the pool. This model is useful if you have databases with resource
requirements that vary over time, and can help you to reduce costs. For example, your payroll
database might require plenty of CPU power at the end of each month as you handle payroll
processing, but at other times the database might become much less active. You might have
another database that is used for running reports. This database might become active for
several days in the middle of the month as management reports are generated, but with a
lighter load at other times. Elastic Pool enables you to use the resources available in the pool,
and then release the resources once processing has completed.
Use cases
Azure SQL Database gives you the best option for low cost with minimal administration. It
isn't fully compatible with on-premises SQL Server installations. It's often used in new cloud
projects where the application design can accommodate any required changes to your
applications.
Note
You can use the Data Migration Assistant to detect compatibility issues with your databases
that can impact database functionality in Azure SQL Database. For more information,
see Overview of Data Migration Assistant.
• Modern cloud applications that need to use the latest stable SQL Server features.
• Applications that require high availability.
• Systems with a variable load that need the database server to scale up and down quickly.
Business benefits
Azure SQL Database automatically updates and patches the SQL Server software to ensure
that you're always running the latest and most secure version of the service.
The scalability features of Azure SQL Database ensure that you can increase the resources
available to store and process data without having to perform a costly manual upgrade.
The service provides high availability guarantees, to ensure that your databases are available
at least 99.995% of the time. Azure SQL Database supports point-in-time restore, enabling
you to recover a database to the state it was in at any point in the past. Databases can be
replicated to different regions to provide more resiliency and disaster recovery.
Auditing tracks database events and writes them to an audit log in your Azure storage
account. Auditing can help you maintain regulatory compliance, understand database activity,
and gain insight into discrepancies and anomalies that might indicate business concerns or
suspected security violations.
SQL Database helps secure your data by providing encryption that protects data that is stored
in the database (at rest) and while it is being transferred across the network (in motion).
In addition to Azure SQL services, Azure data services are available for other popular
relational database systems, including MySQL, MariaDB, and PostgreSQL. The primary
reason for these services is to enable organizations that use them in on-premises apps to
move to Azure quickly, without making significant changes to their applications.
MySQL started life as a simple-to-use open-source database management system. It's the
leading open source relational database for Linux, Apache, MySQL, and PHP (LAMP) stack
apps. It's available in several editions: Community, Standard, and Enterprise. The
Community edition is available free-of-charge, and has historically been popular as a
database management system for web applications, running under Linux. Versions are
also available for Windows. Standard edition offers higher performance, and uses a
different technology for storing data. Enterprise edition provides a comprehensive set of
tools and features, including enhanced security, availability, and scalability. The Standard
and Enterprise editions are the versions most frequently used by commercial
organizations, although these versions of the software aren't free.
PostgreSQL is a hybrid relational-object database. You can store data in relational tables,
but a PostgreSQL database also enables you to store custom data types, with their own
non-relational properties. The database management system is extensible; you can add
code modules to the database, which can be run by queries. Another key feature is the
ability to store and manipulate geometric data, such as lines, circles, and polygons.
PostgreSQL has its own query language called pgsql. This language is a variant of the
standard relational query language, SQL, with features that enable you to write stored
procedures that run inside the database.
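As a minimal sketch of these capabilities (the table, column, and function names are illustrative), PostgreSQL lets you combine relational columns with geometric types and define functions in its procedural dialect:
SQL
-- Illustrative PostgreSQL example: a relational table with a geometric column,
-- and a simple function written in the procedural dialect (PL/pgSQL).
CREATE TABLE landmark
(
    id       serial PRIMARY KEY,
    name     text NOT NULL,
    location point              -- built-in geometric type holding an (x, y) coordinate
);

CREATE FUNCTION landmark_count() RETURNS bigint AS $$
BEGIN
    RETURN (SELECT count(*) FROM landmark);
END;
$$ LANGUAGE plpgsql;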
Azure Database for MySQL
The server provides connection security to enforce firewall rules and, optionally, require
SSL connections. Many server parameters enable you to configure server settings such as
lock modes, maximum number of connections, and timeouts.
Azure Database for MySQL provides a global database system that scales up to large
databases without the need to manage hardware, network components, virtual servers,
software patches, and other underlying components.
Certain operations aren't available with Azure Database for MySQL. These functions are
primarily concerned with security and administration. Azure manages these aspects of the
database server itself.
You get the following features with Azure Database for MySQL:
The system uses pay-as-you-go pricing so you only pay for what you use.
Azure Database for MySQL servers provides monitoring functionality to add alerts, and to
view metrics and logs.
Azure Database for PostgreSQL
If you prefer PostgreSQL, you can choose Azure Database for PostgreSQL to
run a PaaS implementation of PostgreSQL in the Azure Cloud. This service provides the
same availability, performance, scaling, security, and administrative benefits as the MySQL
service.
The flexible-server deployment option for PostgreSQL is a fully managed database service.
It provides a high level of control and server configuration customizations, and provides
cost optimization controls.
Benefits of Azure Database for PostgreSQL
Azure Database for PostgreSQL is a highly available service. It contains built-in failure
detection and failover mechanisms.
Users of PostgreSQL will be familiar with the pgAdmin tool, which you can use to manage
and monitor a PostgreSQL database. You can continue to use this tool to connect to Azure
Database for PostgreSQL. However, some server-focused functionality, such as performing
server backup and restore, aren't available because the server is managed and maintained
by Microsoft.
Azure Database for PostgreSQL records information about queries run against databases
on the server, and saves them in a database named azure_sys. You query
the query_store.qs_view view to see this information, and use it to monitor the queries that
users are running. This information can prove invaluable if you need to fine-tune the
queries performed by your applications.
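For example, a minimal query against that view might look like the following (run while connected to the azure_sys database mentioned above; the columns returned depend on the view definition):
SQL
-- Inspect recorded query activity; SELECT * returns whatever columns the view exposes.
SELECT *
FROM query_store.qs_view;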
Knowledge check
Choose the best response for each of the questions below. Then select Check your
answers.
1. Which deployment option offers the best compatibility when migrating an existing
SQL Server on-premises solution?
2. Which database service is the simplest option for migrating a LAMP application to
Azure?
Summary
Azure supports a range of database services that you can use to support new cloud
applications or migrate existing applications to the cloud.
Introduction
Most software applications need to store data. Often this takes the form of a
relational database, in which the data is organized in related tables and managed
by using Structured Query Language (SQL). However, many applications don't need
the rigid structure of a relational database and rely on non-relational (often
referred to as NoSQL) storage.
Azure Storage is one of the core services in Microsoft Azure, and offers a range of
options for storing data in the cloud. In this module, you explore the fundamental
capabilities of Azure storage and learn how it's used to support applications that
require non-relational data stores.
Learning objectives
In this module, you learn how to:
Azure Blob Storage
Azure Blob Storage is a service that enables you to store massive amounts of unstructured
data as binary large objects, or blobs, in the cloud. Blobs are an efficient way to store data
files in a format that is optimized for cloud-based storage, and applications can read and
write them by using the Azure blob storage API.
Within a container, you can organize blobs in a hierarchy of virtual folders, similar to files
in a file system on disk. However, by default, these folders are simply a way of using a "/"
character in a blob name to organize the blobs into namespaces. The folders are purely
virtual, and you can't perform folder-level operations to control access or perform bulk
operations.
Blob storage provides three access tiers, which help to balance access latency and storage
cost:
• The Hot tier is the default. You use this tier for blobs that are accessed
frequently. The blob data is stored on high-performance media.
• The Cool tier has lower performance and incurs reduced storage charges
compared to the Hot tier. Use the Cool tier for data that is accessed
infrequently. It's common for newly created blobs to be accessed frequently
initially, but less so as time passes. In these situations, you can create the blob
in the Hot tier, but migrate it to the Cool tier later. You can migrate a blob
from the Cool tier back to the Hot tier.
• The Archive tier provides the lowest storage cost, but with increased latency.
The Archive tier is intended for historical data that mustn't be lost, but is
required only rarely. Blobs in the Archive tier are effectively stored in an offline
state. Typical reading latency for the Hot and Cool tiers is a few milliseconds,
but for the Archive tier, it can take hours for the data to become available. To
retrieve a blob from the Archive tier, you must change the access tier to Hot
or Cool. The blob will then be rehydrated. You can read the blob only when
the rehydration process is complete.
You can create lifecycle management policies for blobs in a storage account. A lifecycle
management policy can automatically move a blob from Hot to Cool, and then to the
Archive tier, as it ages and is used less frequently (policy is based on the number of days
since modification). A lifecycle management policy can also arrange to delete outdated
blobs.
Azure Data Lake Storage Gen2
Azure Data Lake Store (Gen1) is a separate service for hierarchical data storage for
analytical data lakes, often used by so-called big data analytical solutions that work with
structured, semi-structured, and unstructured data stored in files. Azure Data Lake Storage
Gen2 is a newer version of this service that is integrated into Azure Storage; enabling you
to take advantage of the scalability of blob storage and the cost-control of storage tiers,
combined with the hierarchical file system capabilities and compatibility with major
analytics systems of Azure Data Lake Store.
Systems like Hadoop in Azure HDInsight, Azure Databricks, and Azure Synapse Analytics
can mount a distributed file system hosted in Azure Data Lake Store Gen2 and use it to
process huge volumes of data.
To create an Azure Data Lake Store Gen2 file system, you must enable the Hierarchical
Namespace option of an Azure Storage account. You can do this when initially creating
the storage account, or you can upgrade an existing Azure Storage account to support
Data Lake Gen2. Be aware however that upgrading is a one-way process – after upgrading
a storage account to support a hierarchical namespace for blob storage, you can’t revert it
to a flat namespace.
Azure Files
Many on-premises systems comprising a network of in-house computers make use of file
shares. A file share enables you to store a file on one computer, and grant access to that file
to users and applications running on other computers. This strategy can work well for
computers in the same local area network, but doesn't scale well as the number of users
increases, or if users are located at different sites.
Azure Files is essentially a way to create cloud-based network shares, such as you typically
find in on-premises organizations to make documents and other files available to multiple
users. By hosting file shares in Azure, organizations can eliminate hardware costs and
maintenance overhead, and benefit from high availability and scalable cloud storage for files.
You create Azure File storage in a storage account. Azure Files enables you to share up to 100
TB of data in a single storage account. This data can be distributed across any number of file
shares in the account. The maximum size of a single file is 1 TB, but you can set quotas to
limit the size of each share below this figure. Currently, Azure File Storage supports up to
2000 concurrent connections per shared file.
After you've created a storage account, you can upload files to Azure File Storage using the
Azure portal, or tools such as the AzCopy utility. You can also use the Azure File Sync service
to synchronize locally cached copies of shared files with the data in Azure File Storage.
Azure File Storage offers two performance tiers. The Standard tier uses hard disk-based
hardware in a datacenter, and the Premium tier uses solid-state disks. The Premium tier offers
greater throughput, but is charged at a higher rate.
Azure Files supports two common network file sharing protocols:
• Server Message Block (SMB) file sharing is commonly used across multiple
operating systems (Windows, Linux, macOS).
• Network File System (NFS) shares are used by some Linux and macOS versions. To
create an NFS share, you must use a premium tier storage account and create
and configure a virtual network through which access to the share can be
controlled.
Azure Table Storage
Azure Table Storage is a NoSQL storage solution that makes use of tables
containing key/value data items. Each item is represented by a row that contains columns
for the data fields that need to be stored.
However, don't be misled into thinking that an Azure Table Storage table is like a table in
a relational database. An Azure Table enables you to store semi-structured data. All rows
in a table must have a unique key (composed of a partition key and a row key), and when
you modify data in a table, a timestamp column records the date and time the
modification was made; but other than that, the columns in each row can vary. Azure
Table Storage tables have no concept of foreign keys, relationships, stored procedures,
views, or other objects you might find in a relational database. Data in Azure Table storage
is usually denormalized, with each row holding the entire data for a logical entity. For
example, a table holding customer information might store the first name, last name, one
or more telephone numbers, and one or more addresses for each customer. The number
of fields in each row can be different, depending on the number of telephone numbers
and addresses for each customer, and the details recorded for each address. In a relational
database, this information would be split across multiple rows in several tables.
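As a hedged sketch of that contrast (the table and column names are illustrative), a relational design would normalize the same information into several related tables rather than one wide, variable row:
SQL
-- Illustrative normalized design: phone numbers and addresses live in their own tables.
CREATE TABLE Customer
(
    CustomerID INT PRIMARY KEY,
    FirstName  VARCHAR(50),
    LastName   VARCHAR(50)
);

CREATE TABLE CustomerPhone
(
    CustomerID  INT REFERENCES Customer(CustomerID),
    PhoneNumber VARCHAR(20)
);

CREATE TABLE CustomerAddress
(
    CustomerID INT REFERENCES Customer(CustomerID),
    Street     VARCHAR(100),
    City       VARCHAR(50)
);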
To help ensure fast access, Azure Table Storage splits a table into partitions. Partitioning is
a mechanism for grouping related rows, based on a common property or partition key.
Rows that share the same partition key will be stored together. Partitioning not only helps
to organize data, it can also improve scalability and performance in the following ways:
• Partitions are independent from each other, and can grow or shrink as rows
are added to, or removed from, a partition. A table can contain any number of
partitions.
• When you search for data, you can include the partition key in the search
criteria. This helps to narrow down the volume of data to be examined, and
improves performance by reducing the amount of I/O (input and output
operations, or reads and writes) needed to locate the data.
The key in an Azure Table Storage table comprises two elements: the partition key that
identifies the partition containing the row, and a row key that is unique to each row in the
same partition. Items in the same partition are stored in row key order. If an application
adds a new row to a table, Azure ensures that the row is placed in the correct position in
the table. This scheme enables an application to quickly perform point queries that
identify a single row, and range queries that fetch a contiguous block of rows in a
partition.
Summary
Azure Storage is a key service in Microsoft Azure, and enables a wide range of data
storage scenarios and solutions.
Introduction
Relational databases store data in relational tables, but sometimes the structure
imposed by this model can be too rigid, and often leads to poor performance
unless you spend time implementing detailed tuning. Other models, collectively
known as NoSQL databases, exist. These models store data in other structures, such
as documents, graphs, key-value stores, and column family stores.
Azure Cosmos DB is a highly scalable cloud database service for NoSQL data.
Learning objectives
In this module, you'll learn how to:
Describe Azure Cosmos DB
Cosmos DB uses indexes and partitioning to provide fast read and write
performance and can scale to massive volumes of data. You can enable multi-
region writes, adding the Azure regions of your choice to your Cosmos DB account
so that globally distributed users can each work with data in their local replica.
Cosmos DB is highly suitable for the following scenarios:
• IoT and telematics. These systems typically ingest large amounts of data
in frequent bursts of activity. Cosmos DB can accept and store this
information quickly. The data can then be used by analytics services,
such as Azure Machine Learning, Azure HDInsight, and Power BI.
Additionally, you can process the data in real-time using Azure
Functions that are triggered as data arrives in the database.
• Retail and marketing. Microsoft uses Cosmos DB for its own e-
commerce platforms that run as part of Windows Store and Xbox Live.
It's also used in the retail industry for storing catalog data and for
event sourcing in order processing pipelines.
• Gaming. The database tier is a crucial component of gaming
applications. Modern games perform graphical processing on
mobile/console clients, but rely on the cloud to deliver customized and
personalized content like in-game stats, social media integration, and
high-score leaderboards. Games often require single-millisecond
latencies for reads and write to provide an engaging in-game
experience. A game database needs to be fast and be able to handle
massive spikes in request rates during new game launches and feature
updates.
• Web and mobile applications. Azure Cosmos DB is commonly used
within web and mobile applications, and is well suited for modeling
social interactions, integrating with third-party services, and for
building rich personalized experiences. The Cosmos DB SDKs can be
used to build rich iOS and Android applications using the popular
Xamarin framework.
For additional information about uses for Cosmos DB, read Common Azure Cosmos
DB use cases.
Azure Cosmos DB for NoSQL
A SQL query for an Azure Cosmos DB database containing customer data might
look similar to this:
SQL
SELECT *
FROM customers c
WHERE c.id = "[email protected]"
The result of this query consists of one or more JSON documents, as shown here:
JSON
{
  "id": "[email protected]",
  "name": "Joe Jones",
  "address": {
    "street": "1 Main St.",
    "city": "Seattle"
  }
}
Azure Cosmos DB for MongoDB
JavaScript
db.products.find({id: 123})
JSON
{
  "id": 123,
  "name": "Hammer",
  "price": 2.99
}
Azure Cosmos DB for PostgreSQL
Azure Cosmos DB for PostgreSQL is a native PostgreSQL, globally distributed
relational database that automatically shards data to help you build highly scalable
apps. You can start building apps on a single node server group, the same way you
would with PostgreSQL anywhere else. As your app's scalability and performance
requirements grow, you can seamlessly scale to multiple nodes by transparently
distributing your tables. PostgreSQL is a relational database management system
(RDBMS) in which you define relational tables of data, for example you might
define a table of products like this:
ProductID ProductName
123 Hammer
162 Screwdriver
You could then query this table to retrieve the name and price of a specific product
using SQL like this:
SQL
SELECT ProductName, Price
FROM Products
WHERE ProductID = 123;
The results of this query would contain a row for product 123, like this:
ProductName Price
Hammer 2.99
Azure Cosmos DB for Table
PartitionKey RowKey Name Email
1 123 Joe Jones [email protected]
1 124 Samir Nadoy [email protected]
You can then use the Table API through one of the language-specific SDKs to make
calls to your service endpoint to retrieve data from the table. For example, the
following request returns the row containing the record for Samir Nadoy in the
table above:
text
https://endpoint/Customers(PartitionKey='1',RowKey='124')
Azure Cosmos DB for Apache Cassandra
ID Name Manager
1 Sue Smith
2 Ben Chan Sue Smith
SQL
SELECT * FROM Employees WHERE ID = 2
Azure Cosmos DB for Apache Gremlin
Gremlin syntax includes functions to operate on vertices and edges, enabling you
to insert, update, delete, and query data in the graph. For example, you could use
the following code to add a new employee named Alice that reports to the
employee with ID 1 (Sue):
Gremlin
g.addV('employee').property('id', '3').property('firstName', 'Alice')
g.V('3').addE('reports to').to(g.V('1'))
The following query returns all of the employee vertices, in order of ID.
Gremlin
g.V().hasLabel('employee').order().by('id')
Next steps
Now that you've learned about Azure Cosmos DB for non-relational data storage,
consider learning more about data-related workloads on Azure by pursuing a
Microsoft certification in Azure Data Fundamentals.
Explore fundamentals of large-scale analytics
Introduction
Learning objectives
In this module, you will learn how to:
Large-scale data analytics architecture can vary, as can the specific technologies
used to implement it; but in general, the following elements are included:
In either case, pipelines consist of one or more activities that operate on data. An
input dataset provides the source data, and activities can be defined as a data flow
that incrementally manipulates the data until an output dataset is produced.
Pipelines can connect to external data sources to integrate with a wide variety of
data services.
Data lakehouses
A data lake is a file store, usually on a distributed file system for high performance
data access. Technologies like Spark or Hadoop are often used to process queries
on the stored files and return data for reporting and analytics. These systems often
apply a schema-on-read approach to define tabular schemas on semi-structured
data files at the point where the data is read for analysis, without applying
constraints when it's stored. Data lakes are great for supporting a mix of structured,
semi-structured, and even unstructured data that you want to analyze without the
need for schema enforcement when the data is written to the store.
You can use a hybrid approach that combines features of data lakes and data
warehouses in a lake database or data lakehouse. The raw data is stored as files in a
data lake, and a relational storage layer abstracts the underlying files and exposes
them as tables, which can be queried using SQL. SQL pools in Azure Synapse
Analytics include PolyBase, which enables you to define external tables based on
files in a data lake (and other sources) and query them using SQL. Synapse
Analytics also supports a Lake Database approach in which you can use database
templates to define the relational schema of your data warehouse, while storing the
underlying data in data lake storage – separating the storage and compute for your
data warehousing solution. Data lakehouses are a relatively new approach in Spark-
based systems, and are enabled through technologies like Delta Lake, which adds
relational storage capabilities to Spark, so you can define tables that enforce
schemas and transactional consistency, support batch-loaded and streaming data
sources, and provide a SQL API for querying.
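As a hedged example of the schema-on-read approach described above (the storage account, container, and folder path are hypothetical), a serverless SQL pool in Azure Synapse Analytics can query files in a data lake directly by using OPENROWSET:
SQL
-- Query Parquet files in the data lake directly, applying a tabular schema at read time.
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/files/sales/*.parquet',
        FORMAT = 'PARQUET'
     ) AS sales;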
On Azure, there are three main platform-as-a-service (PaaS) services that you can use to
implement a large-scale analytical store: Azure Synapse Analytics, Azure Databricks, and
Azure HDInsight.
Note
Each of these services can be thought of as an analytical data store, in the sense
that they provide a schema and interface through which the data can be queried. In
many cases however, the data is actually stored in a data lake and the service is
used to process the data and run queries. Some solutions might even combine the
use of these services. An extract, load, and transform (ELT) ingestion process might
copy data into the data lake, and then use one of these services to transform the
data, and another to query it. For example, a pipeline might use a MapReduce job
running in HDInsight or a notebook running in Azure Databricks to process a large
volume of data in the data lake, and then load it into tables in a SQL pool in Azure
Synapse Analytics.
Knowledge check
Choose the best response for each of the questions below. Then select Check your
answers.
1. Which Azure PaaS services can you use to create a pipeline for data ingestion
and processing?
• Azure SQL Database and Azure Cosmos DB
• Azure Synapse Analytics and Azure Data Factory
• Azure HDInsight and Azure Databricks
2. What must you define to implement a pipeline that reads data from Azure
Blob Storage?
• Apache Hadoop
• Apache Spark
Summary
Large-scale data analytics is a complex workload that can involve many different
technologies. This module has provided a high-level overview of the key features of
an analytics solution, and explored some of the Microsoft services that you can use
to implement one.
Next steps
Now that you've learned about large-scale data warehousing, consider learning
more about data-related workloads on Azure by pursuing a Microsoft certification
in Azure Data Fundamentals.
Introduction
Increased use of technology by individuals, companies, and other organizations,
together with the proliferation of smart devices and Internet access has led to a
massive growth in the volume of data that can be generated, captured, and analyzed.
Much of this data can be processed in real-time (or at least, near real-time) as a
perpetual stream of data, enabling the creation of systems that reveal instant insights
and trends, or take immediate responsive action to events as they occur.
Learning objectives
In this module, you'll learn about the basics of stream processing and real-time
analytics, and the services in Microsoft Azure that you can use to implement real-time
data processing solutions. Specifically, you'll learn how to:
Batch processing
For example, suppose you want to analyze road traffic by counting the number of cars
on a stretch of road. A batch processing approach to this would require that you
collect the cars in a parking lot, and then count them in a single operation while
they're at rest.
If the road is busy, with a large number of cars driving along at frequent intervals, this
approach may be impractical; and note that you don't get any results until you have
parked a batch of cars and counted them.
A real world example of batch processing is the way that credit card companies handle
billing. The customer doesn't receive a bill for each separate credit card purchase but
one monthly bill for all of that month's purchases.
Disadvantages of batch processing include:
• The time delay between ingesting the data and getting the results.
• All of a batch job's input data must be ready before a batch can be
processed. This means data must be carefully checked. Problems with data,
errors, and program crashes that occur during batch jobs bring the whole
process to a halt. The input data must be carefully checked before the job
can be run again. Even minor data errors can prevent a batch job from
running.
Stream processing
For example, a better approach to our hypothetical car counting problem might be to
apply a streaming approach, by counting the cars in real-time as they pass:
In this approach, you don't need to wait until all of the cars have parked to start
processing them, and you can aggregate the data over time intervals; for example, by
counting the number of cars that pass each minute.
Stream processing is ideal for time-critical operations that require an instant real-time
response. For example, a system that monitors a building for smoke and heat needs to
trigger alarms and unlock doors to allow residents to escape immediately in the event
of a fire.
Understand differences between batch and streaming data
Apart from the way in which batch processing and streaming processing handle data,
there are other differences:
• Data scope: Batch processing can process all the data in the dataset.
Stream processing typically only has access to the most recent data
received, or within a rolling time window (the last 30 seconds, for
example).
• Data size: Batch processing is suitable for handling large datasets
efficiently. Stream processing is intended for individual records or micro
batches consisting of few records.
• Performance: Latency is the time taken for the data to be received and
processed. The latency for batch processing is typically a few hours.
Stream processing typically occurs immediately, with latency in the order
of seconds or milliseconds.
• Analysis: You typically use batch processing to perform complex analytics.
Stream processing is used for simple response functions, aggregates, or
calculations such as rolling averages.
The following diagram shows some ways in which batch and stream processing can be
combined in a large-scale data analytics architecture.
1. Data events from a streaming data source are captured in real-time.
2. Data from other sources is ingested into a data store (often a data lake) for
batch processing.
3. If real-time analytics is not required, the captured streaming data is written
to the data store for subsequent batch processing.
4. When real-time analytics is required, a stream processing technology is
used to prepare the streaming data for real-time analysis or visualization;
often by filtering or aggregating the data over temporal windows.
5. The non-streaming data is periodically batch processed to prepare it for
analysis, and the results are persisted in an analytical data store (often
referred to as a data warehouse) for historical analysis.
6. The results of stream processing may also be persisted in the analytical
data store to support historical analysis.
7. Analytical and visualization tools are used to present and explore the real-
time and historical data.
Note
Commonly used solution architectures for combined batch and stream data processing
include lambda and delta architectures. Details of these architectures are beyond the
scope of this course, but they incorporate technologies for both large-scale batch data
processing and real-time stream processing to create an end-to-end analytical
solution.
Common elements of stream processing architecture
There are many technologies that you can use to implement a stream processing
solution, and while specific implementation details may vary, most streaming
architectures share the common elements described below (a minimal code sketch of
these elements follows the list).
1. An event generates some data. This might be a signal being emitted by a sensor,
a social media message being posted, a log file entry being written, or any other
occurrence that results in some digital data.
2. The generated data is captured in a streaming source for processing. In simple
cases, the source may be a folder in a cloud data store or a table in a database. In
more robust streaming solutions, the source may be a "queue" that encapsulates
logic to ensure that event data is processed in order and that each event is
processed only once.
3. The event data is processed, often by a perpetual query that operates on the
event data to select data for specific types of events, project data values, or
aggregate data values over temporal (time-based) periods (or windows) - for
example, by counting the number of sensor emissions per minute.
4. The results of the stream processing operation are written to an output (or sink),
which may be a file, a database table, a real-time visual dashboard, or another
queue for further processing by a subsequent downstream query.
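As a minimal illustration of elements 2 to 4, the following sketch uses Spark Structured Streaming (one of the technologies described later in this module) to read event files from a folder, aggregate them over one-minute windows, and write the results to the console; the folder path, schema, and field names are assumptions made up for the example.
Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("sensor-stream").getOrCreate()

# 2. Source: a folder (for example, in a data lake) into which event files are written.
schema = (StructType()
          .add("sensor_id", StringType())
          .add("reading_time", TimestampType()))
events = spark.readStream.schema(schema).json("/data/incoming-sensor-events/")

# 3. Perpetual query: count sensor emissions per sensor over one-minute windows.
counts = (events
          .groupBy(window(col("reading_time"), "1 minute"), col("sensor_id"))
          .count())

# 4. Sink: write the results; here to the console for simplicity.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()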
Real-time analytics in Azure
Microsoft Azure supports multiple technologies that you can use to implement real-
time analytics of streaming data, including:
• Azure Stream Analytics: A platform-as-a-service (PaaS) solution that you can use
to define streaming jobs that ingest data from a streaming source, apply a
perpetual query, and write the results to an output.
• Spark Structured Streaming: An open-source library that enables you to
develop complex streaming solutions on Apache Spark based services,
including Azure Synapse Analytics, Azure Databricks, and Azure HDInsight.
• Azure Data Explorer: A high-performance database and analytics service that is
optimized for ingesting and querying batch or streaming data with a time-series
element, and which can be used as a standalone Azure service or as an Azure
Synapse Data Explorer runtime in an Azure Synapse Analytics workspace.
The following services are commonly used to ingest data for stream processing on
Azure:
• Azure Event Hubs: A data ingestion service that you can use to manage queues
of event data, ensuring that each event is processed in order, exactly once (a
brief example of sending events with the Python SDK follows this list).
• Azure IoT Hub: A data ingestion service that is similar to Azure Event Hubs, but
which is optimized for managing event data from Internet of Things (IoT) devices.
• Azure Data Lake Storage Gen2: A highly scalable storage service that is often used
in batch processing scenarios, but which can also be used as a source of
streaming data.
• Apache Kafka: An open-source data ingestion solution that is commonly used
together with Apache Spark. You can use Azure HDInsight to create a Kafka
cluster.
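As a simple example of ingestion with Azure Event Hubs, the following sketch sends a single event using the azure-eventhub Python package; the connection string and event hub name are placeholders that you would replace with values from your own namespace.
Python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder values; substitute the connection string and event hub name
# from your own Event Hubs namespace.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<your-event-hubs-connection-string>",
    eventhub_name="traffic-events",
)

with producer:
    batch = producer.create_batch()                    # events are sent in batches
    batch.add(EventData('{"road": "A1", "cars": 3}'))  # a single JSON-encoded event
    producer.send_batch(batch)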
The output from stream processing is often sent to the following services:
• Azure Event Hubs: Used to queue the processed data for further downstream
processing.
• Azure Data Lake Storage Gen2 or Azure Blob Storage: Used to persist the
processed results as a file.
• Azure SQL Database, Azure Synapse Analytics, or Azure Databricks: Used to
persist the processed results in a database table for querying and analysis.
• Microsoft Power BI: Used to generate real-time data visualizations in reports and
dashboards.
Azure Stream Analytics
Azure Stream Analytics is a service for complex event processing and analysis of
streaming data. Stream Analytics is used to:
• Ingest data from an input, such as an Azure event hub, Azure IoT Hub, or
Azure Storage blob container.
• Process the data by using a query to select, project, and aggregate data
values.
• Write the results to an output, such as Azure Data Lake Storage Gen2, Azure SQL
Database, Azure Synapse Analytics, Azure Functions, an Azure event hub,
Microsoft Power BI, or others.
Once started, a Stream Analytics query will run perpetually, processing new data as it
arrives in the input and storing results in the output.
Azure Stream Analytics is a great technology choice when you need to continually
capture data from a streaming source, filter or aggregate it, and send the results to a
data store or downstream process for analysis and reporting.
Note
To learn more about the capabilities of Azure Stream Analytics, see the Azure Stream
Analytics documentation.
Apache Spark
Apache Spark is a distributed processing framework for large-scale data analytics. You
can use Spark on Microsoft Azure in the following services:
• Azure Synapse Analytics
• Azure Databricks
• Azure HDInsight
Spark can be used to run code (usually written in Python, Scala, or Java) in parallel
across multiple cluster nodes, enabling it to process very large volumes of data
efficiently. Spark can be used for both batch processing and stream processing.
Spark Structured Streaming is a great choice for real-time analytics when you need to
incorporate streaming data into a Spark based data lake or analytical data store.
Note
For more information about Spark Structured Streaming, see the Spark Structured
Streaming programming guide.
Delta Lake
Delta Lake is an open-source storage layer that adds support for transactional
consistency, schema enforcement, and other common data warehousing features to
data lake storage. It also unifies storage for streaming and batch data, and can be used
in Spark to define relational tables for both batch and stream processing. When used
for stream processing, a Delta Lake table can be used as a streaming source for queries
against real-time data, or as a sink to which a stream of data is written.
The Spark runtimes in Azure Synapse Analytics and Azure Databricks include support
for Delta Lake.
Delta Lake combined with Spark Structured Streaming is a good solution when you
need to abstract batch and stream processed data in a data lake behind a relational
schema for SQL-based querying and analysis.
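For example, the following minimal PySpark sketch treats one Delta Lake table as a streaming source and another as a streaming sink. The table paths are hypothetical, and an existing SparkSession named spark with Delta Lake support (as provided by the Spark runtimes in Azure Synapse Analytics and Azure Databricks) is assumed.
Python
# Assumes an existing SparkSession named `spark` with Delta Lake configured.

# Delta table as a streaming source: new rows are picked up as they are appended.
readings = spark.readStream.format("delta").load("/delta/device_readings")

# Delta table as a streaming sink: processed rows are appended as the stream runs.
stream = (readings.writeStream
          .format("delta")
          .option("checkpointLocation", "/delta/_checkpoints/device_readings_copy")
          .start("/delta/device_readings_copy"))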
Note
For more information about Delta Lake, see What is Delta Lake?
Real-time analytics in Microsoft Fabric
Microsoft Fabric includes native support for real-time data analytics, including real-
time data ingestion from multiple streaming sources.
In Microsoft Fabric, you can use an eventstream to capture real-time event data from a
streaming source and persist it in a destination such as a table in a Lakehouse or a KQL
database.
When writing eventstream data to a Lakehouse table, you can apply aggregations and
filters to summarize the captured data. A KQL database supports tables based on the
Data Explorer engine, enabling you to perform real-time analytics on the data in tables
by running KQL queries. After capturing real-time data in a table, you can use Power BI
in Microsoft Fabric to create real-time data visualizations.
Knowledge check
1.
Which service would you use to continually capture data from an IoT Hub,
aggregate it over temporal periods, and store results in Azure SQL Database?
Azure Cosmos DB
Azure Stream Analytics
Azure Storage
Summary
In this module, you've learned about stream processing and real-time analytics, and
about some of the Microsoft services that you can use to implement real-time data
processing solutions.
Next steps
Now that you've learned about stream processing and real-time analytics, consider
learning more about data-related workloads on Azure by pursuing a Microsoft
certification in Azure Data Fundamentals.
Introduction
Data modeling and visualization is at the heart of business intelligence (BI) workloads that
are supported by large-scale data analytics solutions. Essentially, data visualization powers
reporting and decision making that helps organizations succeed.
In this module, you'll learn about fundamental principles of analytical data modeling and
data visualization, using Microsoft Power BI as a platform to explore these principles in
action.
Learning objectives
After completing this module, you'll be able to:
There are many data visualization tools that data analysts can use to explore data and
summarize insights visually, including chart support in productivity tools like Microsoft
Excel and built-in data visualization widgets in notebooks used to explore data in services
such as Azure Synapse Analytics and Azure Databricks. However, for enterprise-scale
business analytics, an integrated solution that can support complex data modeling,
interactive reporting, and secure sharing is often required.
Microsoft Power BI
Microsoft Power BI is a suite of tools and services that data analysts can use to build
interactive data visualizations for business users to consume.
A typical workflow for creating a data visualization solution starts with Power BI Desktop,
a Microsoft Windows application in which you can import data from a wide range of data
sources, combine and organize the data from these sources in an analytics data model,
and create reports that contain interactive visualizations of the data.
After you've created data models and reports, you can publish them to the Power BI
service, a cloud service in which business users can view and interact with reports.
You can also do some basic data modeling and report editing directly in the service
using a web browser, but the functionality for this is limited compared to the Power BI
Desktop tool. You can use the service to schedule refreshes of the data sources on which
your reports are based, and to share reports with other users. You can also define
dashboards and apps that combine related reports in a single, easy to consume location.
Users can consume reports, dashboards, and apps in the Power BI service through a web
browser, or on mobile devices by using the Power BI phone app.
Analytical models
Analytical models enable you to structure data to support analysis. Models are based on
related tables of data and define the numeric values that you want to analyze or report
(known as measures) and the entities by which you want to aggregate them (known
as dimensions). For example, a model might include a table containing numeric measures
for sales (such as revenue or quantity) and dimensions for products, customers, and time.
This would enable you to aggregate sales measures across one or more dimensions (for
example, to identify total revenue by customer, or total items sold by product per month).
Conceptually, the model forms a multidimensional structure, which is commonly referred
to as a cube, in which any point where the dimensions intersect represents an aggregated
measure for those dimensions.
Note
Although we commonly refer to an analytical model as a cube, there can be more (or
fewer) than three dimensions – it’s just not easy for us to visualize more than three!
Tables and schema
Dimension tables represent the entities by which you want to aggregate numeric
measures – for example product or customer. Each entity is represented by a row with a
unique key value. The remaining columns represent attributes of an entity – for example,
products have names and categories, and customers have addresses and cities. It’s
common in most analytical models to include a Time dimension so that you can
aggregate numeric measures associated with events over time.
The numeric measures that will be aggregated by the various dimensions in the model are
stored in Fact tables. Each row in a fact table represents a recorded event that has numeric
measures associated with it. For example, the Sales table in the schema below represents
sales transactions for individual items, and includes numeric values for quantity sold and
revenue.
This type of schema, where a fact table is related to one or more dimension tables, is
referred to as a star schema (imagine there are five dimensions related to a single fact
table – the schema would form a five-pointed star!). You can also define a more complex
schema in which dimension tables are related to additional tables containing more details
(for example, you could represent attributes of product categories in a
separate Category table that is related to the Product table – in which case the design is
referred to as a snowflake schema). The schema of fact and dimension tables is used to
create an analytical model, in which measure aggregations across all dimensions are pre-
calculated, making analysis and reporting activities much faster than calculating the
aggregations each time.
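Although Power BI builds and stores analytical models for you, the way measures in a fact table are aggregated across dimension attributes can be illustrated with a small, purely hypothetical Python sketch using pandas; the table and column names below are invented for the example.
Python
import pandas as pd

# Hypothetical dimension and fact tables in a simple star schema.
product = pd.DataFrame({
    "ProductKey": [1, 2],
    "ProductName": ["Widget", "Gadget"],
    "Category": ["Tools", "Tools"],
})
sales = pd.DataFrame({
    "ProductKey": [1, 1, 2],
    "Quantity": [2, 1, 5],
    "Revenue": [20.0, 10.0, 75.0],
})

# Join the fact table to a dimension and aggregate the measures by an attribute.
total_by_product = (sales.merge(product, on="ProductKey")
                         .groupby("ProductName")[["Quantity", "Revenue"]]
                         .sum())
print(total_by_product)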
Attribute hierarchies
One final thing worth considering about analytical models is the creation of
attribute hierarchies that enable you to quickly drill up or drill down to find aggregated
values at different levels in a hierarchical dimension. For example, consider the attributes
in the dimension tables we’ve discussed so far. In the Product table, you can form a
hierarchy in which each category might include multiple named products. Similarly, in
the Customer table, a hierarchy could be formed to represent multiple named customers
in each city. Finally, in the Time table, you can form a hierarchy of year, month, and day.
The model can be built with pre-aggregated values for each level of a hierarchy, enabling
you to quickly change the scope of your analysis – for example, by viewing total sales by
year, and then drilling down to see a more detailed breakdown of total sales by month.
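The effect of a time hierarchy can be sketched in the same hypothetical style: the same fact rows are aggregated first at the year level and then drilled down to year and month (the column names are again invented for the example).
Python
import pandas as pd

# Hypothetical fact rows with a Year -> Month time hierarchy.
sales = pd.DataFrame({
    "Year": [2024, 2024, 2024, 2025],
    "Month": ["Jan", "Feb", "Feb", "Jan"],
    "Revenue": [100.0, 80.0, 40.0, 120.0],
})

# Drill up: total revenue by year.
print(sales.groupby("Year")["Revenue"].sum())

# Drill down: total revenue by year and month.
print(sales.groupby(["Year", "Month"])["Revenue"].sum())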
There are many kinds of data visualization, some commonly used and some more
specialized. Power BI includes an extensive set of built-in visualizations, which can be
extended with custom and third-party visualizations. The rest of this unit discusses some
common data visualizations but is by no means a complete list.
Tables and text
Tables and text are often the simplest way to communicate data. Tables are useful when
numerous related values must be displayed, and individual text values in cards can be a
useful way to show important figures or metrics.
Line charts
Line charts can also be used to compare categorized values and are useful when you need
to examine trends, often over time.
Pie charts
Pie charts are often used in business reports to visually compare categorized values as
proportions of a total.
Scatter plots
Scatter plots are useful when you want to compare two numeric measures and identify a
relationship or correlation between them.
Maps
Maps are a great way to visually compare values for different geographic areas or
locations.
Knowledge check
Choose the best response for each of the questions below. Then select Check your answers.
1.
Which tool should you use to import data from multiple data sources and create a
report?
Power BI Desktop
Power BI Phone App
Azure Data Factory
2.
What should you define in your data model to enable drill-up/down analysis?
A measure
A hierarchy
A relationship
3.
Which kind of visualization should you use to analyze pass rates for multiple exams
over time?
A pie chart
A scatter plot
A line chart
Summary
Data modeling and visualization enables organizations to extract insights from data.
Next steps
Now that you've learned about data modeling and visualization, consider learning more
about data-related workloads on Azure by pursuing a Microsoft certification in Azure Data
Fundamentals.