Notes - DP900
Azure Data Fundamentals
Agenda
The following topics will be covered:
Structured data is organized into tables of rows and columns. Databases that hold tables in this form are called relational databases.
Semi-structured data is information that doesn't reside in a relational database but still has
some structure to it. Examples include documents held in JavaScript Object Notation (JSON) format.
Not all data is structured or even semi-structured. For example, audio and video files, and binary
data files might not have a specific structure. They're referred to as unstructured data.
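The distinction shows up directly in code. A minimal Python sketch (the records and field names are illustrative): structured rows all share one fixed shape, while semi-structured JSON documents can each carry different fields.

```python
import json

# Structured: every row has the same fixed columns.
structured_rows = [
    ("101", "Piyush", "Sachdeva"),
    ("102", "Anita", "Rao"),
]

# Semi-structured: each JSON document has *some* structure,
# but the fields can differ from document to document.
doc_a = json.loads('{"customerID": "101", "name": {"first": "Piyush", "last": "Sachdeva"}}')
doc_b = json.loads('{"customerID": "102", "name": {"title": "Mr", "first": "Anita"}}')

# Both documents parse fine even though their shapes differ.
print(sorted(doc_a["name"].keys()))  # ['first', 'last']
print(sorted(doc_b["name"].keys()))  # ['first', 'title']
```

Unstructured data (audio, video, arbitrary binary files) has no such field structure at all; it is stored and retrieved as opaque bytes.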
Data processing
Data processing is the conversion of raw data into meaningful information. Depending on how the data is ingested into your system, you can process each data item as it arrives (stream processing), or buffer the raw data and process it in groups (batch processing).
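The two approaches above (per-item versus buffered) can be sketched in a few lines of Python (the data and group size are illustrative):

```python
# Two ways to process ingested data (illustrative sketch):
# 1) stream: handle each item as it arrives
# 2) batch: buffer raw items and process them in groups

raw = [3, 1, 4, 1, 5, 9, 2, 6]

# Stream processing: transform each item immediately on arrival.
streamed = []
for item in raw:
    streamed.append(item * 2)

# Batch processing: buffer items into groups, then process each group.
def batches(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

batch_sums = [sum(group) for group in batches(raw, 4)]

print(streamed)    # each item doubled as it arrived
print(batch_sums)  # one aggregated result per buffered group
```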
DML (Data Manipulation Language)
Used to store, modify, retrieve, delete and update data in a database.
Index: helps improve data retrieval speed. Example: CREATE INDEX index_name ON table_name (column_name);
View: a virtual table whose fields come from one or more real tables in the database. Example: SELECT * FROM student_details;
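A minimal sketch of the index and view ideas, using Python's built-in sqlite3 module in place of SQL Server (table, index, and view names are illustrative):

```python
import sqlite3

# SQLite stands in for SQL Server here; the SQL statements are standard.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_details (id INTEGER, name TEXT, marks INTEGER)")
conn.executemany("INSERT INTO student_details VALUES (?, ?, ?)",
                 [(1, "Asha", 82), (2, "Ravi", 67), (3, "Meena", 91)])

# Index: speeds up data retrieval on the indexed column.
conn.execute("CREATE INDEX idx_name ON student_details (name)")

# View: a virtual table built from fields of one or more real tables.
conn.execute("CREATE VIEW toppers AS SELECT name FROM student_details WHERE marks > 80")

rows = conn.execute("SELECT * FROM toppers ORDER BY name").fetchall()
print(rows)  # [('Asha',), ('Meena',)]
```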
NOT NULL Constraint − Ensures that a column cannot have a NULL value.
UNIQUE Constraint − Ensures that all the values in a column are different.
INDEX − Used to create and retrieve data from the database very quickly.
• Domain Integrity − Enforces valid entries for a given column by restricting the type,
the format, or the range of values.
• Referential Integrity − Rows that are used by other records cannot be deleted.
• User-Defined Integrity − Enforces some specific business rules that do not fall into entity,
domain or referential integrity.
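The NOT NULL and UNIQUE constraints can be demonstrated with a small sqlite3 sketch (SQLite is used for convenience; the constraint keywords are standard SQL, and the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        email TEXT NOT NULL UNIQUE,   -- no NULLs, no duplicates
        name  TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO users VALUES ('a@example.com', 'Asha')")

errors = []
try:
    conn.execute("INSERT INTO users VALUES (NULL, 'Ravi')")  # violates NOT NULL
except sqlite3.IntegrityError:
    errors.append("not_null")
try:
    conn.execute("INSERT INTO users VALUES ('a@example.com', 'Dup')")  # violates UNIQUE
except sqlite3.IntegrityError:
    errors.append("unique")

print(errors)  # ['not_null', 'unique']
```

Both bad inserts are rejected by the database itself, which is the point of declarative integrity rules: they hold regardless of which application writes the data.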
OLTP vs OLAP
OLTP: management of transactional data using computer systems. Choose OLTP when you need to efficiently process and store business transactions and immediately make them available to client applications in a consistent way.
OLAP: complex business analysis on large business databases. Choose OLAP when you need to execute complex analytical and ad hoc queries without impacting your OLTP systems.
IaaS: gives full control over infra resources such as virtual machines, storage, etc. You must take care of all the admin tasks such as patching, upgrades, and backups.
PaaS: gives a runtime environment/platform to deploy applications, plus development tools. Azure takes care of all the admin tasks, including automated backups.
SaaS: gives access to the end users. Azure takes care of all the admin tasks.
Microsoft takes care of all your administrative tasks including server patching, backups and updates.
You have no direct control over the platform on which the services run.
By default, your database is protected by a server-level firewall.
Azure SQL Database (PaaS)
Deployment options: Single Database, Elastic Pool, Managed Instance.
Single Database: enables you to quickly set up and run a single SQL Server database (cheapest). By default, resources are pre-allocated, and you're charged per hour for the resources you've requested. You can also specify a serverless configuration, where your database automatically scales and resources are allocated or deallocated as required.
Elastic Pool: similar to Single Database, except that by default multiple databases can share the same resources, such as memory, data storage space, and processing power. The resources are referred to as a pool. You create the pool, and only your databases can use the pool. You are charged per pool.
Managed Instance
Managed instance effectively runs a fully controllable instance of SQL Server in the cloud
You can install multiple databases on the same instance. You have complete control over this
instance, much as you would for an on-premises server
The Managed instance service automates backups, software patching, database monitoring, and other
general tasks, but you have full control over security and resource allocation for your databases
Managed instance has near-100% compatibility with SQL Server Enterprise Edition running on-premises.
Consider Azure SQL Database managed instance if you want to lift-and-shift an on-premises SQL
Server instance and all its databases to the cloud, without incurring the management overhead of
running SQL Server on a virtual machine. (BYOL)
SQL Server in a Virtual Machine ( IaaS)
SQL Server on Virtual Machines enables you to use full versions of SQL Server in the Cloud
without having to manage any on-premises hardware
You can easily move your on-premises SQL Database to Azure VM (Windows/Linux).
This approach is suitable for migrations and applications requiring access to operating system
features that might be unsupported at the PaaS level.
SQL virtual machines are lift-and-shift ready for existing applications that require fast migration
to the cloud with minimal changes.
You get all the cloud benefits, such as scalability, elasticity, and high performance, with no limitations on DBMS features.
You remain responsible for maintaining the SQL Server software and performing the various
administrative tasks to keep the database running from day-to-day.
IaaS vs PaaS summary:
IaaS: SQL Server in a Virtual Machine.
PaaS: Single Database, Elastic Pool, Managed Instance.
How to work with Non-Relational Data on Azure (25-30%)
Non-Relational DB (NOSQL)
NoSQL stands for "Not Only SQL" or "Not SQL".
A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. A NoSQL database system, by contrast, encompasses a wide range of database technologies that can store structured, semi-structured, and unstructured data.
Doesn’t follow fixed schema structure
Doesn’t support features of a relational database
Column based: columns are divided into column families which hold related data.
Object based: unstructured/semi-structured data storage for binary large objects: images, videos, VM disk images.
Azure CosmosDB
Azure Cosmos DB is a multi-model NoSQL database management system.
Cosmos DB manages data as a partitioned set of documents.
A document is a collection of fields, identified by a key.
The fields in each document can vary, and a field can contain child documents.
Uses partition keys for high performance/query optimization
Example

## Document 1 ##
{
  "customerID": "101",
  "name":
  {
    "first": "Piyush",
    "last": "Sachdeva"
  }
}

## Document 2 ##
{
  "customerID": "102",
  "name":
  {
    "title": "Mr",
    "firstname": "Piyush",
    "lastname": "Sachdeva"
  }
}
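A short Python sketch of why this flexibility matters when reading such documents (the helper function is hypothetical, not a Cosmos DB API):

```python
import json

# The two documents above have different shapes for "name";
# a document store accepts both. Read them defensively.
doc1 = json.loads('{"customerID": "101", "name": {"first": "Piyush", "last": "Sachdeva"}}')
doc2 = json.loads('{"customerID": "102", "name": {"title": "Mr", "firstname": "Piyush", "lastname": "Sachdeva"}}')

def first_name(doc):
    # Fields vary per document, so fall back across the known variants.
    name = doc["name"]
    return name.get("first") or name.get("firstname")

print([first_name(d) for d in (doc1, doc2)])  # ['Piyush', 'Piyush']
```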
CosmosDB APIs
SQL API Enables you to run SQL queries over JSON data.
Table API This interface enables you to use the Azure Table Storage API to store and retrieve
documents.
MongoDB API Many organizations run MongoDB(document-based DB) on-premises. You can use the
MongoDB API for Cosmos DB to enable a MongoDB application to run unchanged against a Cosmos
DB database or you can migrate MongoDB to CosmosDB in the cloud.
Cassandra API: Cassandra is a column-based DBMS; the primary purpose of the Cassandra API is to enable you to quickly migrate Cassandra databases and applications to Cosmos DB.
Gremlin API: implements a graph database interface to Cosmos DB. A graph is a collection of data objects (nodes) and directed relationships (edges). Data is still held as a set of documents in Cosmos DB, but the Gremlin API enables you to perform graph queries over the data.
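A plain-Python sketch of the nodes-and-edges idea (this is not Gremlin syntax; the vertex and edge labels are illustrative):

```python
# A graph is a set of nodes plus directed, labeled edges between them.
nodes = {"alice": "person", "bob": "person", "contoso": "company"}
edges = [("alice", "worksFor", "contoso"),
         ("bob", "worksFor", "contoso"),
         ("alice", "knows", "bob")]

def out_vertices(source, label):
    # Follow directed edges with the given label from a source node.
    return [dst for (src, lbl, dst) in edges if src == source and lbl == label]

print(out_vertices("alice", "worksFor"))  # ['contoso']
print(out_vertices("alice", "knows"))     # ['bob']
```

A graph query engine answers questions like "who does alice work for?" by traversing edges rather than joining tables.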
Azure Table Storage
Azure Table Storage implements the NoSQL key-value model
In this model, the data for an item is stored as a set of fields, and the item is identified by a unique key.
Items are referred to as rows, and fields are known as columns.
Unlike an RDBMS, it allows you to store unstructured data.
Simple to scale, and allows up to 5 PB of data.
Read/write performance comparable to a relational DB; use the partition key to increase performance.
Row insertion and data retrieval are fast.
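The key-value model can be sketched as a Python dict keyed by (PartitionKey, RowKey); the entity names and fields below are illustrative, not an Azure SDK call:

```python
# Azure Table Storage identifies each row by (PartitionKey, RowKey).
# A dict keyed the same way shows why point lookups are fast when
# you supply both keys.
table = {
    ("cust", "101"): {"name": "Piyush", "city": "Delhi"},
    ("cust", "102"): {"name": "Anita"},      # columns can differ per row
    ("orders", "5001"): {"total": 250},
}

# Point lookup: partition key + row key -> one entity, no scan.
entity = table[("cust", "101")]

# Partition scan: all rows sharing a partition key.
cust_rows = [k for k in table if k[0] == "cust"]

print(entity["name"], len(cust_rows))  # Piyush 2
```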
Azure Blob Storage
Azure Blob storage is a service that enables you to store massive amounts of unstructured data, or
blobs, in the cloud.
Many applications need to store large, binary data objects, such as images, video, virtual machine images, and so on. These are called blobs.
Inside an Azure storage account, you create blobs inside containers (folders). You can group similar blobs together in a container.
Block blob: a set of blocks; each block can vary in size, up to 100 MB.
Page blob: a collection of fixed-size pages, 512 bytes each; supports random read/write.
Append blob: optimized to support append operations; you can only add blocks to the end of an append blob; updating or deleting existing blocks is not supported.
Azure Blob Storage: Access Tiers
Hot tier: the default tier.
Cool tier: used for infrequent data access.
Azure File Storage
Azure File Storage exposes file shares using the Server Message Block 3.0 (SMB) protocol.
Once you've created a storage account, you can upload files to Azure File Storage using the
Azure portal, or tools such as the AzCopy utility.
Graph based: when you need to define relationships in the form of graphs (Cosmos DB Gremlin API).
Key-value: data is accessed using a single key; used for caching, user profile management, session management (Azure Table Storage, Cosmos DB Table API).
Document: JSON documents for content/inventory management, product catalogs (Cosmos DB SQL API).
File share in the cloud, SMB 3.0 protocol (Azure File Share).
Analytics workload on Azure (25-30%)
Data Analytics Core Concepts
Data analytics is concerned with examining, transforming, and arranging data so that you can study it
and extract useful information
Data Analytics stages :
Ingestion: Taking the data from multiple sources into your processing system.
Processing: Transformation of data into more meaningful form
Visualization: graphical representation of processed data in the form of graphs, diagrams, charts, maps, etc., for reporting and business intelligence purposes.
ETL vs ELT
ETL (Extract, Transform and Load): data is transformed as part of ingestion; typical transformations include aggregating, validation, filtering, joining, sorting, cleaning, and de-duplication.
ELT (Extract, Load and Transform): the target data store is a data warehouse, using either a Hadoop cluster or Azure Synapse Analytics. The target data store should be powerful enough to transform the data.
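A minimal ETL sketch in Python, with SQLite standing in for the target data store (the records and table name are illustrative):

```python
import sqlite3

# Extract: raw records from a source system.
raw = [
    {"id": 1, "amount": "100"},
    {"id": 1, "amount": "100"},   # duplicate
    {"id": 2, "amount": None},    # invalid, filtered out
    {"id": 3, "amount": "250"},
]

# Transform: validation + de-duplication + type cleaning.
seen, clean = set(), []
for rec in raw:
    if rec["amount"] is None or rec["id"] in seen:
        continue
    seen.add(rec["id"])
    clean.append((rec["id"], int(rec["amount"])))

# Load: write the cleaned rows into the target data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 350
```

In ELT the same filtering and de-duplication would instead run as queries inside the warehouse after the raw rows are loaded.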
Data Analytics Techniques
Descriptive: what has happened, based on historical data.
Diagnostic: why things happened.
Prescriptive: what actions should we take to achieve a target.
Predictive: what will happen in the future, based on past trends.
Cognitive: what might happen if circumstances change (AI/ML).
SSMS (SQL Server Management Studio): used for complex admin tasks, platform configuration, security management, user management, vulnerability assessment, performance tuning, and querying.
When to use Azure Synapse Analytics (data warehousing):
When queries are long running and affect day-to-day operations.
When you want to archive data (remove historical data from the day-to-day system).
[Diagram: data sources (Table Storage, on-prem DB) feed an orchestration pipeline into Azure Synapse Analytics, which feeds Azure Analysis Services and Power BI.]
Azure Data Services for Data Warehousing
Azure Data Factory
Azure Data Factory is described as a data integration service. It is responsible for the collection, transformation, and storage of data collected from multiple sources.
Pipeline Triggers
Scheduled trigger
Event-Based
Manual
Azure Data Lake Storage
A data lake is a repository for large quantities of raw data
You can think of a data lake as a staging point for your ingested data, before it’s transported and
converted into a format suitable for performing analytics
Data Lake Storage organizes your files into directories and subdirectories for improved file organization.
(Hierarchical Namespace)
Compatible with HDFS (Hadoop Distributed File System), used to examine huge datasets.
Role-Based Access Control (RBAC) on your data at the file and directory level (POSIX access control lists).
[Diagram: data sources (e.g. Cosmos DB) are ingested into Azure Data Lake.]
To implement Azure Data Lake you need to have a storage account.
Azure Databricks
Azure Databricks is an Apache Spark environment running on Azure to provide big data
processing, streaming, and machine learning.
When processing streaming data, Databricks performs your computations incrementally and continuously updates the result as streaming data arrives.
Azure Databricks provides a graphical user interface where you can define and test your
processing step by step, before submitting it as a set of batch tasks.
Azure Synapse Analytics
You can ingest data from external sources, such as flat files, Azure Data Lake, or other database management systems, and then transform and aggregate this data into a format suitable for analytics processing.
You can perform complex queries over this data and generate reports, graphs, and charts.
It stores and processes the data locally for faster processing.
This approach enables you to repeatedly query the same data without the overhead of
fetching and converting it each time.
You can also use this data as input to further analytical processing, using Azure Analysis Services.
Azure Synapse Analytics leverages a massively parallel processing (MPP) architecture.
This architecture includes a control node and a pool of compute nodes.
The control node receives processing requests from applications and distributes the work evenly across the compute nodes for parallel processing.
Results from each node are then sent back to the control node, where they are combined into an overall result.
In a SQL pool, each compute node uses an Azure SQL Database and Azure
Storage to handle a portion of the data.
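The control-node/compute-node flow can be sketched as a scatter-gather aggregation (simulated sequentially in plain Python; the node count and data are illustrative):

```python
# MPP idea in miniature: a control node splits the work across
# compute nodes, each node aggregates only its own portion of the
# data, and the partial results are combined into the overall result.
data = list(range(1, 101))   # pretend this is a large table

def compute_node(portion):
    # Each compute node handles a portion of the data.
    return sum(portion)

def control_node(rows, nodes=4):
    # Distribute the rows evenly across the compute nodes...
    chunk = len(rows) // nodes
    portions = [rows[i * chunk:(i + 1) * chunk] for i in range(nodes)]
    partials = [compute_node(p) for p in portions]
    # ...then combine the partial results.
    return sum(partials)

print(control_node(data))  # 5050, same as summing the whole table
```

In a real SQL pool the portions live on separate nodes and run truly in parallel; the scatter-gather structure is the same.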
You can combine data from multiple sources, including Azure SQL Database, Azure Synapse Analytics, Azure
Data Lake store, Azure Cosmos DB, and many others.
Recommended Usage
If you have large amounts of ingested data that require preprocessing, you can use Synapse Analytics to process the data and reduce it into smaller datasets that can then be analyzed by Azure Analysis Services.
Azure HDInsight
Azure HDInsight is a big data processing service that provides the platform for technologies such as Spark in an Azure environment.
HDInsight implements a clustered model that distributes processing across a set of computers
This model is similar to that used by Synapse Analytics, except that the nodes are running the Spark
processing engine rather than Azure SQL Database.
Data Processing
Data Sets
A dataset in Azure Data Factory represents the data that you want to ingest (input) or store.
If your data has a structure, a dataset specifies how the data is structured.
For example, if you are using blob storage as input, the dataset would specify which blob to ingest and the format of the information in the blob (binary data, JSON, delimited text, and so on).
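The role of a dataset can be sketched as a simple structure (a simplified, illustrative shape, not the actual Data Factory JSON schema; all names and values are hypothetical):

```python
import json

# A dataset captures two things: which data to ingest or store,
# and how that data is structured.
dataset = {
    "name": "InputSalesBlob",
    "linkedService": "MyBlobStorage",                 # where the data lives
    "location": {"container": "raw", "blob": "sales.csv"},  # which blob
    "format": {"type": "DelimitedText", "delimiter": ","},  # how it's structured
}

print(json.dumps(dataset, indent=2))
```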
Control Flow: orchestrates a pipeline.
Integration Runtime: the compute environment for a pipeline.
Trigger: initiates the pipeline.
Parts of Power BI
Building blocks of Power BI: Visualizations, Datasets, Reports, Dashboards, Tiles.
Reports in Power BI
Paginated (static) report:
Printed and shared
Formatted
Contains data on multiple pages
Use Power BI Report Builder to create the paginated report
Share the report via the Power BI service
Interactive report:
Viewed on screen
Customized as per your requirements
More visuals; makes use of 'hover'
User can change the layout of the design
Use Power BI Report Server to serve the interactive reports (Premium)
Power BI content workflow
Build: build reports using Power BI Desktop.
Share: share the report via the Power BI service.