
DP-900

Azure Data Fundamentals
Agenda
The following topics will be covered:

• Core Data Concepts
• Relational Data Workloads
• NoSQL Data Workloads
• Data Analytics and Processing


Core Data Concepts (15-20%)
What is Data?
A collection of facts such as numbers, descriptions, and observations used in decision making.

There are three categories of data: structured, semi-structured, and unstructured.


Structured data is typically tabular data that is represented by rows and columns in a database.

Databases that hold tables in this form are called relational databases.
Semi-structured data is information that doesn't reside in a relational database but still has
some structure to it. Examples include documents held in JavaScript Object Notation (JSON) format.

Not all data is structured or even semi-structured. For example, audio and video files, and binary
data files might not have a specific structure. They're referred to as unstructured data.

Data processing
Data processing is simply the conversion of raw data to meaningful information.
Depending on how the data is ingested into your system, you can process each data item as it arrives,
or buffer the raw data and process it in groups.

Streaming
Processing data as it arrives is called streaming.
Example: when you play a video on YouTube or Netflix, the service streams the data to your browser in real time.

Batch Processing
Buffering and processing the data in groups is called batch processing.
Example: counting votes in an election, where the data is collected and counted in batches.
RDBMS
Data is represented in the form of rows and columns.
A collection of related data entries is called a table.

Employee ID | Name   | Department
1           | Piyush | IT
2           | John   | HR
3           | David  | Management

Each row in the table is a record; Employee ID, Name, and Department are the columns.
A collection of multiple tables and database objects is called a relational database.


Normalization

• Stores and organizes relational data in the most efficient manner
• Improves data integrity
• Creates relationships between database tables
• Enforces constraints and a fixed schema
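A minimal normalization sketch in SQL, using hypothetical departments/employees tables: department details live in one table and are referenced by key, rather than being repeated on every employee row.

CREATE TABLE departments (
    dept_id   INT PRIMARY KEY,
    dept_name VARCHAR(50) NOT NULL
);

CREATE TABLE employees (
    emp_id  INT PRIMARY KEY,
    name    VARCHAR(50) NOT NULL,
    dept_id INT REFERENCES departments(dept_id)  -- relationship between the tables
);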
SQL Commands
DDL (Data Definition Language)
Helps define the structure of a database or schema, and how the data is stored in the database.

• Create: creates a database and its objects (tables, indexes, views, stored procedures, functions, and triggers)
• Alter: alters the structure of an existing database object
• Drop: deletes objects from the database (tables, indexes, views)
• Truncate: removes all records from a table
• Comment: adds comments to the data dictionary
• Rename: renames a database object
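A minimal DDL sketch showing the commands above in order, using a hypothetical demo table:

CREATE TABLE demo (id INT, note VARCHAR(100));  -- define a new table
ALTER TABLE demo ADD created_on DATE;           -- change its structure
TRUNCATE TABLE demo;                            -- remove all rows, keep the table
DROP TABLE demo;                                -- remove the table itself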
DML (Data Manipulation Language)
Used to store, modify, retrieve, delete, and update data in a database.

• Select: retrieves data from a database
• Insert: inserts data into a table
• Update: updates existing data within a table
• Delete: deletes records from a database table
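A minimal DML sketch using the students table from these notes; the inserted values are hypothetical:

INSERT INTO students (StudentID, Name) VALUES (125, 'Asha');
SELECT Name FROM students WHERE StudentID = 125;
UPDATE students SET Name = 'Asha K' WHERE StudentID = 125;
DELETE FROM students WHERE StudentID = 125;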


Database Objects
Most of the major database engines offer the same set of major database object types:

Table: holds data as rows and columns.

Students:
Student ID | Name   | Age
121        | Piyush | 32
123        | David  | 30
124        | John   | 28

Grades:
ID  | Name   | Grade | StudentID
101 | Piyush | B     | 121
201 | David  | A     | 123
301 | John   | C     | 124

Index: helps improve data retrieval speed.
CREATE INDEX index_name ON table_name (column1, ...);

View: a virtual table; the fields in a view are fields from one or more real tables in the database.

Syntax:
CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE condition;

Example:
CREATE VIEW student_details AS
SELECT s.Name, s.Age, g.Grade
FROM Students s, Grades g
WHERE s.StudentID = g.StudentID;

SELECT * FROM student_details;

Name   | Age | Grade
Piyush | 32  | B
David  | 30  | A
John   | 28  | C
SQL CONSTRAINTS
Rules enforced on the data columns of a table.
They are used to limit the type of data that can go into a table,
and they ensure the accuracy and reliability of the data in the database.

NOT NULL Constraint − Ensures that a column cannot have a NULL value.

Syntax:
CREATE TABLE table_name (
    column1 datatype constraint,
    column2 datatype constraint,
    column3 datatype constraint,
    ....
);

Example:
CREATE TABLE students (
    StudentID int NOT NULL,
    Name varchar(255) NOT NULL,
    FirstName varchar(255) NOT NULL,
    LastName varchar(255)
);

DEFAULT Constraint − Provides a default value for a column when none is specified.

CREATE TABLE students (
    StudentID int NOT NULL,
    Name varchar(255) NOT NULL,
    Address varchar(255) DEFAULT 'India'
);

UNIQUE Constraint − Ensures that all the values in a column are different.

CREATE TABLE students (
    StudentID int NOT NULL UNIQUE,
    Name varchar(255) NOT NULL,
    FirstName varchar(255) NOT NULL,
    LastName varchar(255)
);
PRIMARY Key − Uniquely identifies each row/record in a database table. (UNIQUE + NOT NULL = PRIMARY KEY)

CREATE TABLE students (
    StudentID int PRIMARY KEY,
    Name varchar(255) NOT NULL,
    Address varchar(255) DEFAULT 'India'
);

FOREIGN Key − References a row/record in another database table, linking the two tables.

Students (StudentID is the primary key):
Student ID | Name   | Age
121        | Piyush | 32
123        | David  | 30
124        | John   | 28

Grades (StudentID is the foreign key):
ID  | Name   | Grade | StudentID
101 | Piyush | B     | 121
201 | David  | A     | 123
301 | John   | C     | 124
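A sketch of declaring that relationship in SQL, with table and column names as above:

CREATE TABLE grades (
    ID int PRIMARY KEY,
    Name varchar(255),
    Grade char(1),
    StudentID int,
    FOREIGN KEY (StudentID) REFERENCES students(StudentID)  -- links each grade to a student
);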
CHECK Constraint − Ensures that all values in a column satisfy certain conditions.

CREATE TABLE students (
    StudentID int NOT NULL,
    Name varchar(255) NOT NULL,
    FirstName varchar(255) NOT NULL,
    Age int CHECK (Age>=18)
);

INDEX − Used to retrieve data from the database very quickly.

CREATE INDEX index_name ON table_name (column1, ...);


Data Integrity
• Entity Integrity − There are no duplicate rows in a table.

• Domain Integrity − Enforces valid entries for a given column by restricting the type,
the format, or the range of values.
• Referential Integrity − Rows that are referenced by other records cannot be deleted.
• User-Defined Integrity − Enforces some specific business rules that do not fall into entity,
domain or referential integrity.
OLTP vs OLAP

OLTP (Online Transaction Processing):
• Management of transactional data using computer systems
• OLTP systems record business interactions as they occur in the day-to-day operation of the organization
• Choose OLTP when you need to efficiently process and store business transactions and immediately
make them available to client applications in a consistent way
• Examples: business transactions related to payments, orders, inventories, etc.

OLAP (Online Analytical Processing):
• Complex business analysis on large business databases
• Can be used to perform complex analytical queries without negatively affecting day-to-day business operations
• Choose OLAP when you need to execute complex analytical and ad hoc queries without impacting your OLTP systems
• Examples: reporting and forecasting, trend reports, market sentiment, recommendations and suggestions, etc.
IaaS vs PaaS vs SaaS

IaaS (Infrastructure as a Service):
• Gives full control over infrastructure resources such as virtual machines, storage, etc.
• You must take care of all the admin tasks, such as patching, upgrades, and backups
• Pay-per-use model
• Examples: Azure VM, VNET, AWS EC2 servers

PaaS (Platform as a Service):
• Gives a runtime environment/platform to deploy applications, plus development tools
• Azure takes care of all the admin tasks, including automated backups
• Pay-per-service model
• Examples: Azure DevOps, Azure Web App, OpenShift

SaaS (Software as a Service):
• Gives ready-to-use software access to the end users
• Azure takes care of all the admin tasks
• Pay-per-subscription model
• Examples: Dropbox, Office 365, Teams
How to work with Relational Data on Azure (25-30%)
Azure Data Services for RDBMS
Azure Data Services fall into the PaaS category.
These services are a series of DBMSs managed by Microsoft in the cloud.

• Azure SQL Database
• Azure Database for PostgreSQL
• Azure Database for MySQL
• Azure Database for MariaDB

Microsoft takes care of all your administrative tasks including server patching, backups and updates.
You have no direct control over the platform on which the services run.
By default, your database is protected by a server-level firewall.
Azure SQL Database (PaaS)
Azure SQL Database comes in three flavors: Single Database, Elastic Pool, and Managed Instance.

Single Database
• Enables you to quickly set up and run a single SQL Server database (cheapest option)
• By default, resources are pre-allocated, and you're charged per hour for the resources you've requested
• You can also specify a serverless configuration: your database automatically scales, and resources
are allocated or deallocated as required

Elastic Pool
• Similar to Single Database, except that by default multiple databases can share the same resources,
such as memory, data storage space, and processing power
• The resources are referred to as a pool; you create the pool, and only your databases can use it
• You are charged per pool
Azure SQL Database (PaaS)
Managed Instance

• Effectively runs a fully controllable instance of SQL Server in the cloud
• You can install multiple databases on the same instance, and you have complete control over the
instance, much as you would for an on-premises server
• The service automates backups, software patching, database monitoring, and other general tasks,
but you retain full control over security and resource allocation for your databases
• Managed Instance has near-100% compatibility with SQL Server Enterprise Edition running on-premises
• Consider Azure SQL Database Managed Instance if you want to lift-and-shift an on-premises SQL
Server instance and all its databases to the cloud, without incurring the management overhead of
running SQL Server on a virtual machine (BYOL)
SQL Server in a Virtual Machine (IaaS)
• SQL Server on Virtual Machines enables you to use full versions of SQL Server in the cloud
without having to manage any on-premises hardware
• You can easily move your on-premises SQL Server database to an Azure VM (Windows/Linux)
• This approach is suitable for migrations and for applications requiring access to operating-system
features that might be unsupported at the PaaS level
• SQL virtual machines are lift-and-shift ready for existing applications that require fast migration
to the cloud with minimal changes
• You get all the cloud benefits, such as scalability, elasticity, and high performance, with no
limitations on the DBMS
• You remain responsible for maintaining the SQL Server software and performing the various
administrative tasks to keep the database running day-to-day
Summary: IaaS vs PaaS
• IaaS: SQL Server in a Virtual Machine
• PaaS: Azure SQL Database (Single Database, Elastic Pool, Managed Instance), Azure Database for MySQL,
Azure Database for PostgreSQL, Azure Database for MariaDB
How to work with Non-Relational Data on Azure (25-30%)
Non-Relational Databases (NoSQL)
NoSQL stands for "Not Only SQL" or "Not SQL".
A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. A NoSQL database
system, by contrast, encompasses a wide range of database technologies that can store structured,
semi-structured, and unstructured data.
• Doesn't follow a fixed schema
• Doesn't support all the features of a relational database

Types of NoSQL Data Stores

• Document: high volumes of JSON data
• Graph: relationships between nodes and edges in a graph
• Key-Value: multiple key-value pairs
• Column-based: columns are divided into column families that hold related data
• Object-based: unstructured/semi-structured storage for binary large objects (blobs): images, videos, VM disk images
Azure Cosmos DB
• Azure Cosmos DB is a multi-model NoSQL database management system.
• Cosmos DB manages data as a partitioned set of documents.
• A document is a collection of fields, identified by a key.
• The fields in each document can vary, and a field can contain child documents.
• Uses partition keys for high performance/query optimization.

Example:

## Document 1 ##
{
  "customerID": "101",
  "name":
  {
    "first": "Piyush",
    "last": "Sachdeva"
  }
}

## Document 2 ##
{
  "customerID": "102",
  "name":
  {
    "title": "Mr",
    "firstname": "Piyush",
    "lastname": "Sachdeva"
  }
}
Cosmos DB APIs
• SQL API: enables you to run SQL queries over JSON data (see the query sketch after this list).

• Table API: enables you to use the Azure Table Storage API to store and retrieve documents.

• MongoDB API: many organizations run MongoDB (a document-based DB) on-premises. You can use the
MongoDB API for Cosmos DB to run a MongoDB application unchanged against a Cosmos DB database,
or to migrate MongoDB to Cosmos DB in the cloud.

• Cassandra API: Cassandra is a column-based DBMS; the primary purpose of the Cassandra API is to
enable you to quickly migrate Cassandra databases and applications to Cosmos DB.

• Gremlin API: implements a graph database interface to Cosmos DB. A graph is a collection of data
objects (nodes) and directed relationships (edges). Data is still held as a set of documents in
Cosmos DB, but the Gremlin API enables you to perform graph queries over the data.
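A minimal sketch of a SQL API query over the sample documents shown earlier; c is the conventional alias for the container:

SELECT c.name.first
FROM c
WHERE c.customerID = "101"   -- returns {"first": "Piyush"} from Document 1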
Azure Table Storage
• Azure Table Storage implements the NoSQL key-value model.
• In this model, the data for an item is stored as a set of fields, and the item is identified by a unique key.
• Items are referred to as rows, and fields are known as columns.
• Unlike an RDBMS, it allows you to store unstructured data.
• Simple to scale, and allows up to 5 PB of data.
• Fast read/write compared to a relational DB; use the partition key to increase performance.
• Row insertion and data retrieval are fast.
Azure Blob Storage
Azure Blob Storage is a service that enables you to store massive amounts of unstructured data, or
blobs, in the cloud.
Many applications need to store large binary data objects, such as images, videos, virtual machine
images, and so on. These are called blobs.
Inside an Azure storage account, you create blobs inside containers (folders). You can group similar
blobs together in a container.

Block Blobs
• A set of blocks; each block can vary in size, up to 100 MB

Page Blobs
• A collection of fixed-size pages, 512 bytes each
• Supports random read/write

Append Blobs
• Optimized to support append operations
• You can only add blocks to the end of an append blob
• Updating or deleting existing blocks is not supported
Azure Blob Storage: Access Tiers

Hot Tier
• The default tier
• Used for frequently accessed data
• Provides the highest performance
• Costliest among the three

Cool Tier
• Used for infrequent data access
• Cheaper than the Hot tier
• Lower performance than the Hot tier
• You can migrate your storage from the Hot to the Cool tier to save storage cost

Archive Tier
• Used for archival storage
• Cheapest among all, with the highest latency; data retrieval can take hours
• To retrieve a blob from the Archive tier, you must change the access tier to Hot or Cool;
the blob is then rehydrated, and you can read it only when the rehydration process is complete
Azure File Storage
Azure File Storage enables you to create file shares in the cloud and access them from anywhere
with an internet connection.

Azure File Storage exposes file shares using the Server Message Block 3.0 (SMB) protocol.

Once you've created a storage account, you can upload files to Azure File Storage using the
Azure portal, or tools such as the AzCopy utility.

Azure File Storage offers two performance tiers:
• The Standard tier uses hard disk-based hardware in a datacenter.
• The Premium tier uses solid-state disks. It offers greater throughput, but is charged at a higher rate.
Which NoSQL store is suitable for what?

• Column-based (Cosmos DB Cassandra API): when you need low latency; time-series, session details, telemetry data, analytics
• Object-based (Azure Blob Storage): storing unstructured data or blobs
• Graph-based (Cosmos DB Gremlin API): when you need to define relationships in the form of graphs
• Key-value (Azure Table Storage, Cosmos DB Table API): data accessed using a single key; used for caching, user profile management, session management
• Document (Cosmos DB SQL API): JSON documents for content/inventory management, product catalogs
• File share in the cloud over the SMB 3.0 protocol: Azure File Storage
Analytics workload on Azure (25-30%)
Data Analytics Core Concepts
Data analytics is concerned with examining, transforming, and arranging data so that you can study it
and extract useful information.

Data analytics stages: Ingestion → Processing → Visualization

• Ingestion: taking the data from multiple sources into your processing system.
• Processing: transforming the data into a more meaningful form.
• Visualization: graphical representation of the processed data in the form of graphs, diagrams,
charts, maps, etc., for reporting and business intelligence purposes.
ETL vs ELT

ETL (Extract, Transform and Load): Extract → Transform → Load
• Data is transformed during ingestion, before it reaches the target data store.
• Typical transformations: aggregating, filtering, joining, sorting, cleaning, de-duplication, validation.

ELT (Extract, Load and Transform): Extract → Load → Transform
• The target data store is a data warehouse, using either a Hadoop cluster or Azure Synapse Analytics.
• The target data store should be powerful enough to transform the data.
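A minimal ELT-style sketch of the transform step running inside the target data store, assuming hypothetical stg_sales (raw, already loaded) and sales_clean (target) tables:

-- De-duplicate, clean, and validate after loading, using the warehouse's own compute
INSERT INTO sales_clean (order_id, customer_name, amount)
SELECT DISTINCT order_id, TRIM(customer_name), amount
FROM stg_sales
WHERE amount IS NOT NULL;   -- validation/filtering step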
Data Analytics Techniques

1. Descriptive: what has happened, based on historical data.
   Examples: sales reports, profit and loss statements, quarterly earnings reports.
2. Diagnostic: why things happened.
   Examples: comparison reports, drill-down reports.
3. Prescriptive: what actions we should take to achieve a target.
   Examples: recommendations, suggestions, advice on the best approach.
4. Predictive: what will happen in the future, based on past trends.
   Examples: forecasting reports.
5. Cognitive: what might happen if circumstances change (AI/ML).
   Examples: self-driving cars, video-to-audio conversion, audio transcription.
Azure Tools for Data Analytics
• ARM Template: to automate Azure resource provisioning (IaC)
• Azure CLI: command-line tool to interact with Azure resources
• Azure Data Studio: execute queries on SQL Server/big data clusters, restore a DB, execute admin
tasks via sqlcmd/PowerShell, create and run SQL notebooks
• SSMS (SQL Server Management Studio): complex admin tasks, platform configuration, security
management, user management, vulnerability assessment, performance tuning, querying Synapse Analytics
• sqlcmd: command-line SQL utility
Data Warehousing
• A central repository of data collected from one or more sources
• Holds current and historical data used for reporting and analysis
• Columns can be renamed or reformatted to make it easier for users to create reports
• Users can run reports without affecting the day-to-day business

When to use data warehousing
• When queries are long-running and affect day-to-day operations
• When data needs further processing (ETL or ELT)
• When you want to archive data (remove historical data from the day-to-day system)
• When you need to integrate data from multiple sources
Data Warehousing Flow
Ingestion (CosmosDB, Azure Data Lake, Table Storage, on-prem DB)
→ Storage and pre-processing (Azure Synapse Analytics)
→ Analysis (Azure Analysis Services)
→ Visualization (Power BI)
Azure Data Factory provides the orchestration pipeline across these stages.
Azure Data Services for Data Warehousing
Azure Data Factory
Azure Data Factory is described as a data integration service. It is responsible for the collection,
transformation, and storage of data collected from multiple sources.

Pipeline: a logical grouping of activities that performs some task.
• A data factory can contain multiple pipelines
• Activities can run sequentially or in parallel

Pipeline Triggers
• Scheduled trigger
• Tumbling window (runs on a schedule against historical data)
• Event-based
• Manual
Azure Data Lake Storage
• A data lake is a repository for large quantities of raw data.
• You can think of a data lake as a staging point for your ingested data, before it's transported and
converted into a format suitable for performing analytics.
• Data Lake Storage organizes your files into directories and subdirectories for improved file
organization (hierarchical namespace).
• Compatible with HDFS (Hadoop Distributed File System), used to examine huge datasets.
• Supports role-based access control (RBAC) and POSIX-style access control lists on your data at the
file and directory level.

Flow: data sources (CosmosDB, Table Storage, on-prem DB) → Azure Data Factory (data ingestion) → Azure Data Lake (storage)
• To implement Azure Data Lake, you need a storage account.
• Data is commonly stored in formats such as Parquet.
Azure Databricks
• Azure Databricks is an Apache Spark environment running on Azure, providing big data processing,
streaming, and machine learning.
• Can consume and process large amounts of data very quickly.
• Also supports structured stream processing: in this model, Databricks performs your computations
incrementally and continuously updates the result as streaming data arrives.
• Provides a graphical user interface where you can define and test your processing step by step,
before submitting it as a set of batch tasks.
Azure Synapse Analytics
• You can ingest data from external sources, such as flat files, Azure Data Lake, or other database
management systems, and then transform and aggregate this data into a format suitable for
analytics processing.
• You can perform complex queries over this data and generate reports, graphs, and charts.
• It stores and processes the data locally for faster processing. This approach enables you to
repeatedly query the same data without the overhead of fetching and converting it each time.
• You can also use this data as input to further analytical processing, using Azure Analysis Services.
• Azure Synapse Analytics leverages a massively parallel processing (MPP) architecture, which
includes a control node and a pool of compute nodes.
• You can pause Azure Synapse Analytics to reduce cost.

Azure Synapse Analytics flow
• The control node receives processing requests from applications and distributes the work evenly
across the compute nodes for parallel processing.
• Results from each compute node are then sent back to the control node, where they are combined
into an overall result.
• It supports two computational models: SQL pools and Spark pools.
• In a SQL pool, each compute node uses an Azure SQL Database and Azure Storage to handle a portion
of the data. To receive data from multiple sources, it uses a technology called PolyBase. A SQL
pool uses storage because it is a disk-based processing engine, and it supports manual node scaling.
• Spark pools are optimized for in-memory processing, and you can enable autoscaling of nodes.
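A hedged sketch of the PolyBase pattern in a Synapse dedicated SQL pool; the names ext_sales, ext_ds, and ext_ff are hypothetical, and the external data source and file format are assumed to be defined beforehand:

-- Expose files in the data lake as an external table via PolyBase
CREATE EXTERNAL TABLE ext_sales (
    order_id INT,
    amount DECIMAL(10, 2)
)
WITH (
    LOCATION = '/staging/sales/',
    DATA_SOURCE = ext_ds,    -- external data source defined earlier
    FILE_FORMAT = ext_ff     -- e.g., a Parquet file format object
);

-- Copy it into local storage for fast, repeated querying (CTAS)
CREATE TABLE sales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext_sales;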
Azure Analysis Services
• Azure Analysis Services enables you to build tabular models to support OLAP queries.
• You can combine data from multiple sources, including Azure SQL Database, Azure Synapse Analytics,
Azure Data Lake Store, Azure Cosmos DB, and many others.
• You use these data sources to build models. A model is essentially a set of queries and expressions
that retrieve data from the various data sources and generate results.
• Analysis Services includes a graphical designer to help you connect data sources together and
define queries that combine, filter, and aggregate data.

Recommended usage
If you have large amounts of ingested data that require preprocessing, you can use Synapse Analytics
to process the data and reduce it to smaller datasets, which can then be analyzed by Azure Analysis Services.
Azure HDInsight
• Azure HDInsight is a big data processing service that provides the platform for technologies such
as Spark in an Azure environment.
• HDInsight implements a clustered model that distributes processing across a set of computers.
• This model is similar to the one used by Synapse Analytics, except that the nodes run the Spark
processing engine rather than Azure SQL Database.
• Data processing: data is broken down and distributed across the nodes for processing.
• You can create, load, and query the data in a manner similar to PolyBase (see the sketch below).
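A hedged HiveQL sketch of that create/load/query pattern as it might run on an HDInsight cluster; the logs table and storage path are hypothetical:

-- Point an external table at delimited files already in cluster storage
CREATE EXTERNAL TABLE logs (ts STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/example/data/logs';

-- Query it with SQL-like syntax; the work is distributed across the cluster
SELECT COUNT(*) FROM logs;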
Data Ingestion using Data Factory
• Azure Data Factory is a data ingestion and transformation service that allows you to load raw data
from many different sources, both on-premises and in the cloud.
• Data Factory can clean, transform, and restructure the data before loading it into a repository
such as a data warehouse.
• Once the data is in the data warehouse, you can analyze it.
• Azure Data Factory uses several different resources: linked services, datasets, and pipelines.

Flow: data sources (CosmosDB, Azure Data Lake, Table Storage, on-prem DB) → Azure Data Factory → Azure Analysis Services (analysis)
Linked Services
Data Factory moves data from a data source to a destination. A linked service provides the
information needed for Data Factory to connect to a source or destination.

Datasets
A dataset in Azure Data Factory represents the data that you want to ingest (input) or store (output).
If your data has a structure, the dataset specifies how it is structured. For example, if you are
using blob storage as input, the dataset would specify which blob to ingest and the format of the
information in the blob (binary data, JSON, delimited text, and so on).

Control Flow
Orchestrates the activities in a pipeline.

Integration Runtime
The compute environment for a pipeline.

Trigger
Initiates the pipeline.

Mapping Data Flow
Data flows allow data engineers to develop data transformation logic without writing code.
Power BI
• Data visualization service that lets you generate dashboards, graphs, and reports.
• Can consume data from various data sources to create interactive visualizations.
• Workflow: Create → Share → Consume

Parts of Power BI (building blocks)
• Visualizations
• Datasets
• Reports
• Dashboards
• Tiles

Reports in Power BI

Paginated report
• Static report, formatted, printed and shared
• Contains data on multiple pages
• Use Power BI Report Builder to create the paginated report
• Share the report via the Power BI service

Interactive report
• Viewed on screen
• Customized to your requirements
• More visuals; makes use of 'hover'
• Users can change the layout of the design
• Use Power BI Report Server to serve the interactive reports (Premium)
Power BI content workflow
• Connect: connect to the data source that has your data
• Pull: pull what you need into the data model
• Edit: edit and transform the data as you need
• Build: build reports using Power BI Desktop
• Share: share the report