
09 - Azure Data Engineering Cheatsheet

Uploaded by

ancgate
Copyright
© All Rights Reserved

Azure Data Engineering Cheatsheet

Introduction to Azure Data Engineering
Azure data engineering serves data professionals of all levels. Azure
provides a comprehensive suite of services to help you manage, store,
and analyze data in the cloud.
Azure Data Platform Overview

Azure Data Platform Overview
Azure Data Platform is a cloud-based data engineering platform that provides a comprehensive suite of tools and services for data storage, processing, and analytics.

Data Storage
Azure Data Platform offers a variety of storage options, including Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database.

Data Processing
Azure Data Platform provides powerful data processing capabilities with Azure Databricks, Azure HDInsight, and Azure Stream Analytics.

Data Analytics
Azure Data Platform offers a range of analytics services, including Azure Machine Learning, Azure Cognitive Services, and Power BI.

Azure Data Platform provides a comprehensive suite of tools and services for data storage,
processing, and analytics to help you get started with Azure Data Engineering.
Azure for Beginners: Your First Steps

Azure is easy for beginners to get started with.


*Microsoft Azure Documentation
Beginners Tools

Azure SQL Database
This is a fully managed relational database with auto-scale, integral intelligence, and robust security.

Azure Synapse Analytics
This is an analytics service that brings together enterprise data warehousing and Big Data analytics.

Azure Machine Learning
This service allows you to build, train, and deploy machine learning models using Azure's scalable infrastructure.

Azure Cosmos DB
This is a globally distributed, multi-model database service for managing data at large scale, with built-in support for NoSQL.

Azure Stream Analytics
This is a real-time analytics service designed to help you analyze streaming data from devices, sensors, infrastructure, and applications.

Azure Event Hub
This is a big data streaming platform and event ingestion service that can receive and process millions of events per second.

Azure Data Lake Storage
This is a secure, scalable, and cost-effective data lake that allows you to store and analyze large amounts of data.

Azure HDInsight
This is a fully-managed cloud service that makes it easy, fast, and cost-effective to process big data using popular open-source frameworks such as Hadoop, Spark, and Hive.

Azure Power BI
This is a business analytics tool that delivers insights throughout your organization. Connect to hundreds of data sources, simplify data prep, and drive ad hoc analysis.

Azure Data Factory
This is a cloud-based data integration service that orchestrates and automates the movement and transformation of data at scale.
Intermediate

Azure Synapse Analytics:
Accelerates time to insight across data warehouses and big data systems.

Azure Functions:
Serverless offering for writing code that responds to events.

Azure Analysis Services:
Enterprise grade analytics as a service.

Azure Stream Analytics:
Real-time event processing engine that can process millions of events per second.

Azure Logic Apps:
Cloud service for scheduling, automating, and orchestrating tasks.

Azure Data Catalog:
Fully managed service for registering and discovering enterprise data sources.

Azure HDInsight:
Fully-managed cloud service that simplifies big data analytics.

Azure Event Grid:
Service for building applications with event-based architectures.

Azure Monitor and Azure Application Insights:
Tools for collecting, analyzing, and acting on telemetry from cloud and on-premises environments.
Advanced Tools

Azure Databricks
Optimizing big data solutions on Azure, handling complex transformations and operations, and integrating with other Azure services.

Azure Machine Learning
Understanding the basics of machine learning, implementing models using Azure ML, and integrating Azure ML with other services.

Azure Event Hub
Working with a big data streaming platform and event ingestion service that can receive and process millions of events per second.

Azure Monitor and Azure Application Insights
Setting up monitoring and alerting for your data solutions, and being comfortable analyzing logs to troubleshoot issues.

Azure Synapse Analytics
Designing and implementing advanced analytics solutions using Azure Synapse Analytics.

Azure IoT Hub
Understanding the principles of IoT data collection, analysis, and processing.

Azure Data Factory
Developing complex ETL pipelines that integrate multiple sources and services, using data flow transformations and advanced scheduling capabilities.

Azure Stream Analytics
Dealing with real-time data, designing solutions using Stream Analytics, and managing data streams using Event Hubs.
Azure SQL Database

Fully managed service
Azure handles most of the database management functions such as upgrading, patching, backups, monitoring.

Auto scaling capabilities
Database can scale compute resources up or down automatically based on workload.

Built-in intelligence
Provides performance monitoring, security alerts, and automated tuning using machine learning capabilities.

Enterprise-grade security
Includes encryption, firewalls, role-based access, and compliance with standards.

Azure SQL Database is a fully managed cloud database service that provides a high performance
and secure relational database platform without the management overhead.
Azure Cosmos DB

Globally distributed
Data is replicated across any number of Azure regions for low latency and high availability.

Multi-model
Supports document, key-value, graph, and column-family data models.

Elastically scalable
Storage and throughput scale on demand and independent of each other.

Azure Cosmos DB is a fully managed database service that provides turnkey global
distribution, elastic scaling, single-digit millisecond latency, five well-defined
consistency models and guaranteed high availability.
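
The five consistency levels mentioned above are commonly listed, from strongest to weakest, as strong, bounded staleness, session, consistent prefix, and eventual. The sketch below only models that ordering as a plain Python enum; the helper `at_least_as_strong` is a hypothetical name for illustration and is not part of any Cosmos DB SDK.

```python
from enum import IntEnum


class ConsistencyLevel(IntEnum):
    """Cosmos DB consistency levels, numbered weakest (1) to strongest (5)."""
    EVENTUAL = 1
    CONSISTENT_PREFIX = 2
    SESSION = 3
    BOUNDED_STALENESS = 4
    STRONG = 5


def at_least_as_strong(level, required):
    # Hypothetical helper: True when `level` gives guarantees
    # at least as strong as `required`.
    return level >= required


# List the levels from strongest to weakest.
for level in sorted(ConsistencyLevel, reverse=True):
    print(level.name)
```

Stronger levels generally trade latency and availability for fresher reads, so a workload would normally pick the weakest level that still meets its correctness needs.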
Azure Data Lake Storage

Secure
Built-in encryption and access controls.

Scalable
Store petabytes of data.

Cost-effective
Pay per use pricing model.

Azure Data Lake Storage is an enterprise-wide hyperscale repository for big data analytics workloads.
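
As an illustration of how data in the lake is typically addressed, the sketch below builds a date-partitioned folder path. It assumes Azure Data Lake Storage Gen2 and its `abfss://` URI scheme; the account, container, and dataset names are placeholders, and the `year=/month=/day=` layout is one common convention rather than a requirement.

```python
from datetime import date


def adls_path(account, container, dataset, day):
    """Build an abfss:// folder URI partitioned by year/month/day.

    All names passed in are illustrative placeholders.
    """
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{dataset}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


print(adls_path("examplelake", "raw", "sales", date(2024, 3, 7)))
```

Partitioning folders by date lets downstream engines read only the days a query needs instead of scanning the whole dataset.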
Azure Databricks

Fast analytics
Perform blazing fast analytics and machine learning with Apache Spark.

Easy to use
Simplified workflow and notebooks make it easy to work with data.

Collaborative
Integrated tools like GitHub, Jira and MLflow enable collaboration.

Azure Databricks provides a fast, easy and collaborative Apache Spark analytics service
optimized for data science and data engineering workloads.
Azure Databricks
Azure Databricks provides a managed Apache Spark
environment on Azure that can handle large-scale data
processing. As an advanced data engineer, you should know
how to optimize Spark jobs and clusters on Databricks for
performance. This includes tuning Spark memory,
partitioning data effectively, and choosing optimal file
formats. You should also know how to integrate Databricks
with other Azure services like Azure Storage, Azure Synapse
Analytics, Azure Data Factory, and Azure Machine Learning
to build complete data pipelines and machine learning
workflows.
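
To make "partitioning data effectively" concrete, the sketch below shows in plain Python (not Spark) how hash partitioning spreads keys across partitions and how a single hot key produces skew. The function names and the CRC32-based hash are assumptions chosen for a deterministic illustration; Spark uses its own partitioners.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    # Deterministic hash partitioning (illustrative, not Spark's partitioner).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def skew_ratio(keys, num_partitions):
    """Largest partition size divided by the mean partition size."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    sizes = [counts.get(p, 0) for p in range(num_partitions)]
    return max(sizes) / (sum(sizes) / num_partitions)


balanced = [f"customer-{i}" for i in range(10_000)]
# One hot key accounts for 90% of the rows.
skewed = ["customer-0"] * 9_000 + balanced[:1_000]

print("balanced:", round(skew_ratio(balanced, 8), 2))
print("skewed:  ", round(skew_ratio(skewed, 8), 2))
```

A ratio near 1 means work is spread evenly. A high ratio means one task processes most of the data, which is the situation that repartitioning or key salting is meant to address.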
Azure Data Factory

Data movement
Copy and transform data from over 80 data sources. Ability to schedule and monitor data transfers.

Data integration
Orchestrate data flows to join, aggregate and enrich data from disparate sources.

Monitoring
Track data pipelines and get alerts for failures and bottlenecks.

Azure Data Factory is a cloud-based data integration service that can orchestrate and
automate the movement and transformation of data at scale.
Azure Data Factory

Design complex ETL workflows
Use Azure Data Factory to design and implement advanced ETL pipelines that integrate data from multiple sources like databases, file storage, SaaS applications etc.

Leverage data flows
Utilize data flow functionality in ADF to enable mapping data sources to sinks, applying transformations, and improving pipeline performance.

Schedule and orchestrate pipelines
Use advanced scheduling capabilities of ADF to define triggers, windows, dependencies etc. to orchestrate execution of pipelines.

Monitor and manage pipelines
Monitor pipeline runs, dataset dependencies, failure points etc. in ADF. Set alerts, diagnose issues, and take corrective actions.
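
The dependency-driven orchestration described above can be sketched as follows. The dictionary mirrors the general shape of an ADF pipeline definition (activities with a `dependsOn` list), but it is heavily trimmed: real definitions also carry inputs, outputs, and type properties. The pipeline and activity names are hypothetical, and `execution_order` is an illustrative helper, not an ADF API.

```python
from graphlib import TopologicalSorter

# Trimmed, illustrative pipeline definition; names are made up.
pipeline = {
    "name": "ExampleEtlPipeline",
    "properties": {
        "activities": [
            {"name": "CopyOrders", "type": "Copy", "dependsOn": []},
            {"name": "CopyCustomers", "type": "Copy", "dependsOn": []},
            {
                "name": "JoinAndAggregate",
                "type": "ExecuteDataFlow",
                "dependsOn": [
                    {"activity": "CopyOrders", "dependencyConditions": ["Succeeded"]},
                    {"activity": "CopyCustomers", "dependencyConditions": ["Succeeded"]},
                ],
            },
            {
                "name": "LoadWarehouse",
                "type": "Copy",
                "dependsOn": [
                    {"activity": "JoinAndAggregate", "dependencyConditions": ["Succeeded"]},
                ],
            },
        ]
    },
}


def execution_order(definition):
    """Return one valid run order that respects every dependsOn edge."""
    graph = {
        activity["name"]: {dep["activity"] for dep in activity["dependsOn"]}
        for activity in definition["properties"]["activities"]
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(pipeline)
print(order)
```

The two copy activities have no dependencies, so a scheduler is free to run them in parallel; the join waits for both, and the load waits for the join.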
Azure Synapse Analytics

Combines SQL analytics
Provides the ability to query data using both SQL and Apache Spark.

Scalable
Allows you to independently scale compute and storage as needed.

Enterprise-ready
Built for enterprise workloads with mission-critical capabilities.

Azure Synapse Analytics brings together the best of SQL and Apache Spark analytics into an enterprise-ready service.
Azure Synapse Analytics

• Star Schema Design
Design star schema data models to optimize for fast queries and analytics.

• Workload Isolation
Isolate workloads using dedicated SQL pools for predictable performance.

• Columnstore Indexes
Leverage columnstore indexes for faster performance on large data volumes.

• Query Tuning
Tune and optimize complex queries for best performance.

• Result Set Caching
Implement result set caching to reduce query times for repeated queries.
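
A star schema places one fact table at the center and joins it to small dimension tables. The sketch below illustrates the shape of such a model and a typical aggregate query. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; it is not Synapse, and the table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Helmets');
INSERT INTO dim_date VALUES (20240101, 2024), (20250101, 2025);
INSERT INTO fact_sales VALUES
    (1, 20240101, 500.0), (2, 20240101, 40.0), (1, 20250101, 650.0);
""")

# Typical star-schema query: join the fact table to its dimensions,
# then aggregate by dimension attributes.
rows = conn.execute("""
SELECT p.category, d.year, SUM(f.amount) AS total
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_key = f.product_key
JOIN dim_date AS d ON d.date_key = f.date_key
GROUP BY p.category, d.year
ORDER BY p.category, d.year
""").fetchall()

for row in rows:
    print(row)
```

Keeping descriptive attributes in the dimensions and only keys and measures in the fact table keeps the large table narrow, which is also what makes columnar storage effective.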
Azure Synapse Analytics
Azure Synapse Analytics is an integrated analytics service by Microsoft
Azure that provides enterprise-grade data warehousing. It allows you to
analyze data across both big data and traditional data systems to gain
quick insights.
Azure Stream Analytics

Real-time analytics
Processes streaming data to enable real-time analytics and insights.

Multiple data sources
Ingests data from multiple sources like IoT devices, websites, etc.

Built-in machine learning
Has built-in machine learning models for anomaly detection and pattern recognition.

Serverless
Fully managed, serverless service with auto-scaling.

Azure Stream Analytics is a highly scalable real-time analytics service to gain actionable insights from streaming data.
Azure Stream Analytics

Overview
Azure Stream Analytics is a real-time event processing engine that can process millions of events per second for real-time insights.

Key Capabilities
- Process real-time data streams from multiple sources like IoT devices, websites, etc.
- Write SQL-like queries to transform data streams
- Integrate with other Azure services for visualization, storage, etc.

Benefits
- Get real-time insights and react quickly
- Process large volumes of data with low latency
- Save on infrastructure costs with serverless implementation
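
Stream queries typically aggregate events over time windows. The sketch below illustrates the idea of a tumbling (fixed-size, non-overlapping) window in plain Python. It is a conceptual model, not the Stream Analytics query language: the function name and event layout are assumptions, and windows here are keyed by their start time and treated as half-open `[start, start + size)`, which may differ from how the service labels windows.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_seconds):
    """Average value per (window_start, device) over fixed-size windows.

    `events` is an iterable of (epoch_seconds, device_id, value) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for ts, device, value in events:
        # Every timestamp belongs to exactly one window.
        window_start = ts - (ts % window_seconds)
        acc = sums[(window_start, device)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


events = [
    (0, "sensor-a", 20.0),
    (4, "sensor-a", 22.0),
    (9, "sensor-b", 30.0),
    (10, "sensor-a", 26.0),
]
result = tumbling_window_avg(events, 10)
for key, avg in result.items():
    print(key, avg)
```

Because tumbling windows never overlap, each event contributes to exactly one output row per grouping key.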
Azure HDInsight

Managed cluster service
Provisions HDInsight clusters to process big data.

Open source frameworks
Supports Hadoop, Spark, Hive, LLAP, Kafka.

Integrates with other Azure services
Combines with Data Lake Store, Data Factory, and more.

Azure HDInsight provides a fully-managed Hadoop service to easily process large datasets using popular big data frameworks.
Azure HDInsight

[Figure: HDInsight architecture diagram. Diagram showing the architecture and components of Azure HDInsight.]
[Figure: HDInsight dashboard screenshot. Screenshot of the HDInsight dashboard showing cluster monitoring and management.]
[Figure: Spark processing on HDInsight diagram. Diagram illustrating running Spark jobs on HDInsight Spark clusters.]
Azure Functions

• Serverless compute
• Event-driven scale
• Microservices architecture
• Integrations and workflows
Azure Logic Apps

Introduction
Azure Logic Apps is a cloud service to automate workflows and business processes by integrating enterprise systems and services.

Key Capabilities
Workflow automation, integration, scheduling, scalability and security.

Use Cases
Order processing, ETL pipelines, business process automation across applications.

Benefits
Increased productivity, lower costs, improved agility, faster time to market.
Azure Event Grid

• Event handling service
• Handles events from any source
• Sends events to any destination
• Used for event-based architectures
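
Routing events from a source to a destination usually comes down to filtering on event type and subject. The sketch below models that filtering in plain Python. The filter field names (`includedEventTypes`, `subjectBeginsWith`, `subjectEndsWith`) follow Event Grid's subscription filter vocabulary, but the `matches` function and the sample event are illustrative assumptions, not the service's implementation.

```python
def matches(event, subscription):
    """Return True when the event passes the subscription's filters."""
    included = subscription.get("includedEventTypes")
    if included and event["eventType"] not in included:
        return False
    subject = event["subject"]
    # Missing prefix/suffix filters default to "", which matches everything.
    return subject.startswith(subscription.get("subjectBeginsWith", "")) and \
        subject.endswith(subscription.get("subjectEndsWith", ""))


# Illustrative event for a blob landing in a "raw" container.
event = {
    "id": "1",
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/raw/blobs/sales/2024/file.csv",
    "data": {},
    "dataVersion": "1.0",
}

csv_subscription = {
    "includedEventTypes": ["Microsoft.Storage.BlobCreated"],
    "subjectBeginsWith": "/blobServices/default/containers/raw/",
    "subjectEndsWith": ".csv",
}
json_subscription = {"subjectEndsWith": ".json"}

print(matches(event, csv_subscription))
print(matches(event, json_subscription))
```

A pattern like this is how a new file in storage can trigger only the pipeline that cares about that folder and file type.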
Azure Analysis Services
Azure Analysis Services is an enterprise-grade analytics service that enables
you to govern, deploy, and deliver business intelligence solutions. It provides
enterprise-grade data modeling with the scalability and performance of
the cloud.
Azure Data Catalog

Metadata and discovery service
Azure Data Catalog is a fully managed service that serves as a system of registration and system of discovery for enterprise data sources.

Register data sources
You can register data sources like SQL Server, Oracle, Blob Storage, etc. to be indexed by Data Catalog.

Discover data sources
Users can search and filter the catalog to discover data sources relevant to their needs.

Annotations and metadata
Users can annotate data sources with descriptions, tags, documentation, and other metadata.

Data lineage
Data Catalog provides visibility into data sources and their relationships across the enterprise data estate.
Azure Monitor and Application Insights

[Figure: Azure Monitor dashboard showing metrics. Azure Monitor provides a dashboard to visualize metrics across cloud resources.]
[Figure: Application map in Application Insights. Application Insights generates a visual map of all components of your application.]
[Figure: Analyzing telemetry in Application Insights. Application Insights enables analysis of telemetry like requests, dependencies, logs etc.]
Azure Stream Analytics and Event Hubs

• Implementing real-time data processing pipelines with Stream Analytics
• Ingesting and managing high volume data streams with Event Hubs
• Integrating Stream Analytics with other Azure services
• Monitoring and troubleshooting Stream Analytics jobs
Azure Machine Learning
Azure Machine Learning allows you to build, train, and
deploy machine learning models in Azure. As an
advanced data engineer, you should know how to
leverage Azure ML to implement predictive models and
integrate them with other Azure data services like
Azure Synapse Analytics and Azure Databricks for
advanced analytics pipelines.
Azure Monitor and Log Analytics

Set up metrics and alerts
Configure metrics and alerts in Azure Monitor to get notified about critical events and anomalies.

Use Log Analytics workspaces
Collect logs and telemetry from services and applications into Log Analytics workspaces for analysis.

Analyze logs with queries
Write Kusto queries to analyze collected log data and troubleshoot issues.

Visualize insights with dashboards
Create custom dashboards in Azure Portal to visualize monitoring data.

Integrate with other services
Use Azure Monitor data in other services like Azure Data Explorer for advanced analytics.

Manage solutions and workbooks
Install pre-built monitoring solutions and customize Azure Monitor workbooks.
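
The logic behind a simple metric alert can be sketched as a threshold check over a recent window of samples. The example below is a conceptual model in plain Python, not the Azure Monitor API; the function name, the sample values, and the choice of an average aggregation are assumptions made for illustration.

```python
from statistics import mean


def should_alert(samples, threshold, window):
    """Fire when the mean of the last `window` samples exceeds `threshold`.

    Returns False until enough samples have been collected.
    """
    if len(samples) < window:
        return False
    return mean(samples[-window:]) > threshold


# Hypothetical pipeline run durations in seconds, one per run.
durations = [110, 120, 115, 300, 320, 310]

print(should_alert(durations[:3], threshold=200, window=3))
print(should_alert(durations, threshold=200, window=3))
```

Averaging over a window rather than alerting on a single sample reduces noise from one-off spikes, at the cost of reacting slightly later.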
Azure IoT Hub

Device data collection
Azure IoT Hub enables bi-directional communication between IoT devices and the cloud. It allows reliable and secure data ingestion from millions of devices.

Device management
Azure IoT Hub provides capabilities to manage IoT devices including device provisioning, configuration, monitoring and software updates.

Data analysis
Azure IoT Hub integrates seamlessly with other Azure services like Stream Analytics, Time Series Insights, Azure Machine Learning to enable real-time data analysis and insights.

Data processing
Azure IoT Hub routes device data to services like Azure Functions, Event Grid, Logic Apps, Storage for processing, analysis and integration with business systems.

Security
Azure IoT Hub implements security standards like TLS, SAS tokens, X.509 certificates to secure device-to-cloud and cloud-to-device communication.

Visualization
Power BI and Time Series Insights can be used to visualize real-time device data and insights coming from Azure IoT Hub.
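
The SAS tokens mentioned under Security are signed strings that a device presents instead of its raw key. The sketch below follows the commonly documented scheme: an HMAC-SHA256 signature over the URL-encoded resource URI and an expiry time. The hub name, device ID, and key are placeholders, the `now` parameter is an assumption added to make the output reproducible, and the exact format should be checked against current Azure documentation before use.

```python
import base64
import hashlib
import hmac
import time
from urllib import parse


def generate_sas_token(uri, key, policy_name=None, ttl_seconds=3600, now=None):
    """Build a SharedAccessSignature token for `uri` signed with base64 `key`."""
    expiry = int((time.time() if now is None else now) + ttl_seconds)
    # String to sign: URL-encoded resource URI, newline, expiry timestamp.
    to_sign = f"{parse.quote_plus(uri)}\n{expiry}".encode("utf-8")
    digest = hmac.new(base64.b64decode(key), to_sign, hashlib.sha256).digest()
    fields = {
        "sr": uri,
        "sig": base64.b64encode(digest).decode("ascii"),
        "se": str(expiry),
    }
    if policy_name is not None:
        fields["skn"] = policy_name
    return "SharedAccessSignature " + parse.urlencode(fields)


# Placeholder values only; never hard-code real keys.
example_key = base64.b64encode(b"not-a-real-key").decode("ascii")
token = generate_sas_token(
    "example-hub.azure-devices.net/devices/device-1", example_key, now=1000
)
print(token)
```

Short expiry times limit the damage if a token leaks, since the token stops working without any change to the underlying key.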
Security and Compliance

• Implement Role-Based Access Control (RBAC)
Use RBAC to grant users and applications least privilege access to resources.

• Utilize Azure Key Vault
Securely store keys, passwords, certificates for access control.

• Enable Azure Policy
Define and enforce organization-wide policies and standards.

• Enable Azure Security Center
Get security recommendations and threat protection.

• Encrypt data at rest and in transit
Use encryption, certificates, and secure protocols like HTTPS to protect data.

• Comply with regulations
Leverage Azure compliance offerings like HIPAA, PCI DSS, FedRAMP.
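
Least-privilege access under RBAC can be reasoned about as three questions: who is asking, which actions their role allows, and at what scope the role was assigned. The sketch below models that check in plain Python. The role names, action strings, and scopes are all hypothetical and simplified; they are not Azure built-in roles or real resource provider operations.

```python
from fnmatch import fnmatchcase

# Hypothetical role definitions: role name -> allowed action patterns.
ROLES = {
    "reader": ["*/read"],
    "blob-writer": [
        "storage/containers/blobs/read",
        "storage/containers/blobs/write",
    ],
}

# Hypothetical role assignments: a principal gets a role at a scope.
ASSIGNMENTS = [
    {"principal": "etl-service", "role": "blob-writer",
     "scope": "/subscriptions/sub-1/resourceGroups/data"},
    {"principal": "analyst", "role": "reader",
     "scope": "/subscriptions/sub-1"},
]


def is_allowed(principal, action, resource):
    """True when some assignment covers the resource and permits the action."""
    for assignment in ASSIGNMENTS:
        if assignment["principal"] != principal:
            continue
        scope = assignment["scope"]
        # An assignment applies to its scope and everything beneath it.
        if not (resource == scope or resource.startswith(scope + "/")):
            continue
        if any(fnmatchcase(action, p) for p in ROLES[assignment["role"]]):
            return True
    return False


lake = "/subscriptions/sub-1/resourceGroups/data/lake"
print(is_allowed("etl-service", "storage/containers/blobs/write", lake))
print(is_allowed("analyst", "storage/containers/blobs/write", lake))
```

Granting the ETL service write access only within one resource group, and analysts read access only, keeps each principal limited to what its job requires.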
Optimization
[Chart: Monthly cost estimates for different Azure service tiers. Categories: Basic Tier, Standard Tier, Premium Tier. Axis ticks: 500, 1,000, 2,000.]


Integration with DevOps

• Implement CI/CD pipelines
• Integrate pipeline with data engineering code
• Automate deployment of data solutions
• Monitor and log pipeline executions
Azure Architecture Design

• Select right Azure services like Azure Synapse, Databricks etc based on data and workload requirements.
• Implement data tiers and caching for optimized data access.
• Design for scalability and high availability.
• Optimize costs by right-sizing, serverless computing, automation.
• Design interactions between services for data movement, transformation, analytics.
• Leverage tools like Azure Resource Manager for infrastructure as code.
• Enable security through access controls, encryption, VPNs.
API Management
Azure API Management helps organizations publish APIs to external,
partner, and internal developers to unlock the potential of their data and
services. It provides the core API management capabilities to ensure a
successful API program through developer engagement, analytics, security,
and protection.
