09 - Azure Data Engineering Cheatsheet
Introduction to Azure Data Engineering
Azure provides a comprehensive suite of services that help data
professionals of all levels manage, store, and analyze data in the cloud.
Azure Data Platform Overview
The Azure Data Platform provides a comprehensive suite of tools and services for data storage,
processing, and analytics to help you get started with Azure Data Engineering.
Azure for Beginners: Your First Steps
- Azure SQL Database: a fully managed relational database with auto-scale, built-in intelligence, and robust security.
- Azure Synapse Analytics: an analytics service that brings together enterprise data warehousing and Big Data analytics.
- Azure Machine Learning: a service for building, training, and deploying machine learning models on Azure's scalable infrastructure.
- Azure Cosmos DB: a globally distributed, multi-model database service for managing data at large scale, with built-in support for NoSQL.
- Azure Stream Analytics: a real-time analytics service designed to help you analyze streaming data from devices, sensors, infrastructure, and applications.
- Azure Event Hubs: a big data streaming platform and event ingestion service that can receive and process millions of events per second.
Azure SQL Database
Azure SQL Database is a fully managed cloud database service that provides a high-performance
and secure relational database platform without the management overhead.
- Fully managed service: Azure handles most of the database management functions such as upgrading, patching, backups, and monitoring.
- Auto scaling capabilities: The database can scale compute resources up or down automatically based on workload.
- Built-in intelligence: Provides performance monitoring, security alerts, and automated tuning using machine learning capabilities.
- Enterprise-grade security: Includes encryption, firewalls, role-based access, and compliance with standards.
Azure Cosmos DB
Azure Cosmos DB is a fully managed database service that provides turnkey global
distribution, elastic scaling, single-digit millisecond latency, five well-defined
consistency models and guaranteed high availability.
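Azure Data Lake Storage
Azure Data Lake Storage is a scalable, secure data lake for big data analytics workloads,
built on Azure Blob Storage.
Azure Databricks
Azure Databricks provides a fast, easy, and collaborative Apache Spark analytics service
optimized for data science and data engineering workloads.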
Azure Databricks provides a managed Apache Spark
environment on Azure that can handle large-scale data
processing. As an advanced data engineer, you should know
how to optimize Spark jobs and clusters on Databricks for
performance. This includes tuning Spark memory,
partitioning data effectively, and choosing optimal file
formats. You should also know how to integrate Databricks
with other Azure services like Azure Storage, Azure Synapse
Analytics, Azure Data Factory, and Azure Machine Learning
to build complete data pipelines and machine learning
workflows.
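The partitioning advice above can be made concrete with a simple rule of thumb. The sketch below is a minimal illustration in plain Python and does not call Spark. It assumes a target of roughly 128 MB per partition, which is a common starting point rather than a Databricks requirement, and the helper name `suggest_partition_count` is hypothetical.

```python
import math

# Assumed rule of thumb: aim for ~128 MB of data per Spark partition.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggest_partition_count(total_bytes: int,
                            target_bytes: int = TARGET_PARTITION_BYTES,
                            min_partitions: int = 1) -> int:
    """Return a partition count that keeps each partition near target_bytes."""
    if total_bytes < 0 or target_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_bytes must be > 0")
    return max(min_partitions, math.ceil(total_bytes / target_bytes))


# Example: a 10 GiB dataset split into ~128 MiB partitions.
ten_gib = 10 * 1024 ** 3
print(suggest_partition_count(ten_gib))  # 80
```

On a cluster, a value like this would typically be passed to `DataFrame.repartition(n)` before writing a columnar format such as Parquet, then adjusted based on observed task times.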
Azure Data Factory
Azure Data Factory is a cloud-based data integration service that can orchestrate and
automate the movement and transformation of data at scale.
- Use Azure Data Factory to design and implement advanced ETL pipelines that integrate data from multiple sources like databases, file storage, SaaS applications etc.
- Utilize data flow functionality in ADF to enable mapping data sources to sinks, applying transformations, and improving pipeline performance.
- Use advanced scheduling capabilities of ADF to define triggers, windows, dependencies etc. to orchestrate execution of pipelines.
- Monitor pipeline runs, dataset dependencies, failure points etc. in ADF. Set alerts, diagnose issues, and take corrective actions.
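To show how activity dependencies drive orchestration, here is a minimal sketch in plain Python. The dictionary is shaped like an ADF pipeline JSON definition, with `dependsOn` entries on each activity, but the pipeline and activity names are hypothetical and nothing is deployed to Azure. The helper `execution_order` is an illustration of dependency resolution, not ADF's own scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline, shaped like an ADF pipeline JSON definition.
# All names are illustrative only.
pipeline = {
    "name": "DailySalesPipeline",
    "properties": {
        "activities": [
            {"name": "CopyFromSql", "type": "Copy", "dependsOn": []},
            {"name": "CopyFromBlob", "type": "Copy", "dependsOn": []},
            {
                "name": "TransformSales",
                "type": "ExecuteDataFlow",
                "dependsOn": [
                    {"activity": "CopyFromSql", "dependencyConditions": ["Succeeded"]},
                    {"activity": "CopyFromBlob", "dependencyConditions": ["Succeeded"]},
                ],
            },
        ]
    },
}


def execution_order(pipeline_def: dict) -> list[str]:
    """Return activity names in an order that respects dependsOn."""
    graph = {
        activity["name"]: {dep["activity"] for dep in activity["dependsOn"]}
        for activity in pipeline_def["properties"]["activities"]
    }
    # TopologicalSorter takes a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(pipeline)
print(order)
```

The two copy activities have no dependencies, so they may appear in either order (and could run in parallel), while `TransformSales` always comes last.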
Azure Synapse Analytics
Azure Synapse Analytics combines enterprise data warehousing with Big Data analytics in a single service.
Azure Stream Analytics
Azure Stream Analytics is a real-time event processing engine that can process millions
of events per second for real-time insights.
- Process real-time data streams from multiple sources like IoT devices, websites, etc.
- Write SQL-like queries to transform data streams
- Integrate with other Azure services for visualization, storage, etc.
- Get real-time insights and react quickly
- Process large volumes of data with low latency
- Save on infrastructure costs with serverless implementation
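To illustrate what a SQL-like windowed query over a stream does, the sketch below emulates a tumbling-window average in plain Python. It is not the Stream Analytics engine. The field names (`device_id`, `temperature`, `ts`) and the 10-second window are assumptions chosen for the example, and the window boundaries are simplified to half-open intervals. In Stream Analytics itself, a comparable query would group by `TumblingWindow(second, 10)`.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_seconds=10):
    """Average temperature per (device, window start) over fixed, non-overlapping windows."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for event in events:
        # Each event belongs to exactly one window: [start, start + window_seconds).
        window_start = (event["ts"] // window_seconds) * window_seconds
        key = (event["device_id"], window_start)
        sums[key][0] += event["temperature"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up sample events; ts is seconds since the start of the stream.
events = [
    {"device_id": "sensor-1", "ts": 1, "temperature": 20.0},
    {"device_id": "sensor-1", "ts": 7, "temperature": 22.0},
    {"device_id": "sensor-1", "ts": 12, "temperature": 30.0},
    {"device_id": "sensor-2", "ts": 3, "temperature": 15.0},
]

result = tumbling_window_avg(events)
print(result)
```

Because tumbling windows do not overlap, every event contributes to exactly one output row, which keeps per-window aggregates easy to reason about.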
Azure HDInsight
- Managed cluster service: Provisions HDInsight clusters to process big data.
- Open source frameworks: Supports Hadoop, Spark, Hive, LLAP, Kafka and more.
- Integrates with other Azure services: Combines with Data Lake Store, Data Factory, etc.
[Figure: HDInsight architecture diagram. Diagram showing the architecture and components of Azure HDInsight.]
[Figure: HDInsight dashboard screenshot. Screenshot of the HDInsight dashboard showing cluster monitoring and management.]
[Figure: Spark processing on HDInsight diagram. Diagram illustrating running Spark jobs on HDInsight Spark clusters.]
Azure Functions
- Serverless compute
- Event-driven scale
- Microservices architecture
- Integrations and workflows
Azure Logic Apps
- Azure Logic Apps is a cloud service that automates workflows and business processes by integrating enterprise systems and services.
- Key capabilities: workflow automation, integration, scheduling, scalability, and security.
- Example use cases: order processing, ETL pipelines, and business process automation across applications.
- Benefits: increased productivity, lower costs, improved agility, and faster time to market.
Azure Event Grid
Azure Event Grid is a fully managed event routing service that delivers events from publishers to subscribers.
Azure Data Catalog
- Metadata and discovery service: Azure Data Catalog is a fully managed service that serves as a system of registration and system of discovery for enterprise data sources.
- Register data sources: You can register data sources like SQL Server, Oracle, Blob Storage, etc. to be indexed by Data Catalog.
- Discover data sources: Users can search and filter the catalog to discover data sources relevant to their needs.
- Azure Monitor dashboard showing metrics: Azure Monitor provides a dashboard to visualize metrics across cloud resources.
- Application map in Application Insights: Application Insights generates a visual map of all components of your application.
- Analyzing telemetry in Application Insights: Application Insights enables analysis of telemetry like requests, dependencies, logs etc.
Azure Stream Analytics and Event Hubs
- Implementing real-time data processing pipelines with Stream Analytics
- Ingesting and managing high volume data streams with Event Hubs
- Integrating Stream Analytics with other Azure services
- Monitoring and troubleshooting Stream Analytics jobs
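High-volume ingestion usually means sending events in batches rather than one at a time. The sketch below is a plain-Python illustration of size-bounded batching and does not use the Event Hubs SDK. The function name `batch_events` and the 1024-byte limit are assumptions for the example, not Event Hubs limits.

```python
import json


def batch_events(events, max_batch_bytes=1024):
    """Group JSON-serializable events into batches whose encoded size does not exceed a limit."""
    batches, current, current_size = [], [], 0
    for event in events:
        size = len(json.dumps(event).encode("utf-8"))
        if size > max_batch_bytes:
            raise ValueError("single event exceeds the batch size limit")
        # Start a new batch when adding this event would exceed the limit.
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(event)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Made-up sample events with a ~100-byte payload each.
events = [{"id": i, "payload": "x" * 100} for i in range(20)]
batches = batch_events(events, max_batch_bytes=1024)
print([len(b) for b in batches])
```

Event order is preserved within and across batches, which matters when downstream consumers such as Stream Analytics jobs rely on arrival order within a partition.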
Azure Machine Learning
Azure Machine Learning allows you to build, train, and
deploy machine learning models in Azure. As an
advanced data engineer, you should know how to
leverage Azure ML to implement predictive models and
integrate them with other Azure data services like
Azure Synapse Analytics and Azure Databricks for
advanced analytics pipelines.
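A deployed model is commonly wrapped in a scoring script with an `init()` function that loads the model once and a `run()` function that handles each request. The sketch below follows that shape with a stand-in linear model. The coefficients, the input schema, and the JSON layout are assumptions for illustration, and no Azure ML SDK calls are made.

```python
import json

model = None


def init():
    """Load the model once at startup. Here, a stand-in linear model."""
    global model
    model = {"weights": [2.0, -1.0], "bias": 0.5}  # hypothetical coefficients


def run(raw_data: str) -> str:
    """Score a JSON payload of the form {"data": [[x1, x2], ...]}."""
    rows = json.loads(raw_data)["data"]
    predictions = [
        sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]
        for row in rows
    ]
    return json.dumps({"predictions": predictions})


# Local smoke test with made-up inputs.
init()
response = run(json.dumps({"data": [[1.0, 2.0], [3.0, 0.0]]}))
print(response)
```

Keeping model loading in `init()` and request handling in `run()` makes the same script easy to exercise locally before it is called from a Synapse or Databricks pipeline.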
Azure Monitor and Log Analytics
- Set up metrics and alerts: Configure metrics and alerts in Azure Monitor to get notified about critical events and anomalies.
- Use Log Analytics workspaces: Collect logs and telemetry from services and applications into Log Analytics workspaces for analysis.
- Analyze logs with queries: Write Kusto queries to analyze collected log data and troubleshoot issues.
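As a sketch of the kind of Kusto query used for troubleshooting, the string below counts failed requests in 5-minute bins. The table and column names (`AppRequests`, `Success`, `TimeGenerated`) are assumptions about the workspace schema. The Python function emulates the same aggregation on made-up sample records so the logic can be checked without a workspace.

```python
from collections import Counter
from datetime import datetime, timedelta

# Illustrative Kusto query; table and column names are assumed.
KQL_QUERY = """
AppRequests
| where Success == false
| summarize FailedCount = count() by bin(TimeGenerated, 5m)
"""


def failed_per_bin(records, bin_size=timedelta(minutes=5)):
    """Count failed records per time bin, mirroring the query above."""
    counts = Counter()
    epoch = datetime(1970, 1, 1)
    for record in records:
        if record["Success"]:
            continue
        # Round the timestamp down to the start of its bin.
        elapsed = record["TimeGenerated"] - epoch
        bin_start = epoch + (elapsed // bin_size) * bin_size
        counts[bin_start] += 1
    return dict(counts)


# Made-up sample records.
records = [
    {"TimeGenerated": datetime(2024, 1, 1, 12, 1), "Success": False},
    {"TimeGenerated": datetime(2024, 1, 1, 12, 4), "Success": False},
    {"TimeGenerated": datetime(2024, 1, 1, 12, 6), "Success": True},
    {"TimeGenerated": datetime(2024, 1, 1, 12, 8), "Success": False},
]
result = failed_per_bin(records)
print(result)
```

A per-bin failure count like this is also a natural signal to attach an Azure Monitor alert to, for example when the count in any bin crosses a chosen threshold.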
- Select the right Azure services, like Azure Synapse, Databricks etc., based on data and workload requirements.
- Implement data tiers and caching for optimized data access.
- Design for scalability and high availability.
- Optimize costs by right-sizing, serverless computing, and automation.