Module 4

Uploaded by

Neha Gupta
Copyright
© All Rights Reserved

Module 4: Microsoft Azure Data Fundamentals: Explore data analytics in Azure

Describe data warehousing architecture
Large-scale data analytics architecture can vary, as can the specific technologies used
to implement it, but in general it includes the following elements:
Data ingestion and processing – data from one or more transactional data stores, files, real-time streams, or
other sources is loaded into a data lake or a relational data warehouse. The load operation usually involves
an extract, transform, and load (ETL) or extract, load, and transform (ELT) process in which the data is cleaned,
filtered, and restructured for analysis. In ETL processes, the data is transformed before being loaded into an
analytical store, while in an ELT process the data is copied to the store and then transformed. Either way, the
resulting data structure is optimized for analytical queries.

Analytical data store – data stores for large-scale analytics include relational data warehouses,
file-system-based data lakes, and hybrid architectures that combine features of data warehouses and data
lakes (sometimes called data lakehouses or lake databases). We'll discuss these in more depth later.
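
The difference between the ETL and ELT orderings described above can be sketched in a few lines of Python. This is only an illustration: the sample rows, the `transform` rules, and the dictionary standing in for an analytical store are all assumptions made for the example, not part of any Azure service.

```python
# Hypothetical raw rows from a transactional source (illustrative data only).
RAW_ORDERS = [
    {"order_id": "1", "amount": " 19.99 ", "status": "complete"},
    {"order_id": "2", "amount": "5.00", "status": "cancelled"},
    {"order_id": "3", "amount": "42.50", "status": "complete"},
]


def transform(rows):
    """Clean, filter, and restructure rows for analysis."""
    return [
        {"order_id": int(r["order_id"]), "amount": float(r["amount"].strip())}
        for r in rows
        if r["status"] == "complete"
    ]


def run_etl(source, store):
    """ETL: transform in flight, so only analysis-ready rows reach the store."""
    store["orders"] = transform(source)


def run_elt(source, store):
    """ELT: copy the raw rows into the store, then transform them there."""
    store["orders_raw"] = list(source)
    store["orders"] = transform(store["orders_raw"])


# Plain dictionaries stand in for the analytical store (an assumption).
etl_store, elt_store = {}, {}
run_etl(RAW_ORDERS, etl_store)
run_elt(RAW_ORDERS, elt_store)
```

Both stores end up with the same analysis-ready `orders` data; the ELT store additionally keeps the raw copy, because the transformation ran after loading.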

Explore data ingestion pipelines


On Azure, large-scale data ingestion is best implemented by creating pipelines that orchestrate ETL processes.
You can create and run pipelines using Azure Data Factory, or you can use a similar pipeline engine in Azure
Synapse Analytics or Microsoft Fabric if you want to manage all of the components of your data analytics
solution in a unified workspace.
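
As a rough illustration of what "orchestrating" means here, the sketch below runs a sequence of activities in order and records each one as it completes. It is a toy model in plain Python, not the Azure Data Factory, Azure Synapse Analytics, or Microsoft Fabric API; the activity functions, the shared `ctx` dictionary, and `run_pipeline` are all hypothetical names chosen for the example.

```python
from typing import Callable

# Each activity takes the pipeline's shared context and updates it.
Activity = Callable[[dict], None]


def extract(ctx: dict) -> None:
    # Stand-in for reading from a source system.
    ctx["rows"] = ["10", "x", "25"]


def transform(ctx: dict) -> None:
    # Keep only values that parse as integers.
    ctx["rows"] = [int(v) for v in ctx["rows"] if v.isdigit()]


def load(ctx: dict) -> None:
    # Stand-in for writing to an analytical store.
    ctx["store"] = list(ctx["rows"])


def run_pipeline(activities: list[Activity]) -> dict:
    """Run activities in order, recording each one's name as it completes."""
    ctx: dict = {"log": []}
    for activity in activities:
        activity(ctx)
        ctx["log"].append(activity.__name__)
    return ctx


result = run_pipeline([extract, transform, load])
```

A real pipeline engine adds scheduling, retries, monitoring, and connections to external services on top of this basic idea of sequenced activities.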

Explore analytical data stores


There are two common types of analytical data store.

Data warehouses
A data warehouse is a relational database in which the data is stored in a schema that is optimized for data
analytics rather than transactional workloads. Commonly, the data from a transactional store is transformed
into a schema in which numeric values are stored in central fact tables, which are related to one or
more dimension tables that represent entities by which the data can be aggregated.
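
A minimal sketch of the fact and dimension structure described above, using Python's built-in `sqlite3` module as a stand-in for a relational data warehouse. The table names, columns, and sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: the entities by which facts can be aggregated.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    -- Fact table: numeric measures, keyed to the dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Helmets');
    INSERT INTO fact_sales VALUES (1, 2, 500.0), (1, 1, 250.0), (2, 3, 90.0);
""")

# Aggregate the numeric facts by an attribute of the dimension.
totals = conn.execute("""
    SELECT d.category, SUM(f.quantity), SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
```

The query joins the fact table to the dimension table and groups by a dimension attribute, which is the typical shape of an analytical query against this kind of schema.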
Data lakes
A data lake is a file store, usually on a distributed file system for high-performance data access. Technologies
like Spark or Hadoop are often used to process queries on the stored files and return data for reporting and
analytics. These systems often apply a schema-on-read approach to define tabular schemas on semi-structured
data files at the point where the data is read for analysis, without applying constraints when it's stored.
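
The schema-on-read idea can be sketched as follows: the file is stored without any enforced structure, and column names and types are applied only when a reader consumes it. This uses only the Python standard library rather than Spark or Hadoop; the file contents, the `SCHEMA` list, and `read_with_schema` are hypothetical.

```python
import csv
import io

# Hypothetical file contents, stored as-is with no schema enforced on write.
RAW_FILE = "1,Widget,2.50\n2,Gadget,10\n"

# The schema lives with the reader, not the file: column names and types.
SCHEMA = [("product_id", int), ("name", str), ("price", float)]


def read_with_schema(text, schema):
    """Apply column names and type conversions as rows are read."""
    for record in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, record)}


rows = list(read_with_schema(RAW_FILE, SCHEMA))
```

A different reader could apply a different schema to the same file, which is the flexibility that schema-on-read trades against the up-front constraints of a relational store.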

Explore platform-as-a-service (PaaS) solutions


Azure Synapse Analytics is a unified, end-to-end solution for large-scale data analytics. It brings together
multiple technologies and capabilities, enabling you to combine the data integrity and reliability of a scalable,
high-performance SQL Server-based relational data warehouse with the flexibility of a data lake and
open-source Apache Spark. It also includes native support for log and telemetry analytics with Azure Synapse Data
Explorer pools, as well as built-in data pipelines for data ingestion and transformation.

Azure Databricks is an Azure implementation of the popular Databricks platform. Databricks is a
comprehensive data analytics solution built on Apache Spark, and offers native SQL capabilities as well as
workload-optimized Spark clusters for data analytics and data science.

Azure HDInsight is an Azure service that supports multiple open-source data analytics cluster types. Although
not as user-friendly as Azure Synapse Analytics and Azure Databricks, it can be a suitable option if your
analytics solution relies on multiple open-source frameworks or if you need to migrate an existing on-premises
Hadoop-based solution to the cloud.

Explore Microsoft Fabric


Scalable analytics with PaaS services can be complex, fragmented, and expensive. With Microsoft Fabric, you
don't have to spend all of your time combining various services and implementing interfaces through which
business users can access them. Instead, you can use a single product that is easy to understand, set up, create,
and manage. Fabric is a unified software-as-a-service (SaaS) offering, with all your data stored in a single open
format in OneLake.

Check your knowledge


1. Which Azure PaaS services can you use to create a pipeline for data ingestion and processing?

Azure SQL Database and Azure Cosmos DB

Azure Synapse Analytics and Azure Data Factory
That's correct. Both Azure Synapse Analytics and Azure Data Factory include the capability to create
pipelines.

Azure HDInsight and Azure Databricks

2. What must you define to implement a pipeline that reads data from Azure Blob Storage?

A linked service for your Azure Blob Storage account
That's correct. You need to create linked services for external services you want to use in the pipeline.

A dedicated SQL pool in your Azure Synapse Analytics workspace

An Azure HDInsight cluster in your subscription

3. Which open-source distributed processing engine does Azure Synapse Analytics include?

Apache Hadoop

Apache Spark
That's correct. Azure Synapse Analytics includes an Apache Spark runtime.

Apache Storm

Understand batch and stream processing
