Module 4
Analytical data store – data stores for large-scale analytics include relational data warehouses,
file-system-based data lakes, and hybrid architectures that combine features of data warehouses and data
lakes (sometimes called data lakehouses or lake databases). We'll discuss these in more depth later.
Data warehouses
A data warehouse is a relational database in which the data is stored in a schema that is optimized for data
analytics rather than transactional workloads. Commonly, the data from a transactional store is transformed
into a schema in which numeric values are stored in central fact tables, which are related to one or
more dimension tables that represent the entities by which the data can be aggregated.
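The fact/dimension layout described above can be sketched with Python's built-in sqlite3 module. This is only an illustration, not how a production warehouse is built: the table names (fact_sales, dim_product, dim_date), the columns, and the sample rows are all assumptions made up for the example.

```python
import sqlite3

# In-memory database standing in for a data warehouse (illustration only).
conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one central fact table holding numeric measures,
# related by keys to two dimension tables that describe the entities.
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    month    INTEGER
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    quantity    INTEGER,
    revenue     REAL
);
""")

# Made-up sample rows.
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Road bike", "Bikes"), (2, "Helmet", "Accessories")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20240101, 2024, 1), (20240201, 2024, 2)],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 20240101, 2, 2000.0), (2, 20240101, 5, 150.0), (1, 20240201, 1, 1000.0)],
)


def revenue_by_category(connection):
    """Aggregate the fact table's revenue by a dimension attribute."""
    rows = connection.execute("""
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    return dict(rows)


totals = revenue_by_category(conn)
```

Swapping dim_product for dim_date in the join would aggregate the same measures by year or month instead, which is the point of keeping the measures and the descriptive entities in separate tables.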
Data lakes
A data lake is a file store, usually on a distributed file system for high-performance data access. Technologies
like Spark or Hadoop are often used to process queries on the stored files and return data for reporting and
analytics. These systems often apply a schema-on-read approach to define tabular schemas on semi-structured
data files at the point where the data is read for analysis, without applying constraints when it's stored.
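Schema-on-read can be shown in miniature with plain Python. This is a sketch of the idea only, not of how Spark or Hadoop implement it: the raw text, the column names, and the read_with_schema helper are hypothetical. The file is stored as untyped text with no constraints, and column names and types are applied only when it is read.

```python
import csv
import io

# Raw file contents as they might sit in a data lake: no header, no types,
# and nothing was validated when the data was written (made-up sample data).
RAW_FILE = "1,Road bike,2,1000.50\n2,Helmet,5,30.00\n"

# The schema lives with the reader, not with the stored file.
# Each entry is (column name, function that converts the raw string).
SCHEMA = [
    ("product_id", int),
    ("name", str),
    ("quantity", int),
    ("unit_price", float),
]


def read_with_schema(raw_text, schema):
    """Apply column names and types to raw delimited text at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append(
            {name: cast(value) for (name, cast), value in zip(schema, record)}
        )
    return rows


rows = read_with_schema(RAW_FILE, SCHEMA)
total_units = sum(row["quantity"] for row in rows)
```

A different analysis could read the same stored file with a different schema (for example, treating product_id as a string), because nothing about the tabular structure was fixed at write time.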
Azure HDInsight is an Azure service that supports multiple open-source data analytics cluster types. Although
not as user-friendly as Azure Synapse Analytics and Azure Databricks, it can be a suitable option if your
analytics solution relies on multiple open-source frameworks or if you need to migrate an existing on-premises
Hadoop-based solution to the cloud.
That's correct. Both Azure Synapse Analytics and Azure Data Factory include the capability to create
pipelines.
Azure HDInsight and Azure Databricks
2. What must you define to implement a pipeline that reads data from Azure Blob Storage?
That's correct. You need to create linked services for external services you want to use in the pipeline.
A dedicated SQL pool in your Azure Synapse Analytics workspace
3. Which open-source distributed processing engine does Azure Synapse Analytics include?
Apache Hadoop
Apache Spark