
Synergy in Analytics: Unifying Azure Databricks and Microsoft Fabric

Contents

Empower modern data analytics with Azure Databricks and Microsoft Fabric
Simplify analytics workloads with Azure Databricks and Microsoft Fabric
Medallion architecture in Azure Databricks and Microsoft Fabric
Use lakehouse data with Azure Databricks and Microsoft Fabric
Better together: Azure Databricks, Unity Catalogue and Microsoft Fabric Purview
Data Factory and Azure Databricks activity in Microsoft Fabric
Enhance organisational capabilities with generative AI
Explore real-world use cases with hands‑on examples
Achieve excellence with Azure Databricks and Microsoft Fabric
Next steps

Empower modern data analytics with Azure Databricks and Microsoft Fabric

Today, organisations need to understand how to manage the ever-increasing flood of data in an efficient and insightful manner. A data lakehouse combines the vast storage of a data lake with the structured processing of various data services. It supports extensive data storage and complex analytics without compromise. It is more than a storage solution; it enhances data intelligence and supports advanced analytics, addressing the challenge of converting abundant data into actionable insights.

Both Azure Databricks and Microsoft Fabric are comprehensive analytics solutions. Fabric has more business‑user‑friendly tools, and Azure Databricks has an integrated AI platform, but since they both rely on the same data storage layer, they can be used together as a more powerful whole. Azure Databricks, Fabric and OneLake allow organisations to streamline their data architecture, simplifying analytics workloads and enabling efficient data management and analysis across a unified platform.

Cloud environments offer vast


computational resources and scalability on
demand; as an organisation’s data grows,
their infrastructure can grow alongside it
seamlessly and cost‑effectively. This
synergy between cloud platforms and the
data lakehouse architecture is pivotal,
providing a resilient and adaptable
foundation for any enterprise looking to
thrive in the data‑driven economy.
Effective management and robust security
measures are essential in the cloud-based
data landscape to protect this
invaluable asset.

Simplify analytics workloads with Azure Databricks and Microsoft Fabric

The modern data lakehouse architecture enables enterprises to utilise the synergies between
Azure Databricks and Microsoft Fabric. Both Azure Databricks and Fabric offer a unified,
comprehensive set of tools for a broad spectrum of advanced analytics scenarios and work
together to provide a complete range of solutions for working with a data lakehouse. With
elements from data engineering, data science, data warehousing and Power BI, they deliver
wide-ranging analytics features, a cohesive experience for users and a single data repository
accessible to various analytics tools. Azure Databricks also provides comprehensive governance
and lineage tracking of both data and AI assets in a single unified experience.

Figure 1: Azure Databricks and Microsoft Fabric integration in a lakehouse architecture



Maximise data potential with Azure Databricks and Microsoft Fabric

Azure Databricks and Fabric integration allows users to seamlessly switch between platforms, offering customers a cohesive and powerful solution for data management and analytics and facilitating AI and machine learning projects with ease and efficiency.

Data management

OneLake centralises data from diverse sources. The integration of Azure Databricks with Fabric not only revolutionises data management, scalability and data processing but also centralises data from a wide range of sources through OneLake. This comprehensive approach ensures that Azure Databricks can seamlessly connect with data stored in Azure Data Lake Storage (ADLS), various databases and OneLake itself. This simplifies the management of vast data volumes, enhances the ability to scale data projects and streamlines the data processing pipeline.

• Centralised storage: Using OneLake within Fabric allows centralised data management, which simplifies data access and governance, ensuring Azure Databricks can directly utilise the data for analytical processes.

• Seamless integration: The seamless integration between Azure Databricks and Data Factory in Fabric facilitates streamlined workflows from data ingestion and validation to transformation. This integration enables a cohesive data management strategy that supports data analytics, data science, as well as AI projects.

• Enhanced security and accessibility: Premium Azure Databricks workspaces support credential passthrough, strengthening the security and ease of access to OneLake resources. This feature ensures secure and straightforward access to the centralised data for further processing and analysis.

Scalability and reproducibility

Azure Databricks and Fabric support scalable data workflows, reproducible AI and analytics projects and dynamic data processing capabilities. This integration allows organisations to manage large volumes of data efficiently, ensure consistent results across their data environments and adapt processing power to meet diverse requirements, driving reliable and scalable data operations.

• Scalable data workflows: Azure Databricks activities within Data Factory are designed to support scalable data workflows. Organisations can efficiently handle vast amounts of data, scaling their data processing and analytics operations as needed without compromising performance or reliability.

• Reproducible AI projects: The integration ensures that AI and analytics projects are reproducible, benefiting from features such as data versioning and lineage tracking. These features, available natively on both platforms, enhance the reliability of AI projects and ensure consistency across data environments.

• Dynamic data processing capabilities: Azure Databricks offers dynamic data processing capabilities that adapt to varying data volumes and processing requirements. This flexibility is crucial for organisations to scale their analytics operations efficiently.

Data processing

Combining the robust data transformation capabilities of Azure Databricks with the sophisticated orchestration of Data Factory pipelines, these technologies synergise to support a comprehensive range of data processing tasks.

• Efficient data transformation: Azure Databricks excels at transforming data stored in OneLake and other sources. Together, Azure Databricks and Fabric support an extensive range of data processing tasks, including data exploration, cleaning and preparation, crucial for preparing datasets for AI and machine learning.

• Orchestration of complex workflows: Data Factory pipelines that include Azure Databricks activities allow for the orchestration of complex data transformation workflows. These pipelines can validate data sources, copy data to designated storage and execute notebooks for data transformation, providing a comprehensive solution for data processing.

The synergy between Azure Databricks and Fabric, especially through Data Factory, enhances data management, ensures scalability and reproducibility and facilitates efficient data processing. This integration is vital for organisations using their data for insightful analytics and AI‑driven decision making.

Advantages of lakehouse architecture

The lakehouse architecture is built on the open-source Delta Lake storage format. In addition to its technical capabilities, such as ACID transaction consistency, it enhances the overall effectiveness of the entire platform. Further, it enables use by multiple processing engines at the same time, as it uses open formats and allows tools such as Azure Databricks and Fabric to work with the same copy of the data at the same time.

Enterprises don't have to rely on just one tool to process their data; instead, they can select the best tool for each project. Lakehouse architecture revolutionises how enterprises manage, scale and process their data. This innovative approach ensures data integrity and consistency through transactional support, fostering a more effective data management platform. Furthermore, the architecture's ability to amalgamate different storage formats under one roof simplifies the complex landscape of data estates. It facilitates the dynamic adjustment of computing resources in line with real-time demand, eliminating wasteful over‑provisioning and enhancing cost efficiency and resource utilisation.

At the heart of this architecture is the integration of extensive data lakes and structured data warehouses, creating an optimal environment for fostering AI and machine learning innovation. This ensures access to both computational power and data, accelerating innovation and streamlining the management of diverse data systems.

Figure 2: Analytics in the lakehouse – hybrid and multicloud data sources (appliances, cloud data warehouses, databases, Hadoop and more) feed a data lake and an open, governed data lakehouse that supports data integration, data engineering, data science, data warehousing, real-time and streaming analytics and business intelligence under a common security, governance and compliance layer.

With the unique combination of the vast storage capacity of a data lake and the structured,
query-optimised environment of a data warehouse, the modern lakehouse emerges as the ideal
platform for developing and deploying AI algorithms. This dual capability ensures that AI projects
can utilise the necessary computational power and data accessibility, speeding up innovation and
reducing overhead costs associated with managing separate data systems.

By simplifying data architecture and reducing infrastructure complexity, businesses can focus on creating value through AI rather than grappling with data management challenges in the following ways:

• The lakehouse architecture stores Delta Lake files in an ADLS account. This cloud storage service is extremely cost effective, and the Delta Lake format allows the storage of both structured and unstructured data.

• Building AI models requires massive amounts of compute power from both traditional CPUs and advanced GPUs. Since the data lakehouse architecture empowers the use of multiple compute engines, including both Azure Databricks and Fabric, enterprises can bring the right type of processing power to their data exploration and data modelling tasks.

Enterprises can use the advanced machine learning and AI capabilities of Azure Databricks and Fabric on their full data estate stored in a data lakehouse. These tools include end-to-end experiment management and automated machine learning toolkits that can super-charge AI projects.

Medallion architecture in Azure Databricks and Microsoft Fabric

The medallion architecture is a sophisticated approach within the broader concept of the
lakehouse architecture, designed to streamline data workflows from ingestion to insights.
At its core, it consists of three layers: bronze, silver and gold, each serving a distinct purpose
in the data lifecycle.

Figure 3: Medallion architecture



The three layers of the medallion architecture are:

1. Bronze (raw): In this layer, raw data is initially ingested, retaining its original form. It acts as a staging area and is crucial for capturing the full granularity of data without any loss of fidelity.

2. Silver (validated): In this layer, data from disparate sources is matched, merged and conformed, making it ready for more complex analytical tasks. The silver layer is designed to provide an enterprise view of key business entities and is critical for supporting self-service analytics and intermediate data storage needs.

3. Gold (enriched): In this layer, data is further optimised for specific business needs and is often structured into de-normalised, read-optimised formats that are suitable for high-performance querying and reporting. The gold layer typically hosts data models that are directly used in business intelligence applications and decision support systems. Data becomes a true business asset in the gold layer, offering valuable and actionable insights.

Azure Databricks and Fabric utilise this architecture to enhance their data management and analytical offerings. Together, they create a robust environment in which data not only flows seamlessly through each stage of the medallion architecture, but is also enriched and made more accessible.

Lakehouse integration with Azure Databricks and Microsoft Fabric

The lakehouse-first pattern represents a transformative approach to data management and analytics. This approach is built on a tiered data storage system, organising data into bronze, silver and gold layers. This structured flow of data facilitates more efficient data processing, analytics and machine learning applications to transform raw data into data that's optimised for businesses.

Azure Databricks excels in processing large volumes of data with its Spark-based analytics engine, effectively handling data transformations necessary for transitioning data from the bronze layer to the silver layer.
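To make the bronze-to-silver-to-gold flow concrete, the following PySpark sketch shows one way the layers could be materialised as Delta Lake tables. It is illustrative only: the table names, column names and landing path (bronze_sales, order_id, Files/landing/sales/ and so on) are hypothetical and not taken from this guide, and the same pattern could run in an Azure Databricks or Fabric notebook against shared OneLake storage.

from pyspark.sql import functions as F

# Bronze: land the raw data as-is, preserving full granularity
raw = spark.read.json("Files/landing/sales/")  # hypothetical landing path
raw.write.format("delta").mode("append").saveAsTable("bronze_sales")

# Silver: validate, deduplicate and conform the raw records
silver = (
    spark.read.table("bronze_sales")
    .dropDuplicates(["order_id"])
    .filter(F.col("order_date").isNotNull())
    .withColumn("order_date", F.to_date("order_date"))
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_sales")

# Gold: aggregate into a de-normalised, read-optimised business table
gold = silver.groupBy("region", "order_date").agg(F.sum("amount").alias("total_sales"))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_sales_by_region")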

Fabric provides a cohesive analytics platform that integrates deeply with Azure Databricks. It offers sophisticated data management tools and helps connect and ingest data seamlessly from various sources through its extensive connector ecosystem.

This integration ensures that data moves freely through each layer of the medallion architecture, maintaining integrity and consistency while minimising complexity and overhead.

Foundation of open-source storage formats

Adopting Apache Parquet and Delta Lake enables OneLake and Azure Databricks to optimise Fabric engines and enhance interoperability across their platforms. This strategy ensures the robust handling of large datasets, facilitates seamless data access across the lakehouse architecture and reduces the complexities typically associated with managing large-scale data architectures:

• Standardisation on Apache Parquet and Delta Lake: OneLake adopts these formats for handling large datasets and support for transactional capabilities (ACID properties). This standardisation ensures that all data across the Fabric engines is optimised for both performance and compatibility, leading to more efficient data processing workflows.

• Optimisation of Fabric engines for Apache Parquet and Delta Lake: By redesigning data processing engines to be optimised for these formats, the system ensures high-performance data operations, which are crucial for processing large volumes of data efficiently.

• Interoperability across the system: The ability of Azure Databricks to read any Fabric artifact in OneLake highlights the interoperable nature of these technologies, ensuring that data can be seamlessly accessed and utilised across different parts of the lakehouse architecture.

The use of open-source storage formats within a lakehouse architecture maximises data utility, increases operational efficiency and reduces the complexity that's traditionally associated with managing large-scale data architectures. Apache Parquet and Delta Lake facilitate this by ensuring that data is stored in a robust and widely compatible format, making it easier for organisations to integrate and analyse data across diverse systems and platforms.

Integration of open-source formats with medallion architecture

The use of open-source formats such as Apache Parquet and Delta Lake standardises data storage and access within this integrated system, supporting advanced data management features such as ACID transactions and schema evolution. The powerful combination of the processing capabilities of Azure Databricks and the management tools of Fabric within the medallion architecture enables businesses to approach data architecture in a scalable and efficient manner, facilitating the generation of transformative insights.

Azure Databricks excels in data processing and analytics, using Apache Spark to perform robust data transformations and analyses at scale. Its integration with ADLS Gen2 allows Azure Databricks to handle massive datasets efficiently, preparing data for further analytical processing. Fabric extends the capabilities of Azure Databricks by offering additional tools for data management, such as easy access to data sources through more than 200 native connectors and streamlined data ingestion mechanisms. This allows enterprises to implement a comprehensive data strategy that covers everything from ingestion to insightful analytics.

Within this architecture, OneLake helps centralise data management without necessitating physical data movement. Data stored in various locations can be accessed and analysed as if it were within a single repository. The federation capabilities of Azure Databricks further complement this by allowing queries across different data stores, thereby enhancing the flexibility and scope of data analytics.

The synergy between Azure Databricks and Fabric provides a robust foundation for building advanced lakehouse architectures. This combination simplifies data management across disparate data sources and enhances the analytical capabilities of organisations, enabling them to derive actionable insights more efficiently and with greater accuracy. This empowers enterprises with the ability to maximise the value of their data assets in a secure and scalable manner.

Lakehouse-first pattern scenarios

By adopting the medallion architecture with Azure Databricks and Fabric, integrating lakehouse data with Fabric workloads and utilising OneLake data with Lakehouse Federation, organisations can enable a sophisticated, tiered storage model within a lakehouse environment. This unified approach facilitates efficient data processing and management. It supports agile responses to data-driven insights and operational needs while enhancing the scalability and flexibility of data analytics across multiple storage systems:

• Medallion architecture with Azure Databricks and Fabric: This approach utilises a tiered storage model within a data lakehouse environment, facilitating efficient data processing and management. By using Azure Databricks in conjunction with Fabric, organisations can manage large-scale data analytics pipelines more effectively.

• Integrating Azure Databricks lakehouse data with Fabric workloads: In this scenario, data stored and managed in an Azure Databricks lakehouse can be directly used with the analytics tools in Fabric. This integration supports a more agile response to data-driven insights and operational needs.

• Utilising OneLake data in Azure Databricks with Lakehouse Federation: This set-up allows the utilisation of data across multiple storage systems within Azure Databricks. By federating data sources, users can query data across these sources as if they were a single entity, enhancing the flexibility and scalability of data analytics operations.

Use lakehouse data with Azure Databricks and Microsoft Fabric

The synergy between Azure Databricks and Microsoft Fabric offers organisations a powerful and efficient way to handle their data workloads. From ingestion and storage to analysis and reporting, enterprises benefit from a secure and governed framework. This flexibility empowers teams to select the platform that best fits their project needs, ensuring seamless integration within the broader enterprise ecosystem.

For example, a data science team that works primarily in notebook coding environments will appreciate the rich features of the Azure Databricks UI and the flexibility for managing advanced Spark libraries on the clusters, and AI engineers will appreciate the native ability to fine‑tune models on their data. Business analysts may prefer the ease of use of the low-code dataflows in Fabric for quickly building pipelines that transform data and create new datasets in the gold layer of the lakehouse. Both teams can use their preferred tools to work on the same datasets without the need for either team to make copies of the data in their own environment.

Interacting with lakehouse data

Lakehouse data within the Azure ecosystem is typically stored in cloud locations, which can be categorised into two primary types:

1. ADLS accounts: ADLS is a cloud storage system optimised for analytics workloads. Enterprises can create and manage ADLS accounts to suit their data administration needs.

2. OneLake: OneLake is also an ADLS account, but unlike other accounts, Azure customers do not directly manage it. Instead, it is created as part of and administered by Fabric. It does not appear in the Azure portal, and although customers can interact with the data it contains, they do not have much control over the account itself.

Figure 4: Linking ADLS accounts to OneLake with shortcuts – warehousing, AI, and ingest/ETL and real-time processing workloads all reach the linked data, with Unity Catalogue providing governance, sharing and integration.

OneLake introduces shortcuts to help data professionals access data across OneLake and multiple ADLS accounts. Shortcuts allow data professionals to link data in their Unity Catalogue (with Azure Databricks shortcuts) or external ADLS accounts to OneLake, making them appear unified. Users can access data from these accounts seamlessly, without realising they're from different sources. Shortcuts help efficiently manage data by virtualising access to external data sources without unnecessary duplication, supporting scalable and efficient AI model training and deployment processes.
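As a rough illustration of what this looks like in practice, the sketch below assumes a Delta shortcut named orders_shortcut has already been created under the Tables section of a Fabric lakehouse that is attached to the notebook as its default lakehouse; the workspace and lakehouse names shown are placeholders rather than names from this guide. The same OneLake item can also be reached from Azure Databricks through its ABFS path.

# In a Fabric notebook: a shortcut surfaces like any other Delta table in the lakehouse
orders = spark.read.table("orders_shortcut")
orders.show(5)

# From Azure Databricks: read the same OneLake item directly via its ABFS path
onelake_path = (
    "abfss://SalesWorkspace@onelake.dfs.fabric.microsoft.com/"
    "sales_lakehouse.Lakehouse/Tables/orders_shortcut"
)
orders = spark.read.format("delta").load(onelake_path)
orders.show(5)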
Integrate Azure Databricks with Power BI for enhanced data visualisation

For advanced visualisation and dashboarding scenarios with an Azure data lakehouse, most enterprises choose Microsoft Power BI as their tool of choice. This powerful visualisation and analytics tool is now offered as part of Fabric, allowing enterprises to fully integrate the administration and billing of Power BI with other Fabric resources.

Azure Databricks integrates seamlessly with Power BI. Databricks SQL warehouses and Unity Catalogue offer a flexible and scalable solution for Power BI in the lakehouse. Data that has been processed by Azure Databricks can be used in three ways:

1. Azure Databricks Direct Publish for Power BI: Databricks can now automatically sync tables, including relationships, to Power BI semantic models in a single click. This helps analysts build reports and dashboards faster than ever before.

2. Azure Databricks connector in Power BI Desktop: Azure Databricks allows the Power BI client to connect to an Azure Databricks cluster, which can query and process the lakehouse data for you and send the results to Power BI for visualisation.

3. Power BI Direct Lake mode: Power BI can use its new Direct Lake mode to directly read Delta Lake data that has been written to an Azure storage location. This can be data that was written by Azure Databricks or by Fabric, and the storage location can be the OneLake account or any other ADLS account.

The previous section detailed how to use Azure Databricks to process raw data, prepare it for reporting and then write it to the lakehouse.

Direct Lake mode in Power BI to read and visualise Azure Databricks data

With OneLake storage, the files are stored in the efficient Delta Lake format. These Delta Lake tables have been optimised by the VertiPaq engine, making them highly efficient for consumption by Power BI. This enables Power BI to directly interact with the Delta Lake tables stored in OneLake without the need for an intermediary caching layer, such as Azure Analysis Services or Power BI datasets. This new mode of access, called Direct Lake mode, provides real‑time data access without the need for refreshing models in Power BI.

Figure 5: Deep Power BI integration – directly publish datasets to Power BI workspaces: publish from the Azure Databricks UI without Power BI Desktop, and publish entire schemas with table relationships (PK/FK).

The default dataset includes all tables from the lakehouse, allowing users to establish
relationships and apply various modelling changes. These datasets from Unity Catalogue can
be directly published to Power BI. Users can access and edit a published semantic model with
the web modelling editor that’s accessible through Power BI.

In the model view within the web modelling editor, you can see whether there’s a Direct Lake
connection by hovering the cursor over the table headers. Direct Lake also allows for the
creation of new Power BI datasets directly through the web. This process ensures the use of
Direct Lake for the connection. To learn more about using the web editor for semantic models,
the following document will help you get started: Edit data models in the Power BI service
(preview) – Power BI | Microsoft Learn.

Integrate Azure Databricks with Power BI to enhance data workflows

Integrating Azure Databricks with Power BI provides significant advantages for data management and visualisation, enhancing both security and performance in data analytics workflows:

1. First, the integration allows more secure and interactive data visualisation experiences directly from the data lake, avoiding the latency and costs associated with traditional data processing workflows. It uses Microsoft Entra ID for authentication, simplifying the user experience and increasing security, eliminating the need for personal access tokens. This integration ensures that security controls at the data lake level are enforced within Power BI, maintaining consistent security policies across platforms.1

2. Second, the semantic lakehouse architecture streamlines data ingestion and storage. It provides a unified storage layer that supports an extensive range of data formats and structures, significantly boosting the efficiency of data processing and transformation. This set-up not only simplifies the analytics stack, but also enhances data quality and accessibility for BI tools, enabling more sophisticated data modelling and analytics directly on large datasets.2

3. Last, the integration supports advanced analytics scenarios, simplifying the management and analysis of large data volumes. The DirectQuery option in Power BI plays a crucial role here, allowing users to perform real-time analysis without moving data out of the lakehouse. This capability is critical for maintaining up-to-the-minute accuracy in reports and dashboards, providing businesses with insights that are both deep and immediately actionable.3

These features collectively make Azure Databricks and Power BI a robust combination, offering businesses advanced tools to harness their data effectively and securely. To learn more about Power BI and Azure lakehouse integration, consult the following document: https://learn.microsoft.com/fabric/get-started/directlake-overview

1 Power Up your BI with Microsoft Power BI and Lakehouse in Azure Databricks: Part 1 – Essentials – Microsoft Community Hub
2 Power Up with Power BI and Lakehouse in Azure Databricks: Part 3 – Tuning Azure Databricks SQL – Microsoft Community Hub
3 https://docs.databricks.com/partners/bi/power-bi.html

Use Data Activator to alert on changes in Azure Databricks data via Power BI

A common scenario for enterprises is that they want to receive alerts if particular metrics exceed certain thresholds. For example, they may want to know if there's a sudden, unexpected spike in sales for a particular item, or they may want to know if the volume of transactions has plummeted below the normal range, indicating a possible problem in the transaction pipeline.

These scenarios can be handled by a new feature in Fabric called Data Activator. This no-code tool monitors data in a Power BI report and automatically takes action if the data matches certain patterns or hits specified thresholds. When these events occur, Data Activator can take an action such as alerting a user or launching a Power Automate workflow.

In order to enable Data Activator, please follow the official documentation here: https://learn.microsoft.com/fabric/data-activator/

To create an alert with Data Activator when a freezer's temperature falls below 30° F in a Power BI report, follow these steps for monitoring freezer temperatures within a Fabric workspace:

1. Confirm that your Power BI report, which includes freezer temperature data, is published online to a Fabric workspace equipped with Premium capacity.

2. Choose the temperature visual:

   a. Access the report: Open the specific Power BI report that tracks freezer temperatures.

   b. Select the relevant visual: Find the visual that displays the freezer temperatures.

3. Click the ellipsis (…) in the top-right corner of the temperature visual and select Set Alert, or use the Set Alert button found in the Power BI toolbar.

4. In the Set Alert pane, specify how you wish to receive alerts (email or Teams). If your visual includes multiple freezers (dimensions), use the For each dropdown to select the specific dimension (freezer) to monitor.

5. Define the alert condition, such as when the temperature drops below 30° F. Data Activator will monitor the temperature and notify you when this condition is met.

6. Decide where to save your Data Activator trigger in Power BI. You can add it to an existing reflex item or create a new one.

7. Click Create alert to finalise your Data Activator trigger. You can optionally deselect Start my alert if you prefer to edit the trigger in Data Activator before activating it.

By following these steps, you've successfully set up an alert in Data Activator to notify you when a monitored freezer's temperature falls below 30° F, allowing you to take immediate action if necessary. Once these data updates are complete, you should receive the alert from Data Activator that was configured.

Use Lakehouse Monitoring with Alerts to alert on changes in Azure Databricks

Enterprises often require alerts when data quality metrics exceed certain thresholds. For example, they may want to know if there's a sudden, unexpected spike in the number of missing values within a particular field, indicating a possible problem in the transaction pipeline, or if the quality of predictions from a machine learning model has declined, indicating a need to retrain the model on newer data.

These scenarios can be handled with an Azure Databricks feature called Lakehouse Monitoring with Alerts. This no-code tool monitors data quality in Unity Catalogue and automatically takes action if the data matches certain conditions or exceeds thresholds. When these events occur, Alerts will take a specified action, such as sending a notification via email, Slack or Teams. The alert can also call a webhook action, allowing users to build extensible, custom workflows based on changes in the data.

A monitor is a process that runs on a specified schedule to check the data quality of a particular table. When a user creates a monitor, it computes the data quality metrics for the table and stores the current values in a separate system table. Each time the monitor runs, it recomputes the quality metrics and compares them to the original values. If the quality has deteriorated, then an alert will be raised. For details on how a monitor can be created, consult the following document: https://docs.databricks.com/lakehouse-monitoring/create-monitor-ui.html

If a monitor detects that the quality of the data in the table has declined, it will raise the specified alert. This can be used to send a notification to the data engineering teams so they can investigate further. For details on how these alerts can be configured, check the following document: https://docs.databricks.com/lakehouse-monitoring/monitor-alerts.html

Better together: Azure Databricks, Unity Catalogue and Microsoft Fabric Purview

As the demand for analytics grows and data platforms evolve into more intricate systems, governing the platform – management of data availability, usability, integrity and security – becomes paramount. In a data lakehouse architecture, data governance helps ensure that data is properly catalogued, classified and managed. By implementing effective data governance, organisations can manage their data properly and use it to drive business value.

Effective data governance in a data lakehouse architecture requires the implementation of policies, procedures and standards for managing data. This includes defining data ownership and stewardship, establishing data quality standards and implementing data security and compliance measures. To provide these crucial data governance capabilities, both Azure Databricks and Microsoft Fabric offer powerful, modern features.

Unity Catalogue in Azure Databricks

Azure Databricks includes Unity Catalogue, which provides centralised, fine-grained access control for an organisation's data storage locations, auditing of data access and lineage tracking from ingestion through all data workloads. Azure Databricks also provides column-level and row-level access controls and data discovery tools. It now also includes system tables, which provide a straightforward way to query audit data, billing data and lineage. Additionally, Unity Catalogue is supported by AI capabilities to automatically document tables and columns, facilitate semantic search and help surface related data products.
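For teams that want to explore these system tables, the query below is a minimal sketch of the kind of analysis they enable. It is not taken from this guide: it assumes system tables have been enabled for the metastore and that it runs in an Azure Databricks notebook or SQL warehouse where a spark session is available, and the column names reflect the published schema of the billing usage table.

# Summarise recent DBU consumption by SKU from the Unity Catalogue billing system table
usage_by_sku = spark.sql("""
    SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus_consumed
    FROM system.billing.usage
    GROUP BY usage_date, sku_name
    ORDER BY usage_date DESC
""")
usage_by_sku.show()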

Microsoft Purview to govern Microsoft Fabric

Fabric integrates with Microsoft Purview for data governance, information protection and data loss prevention. The information protection features enable enterprises to discover, classify and protect the data stored in the lakehouse and apply sensitivity labels to it. Data loss prevention uses policies to detect when sensitive data is uploaded to Power BI semantic models or other supported Fabric assets. It can also help detect common sensitive data. Fabric also includes tools for discovering data lineage so that data can be tracked through the analytical process as it moves from its original source, through the various transformations and into the various reporting models.

Microsoft Purview and Unity Catalogue to streamline data governance

Microsoft Purview and Unity Catalogue are two powerful tools designed to enhance data governance and management within cloud environments, particularly for the users of the extensive cloud services offered by Microsoft.

The broad governance capabilities of Microsoft Purview can extend into the Azure Databricks environment, where Unity Catalogue applies specific governance and security measures to Azure Databricks workspaces. This integration allows organisations to maintain a consistent governance strategy across all platforms, enhancing security and operational efficiency. Organisations can ensure that data policies are uniformly applied, data lineage is clear and auditable and all regulatory compliance requirements are met across their entire data estate.

Unity Catalogue offers a sophisticated and centralised governance solution for managing a variety of data assets within the Azure Databricks lakehouse platform. It integrates seamlessly with Azure to provide fine-grained governance capabilities, including access control, auditing and data lineage. Unity Catalogue simplifies data management across multiple Azure Databricks workspaces, enabling organisations to enforce consistent security and compliance policies across their data assets, whether they are files, tables or machine learning models.

Unity Catalogue provides a single point of control for data access policies, which apply uniformly across all workspaces. This ensures that data governance is not only centralised, but also deeply integrated into the Azure Databricks environment, enhancing security and governance. Additionally, Unity Catalogue supports comprehensive data discovery, making it easier for users to find and access the data they need while adhering to the defined access controls and policies. This unified approach helps streamline operations and reduces the complexity typically associated with managing large and diverse data environments.

Microsoft Purview allows enterprises to maintain control over their data through Fabric, enabling seamless integration and management of data from various sources down to detailed reports. Along with a suite of tools to protect sensitive data across different environments, Microsoft Purview provides capabilities such as sensitive data discovery, classification and protection using sensitivity labels. It also facilitates comprehensive auditing and data loss prevention strategies specifically tailored for complex environments such as Power BI semantic models.

Best practices

The integration of Azure Databricks with Microsoft Purview focuses on maximising data governance and security within Azure Databricks environments. Key best practices for this integration include:

• Secure access to critical data: Microsoft Purview can be used to automatically discover and classify data within Azure Databricks, visualise data lineage and manage access controls effectively. This ensures that only approved personnel can access sensitive or critical data and that all data policies are consistently applied across Azure services.

• Use two separate connectors to manage metadata: Microsoft Purview offers two separate connectors for Azure Databricks. Most enterprises will use the Azure Databricks Unity Catalogue connector because Unity Catalogue enables many of the modern features in Azure Databricks. However, for customers who have not yet migrated to Unity Catalogue and are still using Hive to manage their metadata, Microsoft Purview also has an Azure Databricks Hive Metastore connector that can be used.

• Utilise custom rule sets: Enterprises can use Microsoft Purview to scan catalogues, schemas, tables and views. As a best practice, enterprises should use custom rule sets in addition to the rule sets provided by Microsoft Purview. Creating a custom rule set for different regions of the world can speed up the scanning process by using only the classification rules required in a particular region.

• Indicate data sensitivity with labelling tools: Labelling tools in Microsoft Purview can be used on Unity Catalogue data to indicate the sensitivity of files and data columns. These labels travel with the data and can be used by other tools in the Microsoft data ecosystem, such as SharePoint and Power BI, to automatically apply data handling policies.

The combined capabilities of Microsoft Purview, Azure security in OneLake and Unity Catalogue support a resilient and agile data governance strategy, enabling businesses to use their data assets effectively in a digital landscape.

Data Factory and Azure Databricks activity in Microsoft Fabric

Azure Databricks activity in Microsoft Fabric represents a significant evolution in data processing within cloud environments, integrating the extensive capabilities of Azure Data Factory into a more unified and robust framework. With the new Azure Databricks activity, users can easily create and manage data pipelines in Fabric, incorporating sophisticated analytics and processing tasks directly into their workflows.

Users can configure Azure Databricks clusters that are used for data processing directly within Fabric, like the functionality offered in Azure Data Factory. This includes the ability to set up Azure spot instances for accessing unused Azure compute capacity at reduced costs and specifying cluster policies to ensure that cluster configurations meet organisational standards and requirements.

Figure 6: Seamless integration between Azure Databricks and Data Factory with Microsoft Fabric – one activity encompasses all three job types (Notebook, JAR, Python), with Unity Catalogue support and Policy ID integration.

Moreover, the enhanced Azure Databricks integration introduces several new features. They are:

• Configure Unity Catalogue access mode: Users can configure the access mode of Unity Catalogue, which enhances governance and security by managing permissions more meticulously.

• Run multiple tasks in a single Databricks activity: Users can run various tasks, such as notebooks, JARs and Python scripts, within a single Databricks activity, streamlining the process and reducing the complexity previously associated with managing multiple types of data jobs.

• Monitor notebook runs in real time: During operation, users can initiate a data pipeline and immediately monitor its execution, with the system providing direct links to the outputs in the Azure Databricks instance. This real-time monitoring capability allows users to track the details of the notebook runs, including cluster performance and computational efficiency, directly from Fabric.

This integration not only brings existing Azure Databricks capabilities from Azure Data Factory into Fabric but also introduces new functionalities such as cluster policy and Unity Catalogue support, enhancing the overall data management and analytics experience.

Enhance organisational capabilities with generative AI
Advanced AI models have changed the technology landscape. Enterprises are trying to unlock
the potential of their data and use AI to expand their business capabilities. This includes the use
of generative AI to build their data lakehouse on Azure.

The next section will detail how to describe your desired architecture to ChatGPT and have it
generate code that can be implemented in Azure.

Generate architecture and code with Azure OpenAI

Using prompt engineering, you can generate architecture plans and code using the same process
that would be used to summarise a technical article.

Prompt Input

I want you to act as a data engineer. I will provide details about


a scenario where a customer needs to utilise data stored in ADLS Gen2.
It will be accessed by Azure Databricks and Microsoft Fabric to load
data for machine learning models. It will be your job to come up with
an architecture to address this need.

Answer from Azure OpenAI

To address the scenario where a customer needs to utilise data


stored in Azure Data Lake Storage Gen2 (ADLS Gen2), accessed by Azure
Databricks and Microsoft Fabric for loading data into machine learning
models, we’ll design an architecture that leverages these services
efficiently. This architecture aims to streamline data processing,
enhance data governance and ensure scalable machine learning
operations.

Architecture Overview

• Data storage layer: Azure Data Lake Storage Gen2: Acts as the
central repository for storing raw data, processed data and
machine learning artefacts. ADLS Gen2 is optimised for large-scale
analytics scenarios and supports hierarchical namespace, which
simplifies data management.

• Data processing and analytics layer: Azure Databricks: Serves


as the primary platform for data processing, analysis and
machine learning model development. Azure Databricks provides a
collaborative environment with support for multiple programming
languages and integrates seamlessly with ADLS Gen2 for direct
access to data.

• Data management and governance layer: Microsoft Fabric: Utilised


for managing and governing the data across the entire data
landscape. Microsoft Fabric allows creating a unified data
governance model that encompasses data in ADLS Gen2 and data
processed or generated by Azure Databricks.

• Machine learning operationalisation: Azure Machine Learning:


Integrates with Azure Databricks for deploying and managing machine
learning models. This service provides tools for model versioning,
monitoring and serving, facilitating the operationalisation of
machine learning models.

Detailed Workflow

1. Data ingestion: Data is ingested into ADLS Gen2 from various sources,
including structured databases, IoT devices, log files, etc. This
data is stored in raw format within a hierarchical file system
structure.

Note
The rest of the content generated by Azure OpenAI can be found in the Appendix.
Prompts and answers reflected here may not reflect your exact experience.
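The same prompt can also be sent programmatically. The snippet below is a minimal sketch, not part of the original walk-through: it assumes the openai Python package (version 1.x) and an Azure OpenAI resource with a GPT-4 deployment, and the endpoint, key and deployment name shown are placeholders you would replace with your own values.

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-azure-openai-api-key>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="<your-gpt-4-deployment-name>",  # the deployment name, not the base model name
    messages=[
        {"role": "system", "content": "I want you to act as a data engineer."},
        {"role": "user", "content": (
            "A customer needs to utilise data stored in ADLS Gen2, accessed by Azure "
            "Databricks and Microsoft Fabric to load data for machine learning models. "
            "Propose an architecture to address this need."
        )},
    ],
)

print(response.choices[0].message.content)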

Generate and correct code with Databricks Assistant

Databricks Assistant is an AI‑powered tool integrated within the Azure Databricks environment, designed to enhance coding productivity by assisting users with code generation, error resolution and documentation directly within Databricks notebooks. It uses the capabilities of Azure AI services, enhancing the way developers interact with data and code within the platform.

Databricks Assistant acts as a robust tool that supports developers by automating routine tasks, optimising code, explaining functionalities and troubleshooting, all within the Azure Databricks workspace. This not only speeds up the development process but also helps in maintaining a high standard of code quality and documentation, making it an asset in data engineering and analysis workflows.

Databricks Assistant enhances productivity in data science and engineering by utilising AI to help with code generation, error resolution and documentation within the Azure Databricks environment in the following ways:

• Code generation: Databricks Assistant simplifies coding by allowing users to input their requirements in natural language. It can generate executable SQL queries or transform code from one language to another, such as converting Python pandas code to PySpark. This feature speeds up development and reduces manual coding errors.

• Error resolution: Databricks Assistant can quickly identify and clarify coding errors, offering solutions by generating corrective code snippets. This is valuable for both novice and experienced programmers, as it provides immediate solutions to common syntax and runtime issues, thus minimising downtime.

• Documentation: It assists in documenting code by automatically generating comments that explain the functionality of code blocks, supporting the maintenance of clean and understandable codebases essential for long-term project sustainability and team collaboration.

• Contextual help and learning: Databricks Assistant aids in context by understanding the user's environment, including frequently used tables, schema and previous queries. It uses this context to provide precise answers and tailored code snippets, enhancing project-specific support.

• Visualisations and dashboards: In visualisation environments such as Lakeview, Databricks Assistant can generate data visualisations based on user prompts, allowing the quick creation and iteration of visual displays without needing deep technical expertise.

• Data ingestion and ETL processes: Databricks Assistant streamlines data ingestion and ETL tasks by automating code generation and optimisation, expediting the set-up and execution of data pipelines.

• Security and compliance: Azure Databricks ensures that interactions with Databricks Assistant are secure, adhering to user permissions and data governance policies, which makes it suitable for use in sensitive and regulated environments.

• Integration and accessibility: Databricks Assistant is accessible from notebooks, SQL editors and file editors, making it a versatile tool for various data tasks.

• Feedback and iteration: Users can provide feedback directly through the platform, which helps improve Databricks Assistant's accuracy and functionality. This ensures that the tool evolves in response to user needs and challenges.

Azure Databricks workspaces can be provisioned with a free 14-day trial that includes access to Assistant, allowing potential users to evaluate its capabilities and integrate it into their workflows without an initial investment.

Here are some examples of how Databricks Assistant can be used in different scenarios to enhance productivity and simplify tasks:

1. Code generation

Scenario: A user needs to extract and summarise sales data by region from a DataFrame.

User input:

Generate a SQL query to sum all sales by region from the sales_data table.

Response:

SELECT region, SUM(sales) AS total_sales FROM sales_data GROUP BY region;

Utility: This helps the user quickly get the desired code without having to manually write the SQL query.

2. Fixing errors

Scenario: A user has written PySpark code that fails due to a syntax error.

User input:

This code isn't working. Can you help fix it?

Diagnosis and fix: Databricks Assistant identifies a missing comma in a DataFrame operation and suggests the corrected code with the change highlighted.

Utility: The user gets immediate feedback and correction suggestions, speeding up troubleshooting and reducing frustration.

3. Code documentation

Scenario: A developer wants to add comments to a complex function to improve code readability.

User input:

Can you document this function explaining each step?

Response: Databricks Assistant adds comments before each significant line or block of code explaining what it does, such as initialising variables, error handling and logic flows.

Utility: Ensures that the code is understandable for future reference or for other team members, enhancing maintainability.

These examples illustrate the practical benefits of Databricks Assistant in real‑world development environments, streamlining the coding process, simplifying error resolution and ensuring thorough documentation.

Note
Prompts and answers reflected here may not reflect your exact experience.

Explore real-world use cases with hands‑on examples

In earlier examples, you used Python code to read the data and aggregate it to answer some business questions. This section looks at an alternative to Python code to read data and how AI can be used to allow business users to query lakehouse data using English instead of a query language.

Use the English SDK for Spark to write queries in Azure Databricks and Fabric

To utilise the English SDK for Apache Spark, the following requirements should be met:

Note
Azure Databricks recommends using GPT-4.

1. Install the English SDK package: Begin by adding the SDK to your environment. Use the %pip install pyspark-ai --upgrade command in your notebook to ensure you have the latest version.

2. Restart the Python kernel: After installation, you need to restart the Python kernel to apply the updates. Execute dbutils.library.restartPython() in a new cell to reset the environment.

3. Set the OpenAI API key: Your API key from OpenAI is necessary for authentication. Implement it by setting an environment variable with the os.environ['OPENAI_API_KEY'] = '<your-openai-api-key>' Python code, replacing <your-openai-api-key> with your actual API key.

4. Activate the SDK: To use the SDK, activate it within your notebook. This involves initialising the SDK with your preferred language model (such as GPT-4) and then activating it to start interpreting English queries.

5. Create a DataFrame: Use SQL queries within the notebook to fetch data from your Azure Databricks workspace and save it as a DataFrame. This DataFrame will be the basis for your English queries.

6. Query using English: Finally, query the DataFrame by asking questions in plain English. The SDK interprets these questions and executes the corresponding SQL queries, returning the results directly to your notebook.

An example query using English with the English SDK for Apache Spark could be something such as:

What was the average trip distance for each day during the month of January 2016? Print the averages to the nearest tenth.

This query demonstrates how plain English can be utilised to conduct data analysis activities, such as calculating averages from a dataset, with the English SDK, allowing Apache Spark to interpret and execute English-language instructions.

Another example query using English for the English SDK for Apache Spark could be:

Show me the total revenue for each product category in the last quarter.

This type of query illustrates how users can request specific financial metrics, such as total revenue, broken down by categories, over a defined period, such as the last quarter, using natural language. This approach simplifies complex data analysis tasks into straightforward English questions.
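Putting steps 3 to 6 together, a minimal sketch of an English SDK session might look like the following. It is illustrative only: it assumes the pyspark-ai package and an OpenAI GPT-4 key are available, and the table name nyctaxi_trips is a hypothetical stand-in for whatever lakehouse table you register as a DataFrame.

import os
from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI

os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

# Initialise the SDK with GPT-4 and activate it so DataFrames gain the .ai accessor
spark_ai = SparkAI(llm=ChatOpenAI(model_name="gpt-4", temperature=0))
spark_ai.activate()

# Create a DataFrame from lakehouse data, then query it in plain English
trips = spark.sql("SELECT * FROM nyctaxi_trips")  # hypothetical table name
result = trips.ai.transform(
    "What was the average trip distance for each day during the month of January 2016?"
)
result.show()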
set-up required
Another example query using English for
the English SDK for Apache Spark could be: • An intuitive, low-code interface for data
exploration and processing

Show me the total revenue for • Enhanced data security through


each product category in the
last quarter.
integrated enterprise-level features

• The ability to analyse data in various


This type of query illustrates how users can formats (including CSV, TXT, JSON,
request specific financial metrics, such as Parquet and Delta Lake) with the robust
total revenue, broken down by categories, capabilities of Spark
over a defined period, such as the last

Creating notebooks

When creating a notebook, users have two options: create a new one or import an existing one. Organisations can create a new notebook by following the familiar Fabric item creation workflow:

1. Initiate a new notebook directly from the Fabric Data Engineering or the Data Science home page, or through the workspace New option.

2. Choose Import Notebook in the same window to import an existing notebook, such as an Azure Databricks notebook file.

3. Once you have a notebook open, you can add code to it to write data to OneLake.

Working with data in OneLake is straightforward and does not involve a complex set-up to access the data.

Loading data into OneLake via a Microsoft Fabric data engineering notebook

from pyspark.sql import SparkSession
import pandas as pd

# Initialise the Spark session (in a Fabric notebook this is usually already available as `spark`)
spark = SparkSession.builder.appName("ParkDataImport").getOrCreate()

# URL to the CSV file
data_url = "https://www.dropbox.com/s/268uogek0mcypn9/park-data.csv?raw=1"

# Spark cannot read directly from an HTTPS URL, so fetch the CSV with pandas
# and convert the result to a Spark DataFrame
pdf = pd.read_csv(data_url)
df = spark.createDataFrame(pdf)

# Names for the files and tables created below
csv_table_name = "park_data_csv"
parquet_table_name = "park_data_parquet"
delta_table_name = "park_data_delta"

# Save the DataFrame as CSV files to the Files section of the default lakehouse
df.write.mode("overwrite").format("csv").save("Files/" + csv_table_name)

# Save the DataFrame as Parquet files to the Files section of the default lakehouse
df.write.mode("overwrite").format("parquet").save("Files/" + parquet_table_name)

# Save the DataFrame as a Delta Lake table in the Tables section of the default lakehouse
df.write.mode("overwrite").format("delta").saveAsTable(delta_table_name)

# Save the DataFrame as a Delta Lake table, appending the data to the existing table
# Make sure the table exists and the schema matches to avoid errors
df.write.mode("append").format("delta").saveAsTable(delta_table_name)

# Register a temporary view so the data can be queried with Spark SQL in the next section
df.createOrReplaceTempView("park_data_view")
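The files and tables written above can be read back just as easily. For example, using the names defined in the script:

# Read the Parquet copy back from the Files section of the default lakehouse
parquet_df = spark.read.parquet("Files/" + parquet_table_name)
parquet_df.show(5)

# The Delta table in the Tables section can be queried directly by name
spark.sql(f"SELECT COUNT(*) AS row_count FROM {delta_table_name}").show()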

Reading and data analysis

Once the data has been successfully uploaded, try reading and analysing the data:

# Basic Data Analysis

# Count of animal sightings by type (excluding squirrels)
animal_sightings = spark.sql("""
    SELECT Animal_Type, COUNT(*) as Total_Sightings
    FROM park_data_view
    WHERE Animal_Type != 'Squirrel'
    GROUP BY Animal_Type
    ORDER BY Total_Sightings DESC
""")
animal_sightings.show()

# Average temperature and most common weather conditions
avg_temp = spark.sql("""
    SELECT AVG(Temperature) as Average_Temperature
    FROM park_data_view
""")
avg_temp.show()

common_weather = spark.sql("""
    SELECT Weather, COUNT(*) as Frequency
    FROM park_data_view
    GROUP BY Weather
    ORDER BY Frequency DESC
    LIMIT 5
""")
common_weather.show()

# Total count of squirrel sightings
squirrel_sightings = spark.sql("""
    SELECT COUNT(*) as Total_Squirrel_Sightings
    FROM park_data_view
    WHERE Animal_Type = 'Squirrel'
""")
squirrel_sightings.show()
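The same questions can also be answered with the DataFrame API instead of Spark SQL. As a brief illustration, here is the squirrel-sightings count expressed against the df DataFrame loaded earlier:

from pyspark.sql import functions as F

# Count squirrel sightings with DataFrame operations rather than SQL
squirrel_count = (
    df.filter(F.col("Animal_Type") == "Squirrel")
      .agg(F.count("*").alias("Total_Squirrel_Sightings"))
)
squirrel_count.show()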

Creating and modifying a Delta table from Parquet in Azure Databricks with changes reflected in Fabric
Azure Databricks and Fabric provide a data lakehouse environment that allows businesses to
access and analyse their data simultaneously, using different tools. This supports a wide range of
data processing activities on the same set of data, making it easier for organisations to manage
and derive insights from their information efficiently.

1. Open your Azure Databricks workspace in a browser of your choice and launch a new Azure
Databricks notebook.

Figure 7: Example notebook



2. Copy and paste the following Python script into your new notebook and execute it to create a Delta table within your ADLS Gen2 account. The script reads some sample Parquet data and then writes it as a Delta table to your ADLS account:

# python
# Adjust the file path to point to your sample Parquet data using the following format:
# "abfss://<container name>@<storage account name>.dfs.core.windows.net/<filepath>"

# The line below reads Parquet files from your ADLS account
df = spark.read.format("parquet").load("abfss://datasetsv1@olsdemo.dfs.core.windows.net/demo/full/dimension_city/")

# This line writes the data back into your ADLS account as a Delta table
df.write.mode("overwrite").format("delta").save("abfss://datasetsv1@olsdemo.dfs.core.windows.net/demo/adb_dim_city_delta/")

And, of course, Azure Databricks can also read the data in the ADLS account.
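For example, the Delta table written in the previous step can be read straight back into a DataFrame; the path simply mirrors the one used in the write above:

# Read the Delta table created above back from the ADLS Gen2 account
df_city = spark.read.format("delta").load("abfss://datasetsv1@olsdemo.dfs.core.windows.net/demo/adb_dim_city_delta/")
df_city.show(5)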

3. Azure Databricks can also modify the same data that was originally created with Fabric. To see this in action, append some new rows to the Delta Lake tables you created in OneLake:

# Import the necessary libraries
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("AppendToDeltaTable").getOrCreate()

# Define the path to your Delta Lake table in OneLake
# Replace '<your-delta-table-path>' with the actual path to your Delta Lake table
delta_table_path = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<your-delta-table-path>"

# Create the new rows you want to append
# Replace the column names and values with those relevant to your table
new_rows = [
    ("NewValue1", 10),
    ("NewValue2", 20)
    # Add as many rows as needed
]

# Define the schema based on your Delta Lake table structure
# This is an example schema; adjust it to match your table's columns and data types
schema = ["ColumnName1", "ColumnName2"]

# Create a DataFrame with the new data
new_data_df = spark.createDataFrame(new_rows, schema)

# Append the new data to the Delta Lake table
# Ensure the table format is set to 'delta' for Delta Lake compatibility
new_data_df.write.format("delta").mode("append").save(delta_table_path)

# Verify by reading back the data from the Delta Lake table
df = spark.read.format("delta").load(delta_table_path)
df.show()
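To confirm that the appended rows are also visible on the Fabric side, a Fabric notebook can read the same table back. The sketch below assumes the identical abfss path (or an equivalent OneLake shortcut) is reachable from your Fabric lakehouse:

# In a Microsoft Fabric notebook: read the same Delta table and confirm
# the rows appended from Azure Databricks are present
delta_table_path = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<your-delta-table-path>"
fabric_df = spark.read.format("delta").load(delta_table_path)
fabric_df.show()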

As the examples illustrate, a data lakehouse built on open platforms enables enterprises to use a variety of engines to work on the same copy of the data at the same time.

Azure Databricks connector within Power BI

The Power BI connector for Azure Databricks provides seamless integration between Power BI and Azure Databricks, enabling organisations to connect, analyse and visualise data stored in Azure Databricks with ease. This integration supports Microsoft Entra ID authentication, removing the need for administrators to generate personal access tokens for connection. It is designed to enhance data connectivity and analysis experiences, allowing for efficient and secure data visualisation directly from the data lake.

1. Obtain your Azure Databricks server hostname and HTTP path for setting up the connection in Power BI.

2. Launch Power BI Desktop.

3. Choose Get Data from the home screen or navigate through File > Get Data.

4. Search for Azure Databricks.

5. Select Azure Databricks connector and then click Connect.

6. Input the server hostname and HTTP path you obtained earlier.

7. Decide between the Import and DirectQuery modes for your data connectivity. For more insights into these options, consider reading about the use of DirectQuery in Power BI Desktop.

8. Select your preferred authentication method:

   a. Personal Access Token: Enter your Azure Databricks personal access token.

   b. Microsoft Entra ID: Choose Sign in and follow the prompts.

   c. Username/Password: This option is typically not applicable.

9. After authentication, Power BI will present you with the Navigator window. Here, you can select the Azure Databricks data you wish to query. If your workspace has Unity Catalogue enabled, you’ll first select a catalogue, followed by a schema and a table.

For workloads that need the processing power and flexibility offered by Azure Databricks, enterprises can use the advanced visualisation capabilities of Power BI along with Azure Databricks.

Achieve excellence with Azure Databricks and Microsoft Fabric

The integration of Azure Databricks and Microsoft Fabric represents a transformative approach to managing and analysing data within modern cloud environments. Azure Databricks provides a high-performance platform for data processing and AI-driven analytics, while Fabric enhances these capabilities with robust data management tools. This combination allows organisations to harness advanced analytics and AI solutions more effectively.

Together, Azure Databricks and Fabric streamline analytics workloads by providing seamless data access without the need for redundant data copies. This integration supports direct queries from Power BI, leading to improved performance and a simplified data architecture.

The synergy between Azure Databricks and Fabric unlocks the full potential of the lakehouse architecture, enhancing the ability to manage, process and analyse large datasets more effectively. This enables organisations to deploy AI solutions with ease and tackle complex analytical challenges.

Furthermore, the combined capabilities of Azure Databricks and Fabric address the entire lifecycle of AI and machine learning projects, from data ingestion and storage to analysis and deployment. This integration helps organisations maintain control over their data processes and ensure that AI initiatives are both scalable and compliant with industry regulations.

Organisations can enhance their data governance, boost operational efficiency and foster innovation in AI and analytics with the powerful integration of Azure Databricks and Fabric, making them well equipped in the face of a data-driven economy and enabling them to achieve their strategic goals.

Next steps
• Talk to a sales specialist.
• Get more value from your data with Microsoft Fabric.
• Unlock insights from your data and build AI solutions with Azure Databricks.

• Watch the webinar to learn more about how Azure Databricks and Microsoft Fabric are better together.



Appendix
Prompt input

I want you to act as a data engineer. I will provide details about a scenario where a customer needs to utilise data stored in ADLS Gen2. It will be accessed by Azure Databricks and Microsoft Fabric to load data for machine learning models. It will be your job to come up with an architecture to address this need.

Answer from Azure OpenAI

To address the scenario where a customer needs to utilise data stored in Azure Data Lake Storage Gen2 (ADLS Gen2), accessed by Azure Databricks and Microsoft Fabric for loading data into machine learning models, we’ll design an architecture that leverages these services efficiently. This architecture aims to streamline data processing, enhance data governance and ensure scalable machine learning operations.

Architecture Overview

• Data storage layer: Azure Data Lake Storage Gen2: Acts as the central repository for storing raw data, processed data and machine learning artefacts. ADLS Gen2 is optimised for large-scale analytics scenarios and supports hierarchical namespace, which simplifies data management.

• Data processing and analytics layer: Azure Databricks: Serves as the primary platform for data processing, analysis and machine learning model development. Azure Databricks provides a collaborative environment with support for multiple programming languages and integrates seamlessly with ADLS Gen2 for direct access to data.

• Data management and governance layer: Microsoft Fabric: Utilised for managing and governing the data across the entire data landscape. Microsoft Fabric allows creating a unified data governance model that encompasses data in ADLS Gen2 and data processed or generated by Azure Databricks.

• Machine learning operationalisation: Azure Machine Learning: Integrates with Azure Databricks for deploying and managing machine learning models. This service provides tools for model versioning, monitoring and serving, facilitating the operationalisation of machine learning models.

Detailed Workflow

1. Data ingestion: Data is ingested into ADLS Gen2 from various sources, including structured databases, IoT devices, log files, etc. This data is stored in raw format within a hierarchical file system structure.

2. Data processing and exploration:

• Azure Databricks notebooks are used to access raw data stored in ADLS Gen2. Data engineers and scientists process and explore this data using Spark, performing tasks like data cleaning, transformation and feature engineering.

• Processed data can be stored back into ADLS Gen2 in an optimised format (e.g., Delta Lake) for further analysis or machine learning.

3. Data governance and lineage:

• Microsoft Fabric is configured to catalogue data assets in ADLS Gen2 and data products generated by Azure Databricks. It provides a unified view of the data, enforcing policies, data quality rules and security measures.

• Microsoft Fabric enables setting up data lineage and tracking to maintain visibility into data transformations and usage throughout the data lifecycle.

4. Machine learning model training and deployment:

• Data scientists use Azure Databricks for developing and training machine learning models, leveraging the integrated MLflow for experiment tracking and model management.

• Trained models are registered in Azure Machine Learning, which is used to deploy these models to production as web services or containers, enabling real-time or batch scoring.

5. Monitoring and management:

• Azure Machine Learning provides model monitoring capabilities, tracking model performance and data drift over time. It facilitates retraining workflows when models become stale.

• Microsoft Fabric oversees the governance and compliance aspect, ensuring that data access and model deployments adhere to organisational and regulatory standards.

This architecture provides a comprehensive solution for leveraging ADLS Gen2 data in Azure Databricks and Microsoft Fabric, facilitating efficient machine learning model development, governance and operationalisation. It ensures scalability, security and compliance, catering to the dynamic needs of modern data-driven organisations.

Note
Prompts and answers reflected here may not reflect your exact experience.

Prompt input

Based on the above, can you provide detailed build instructions with code?

Answer from Azure OpenAI

Creating a detailed build guide with code for integrating Azure Data Lake Storage Gen2 (ADLS Gen2), Azure Databricks and Microsoft Fabric into a cohesive architecture for machine learning requires several steps. Below, we’ll outline key phases in this process, including set-up, data processing and machine learning model lifecycle management. This guide assumes a basic understanding of Azure services and a subscription where you can deploy these services.

1. Set-up and Configuration

Azure Data Lake Storage Gen2

1. Create an ADLS Gen2 Storage Account:

• Navigate to the Azure Portal.

• Create a new storage account, selecting your subscription, resource group and the ‘StorageV2’ account kind.

• Enable the hierarchical namespace to use ADLS Gen2 features.

2. Configure Storage Account Access:

• Within the storage account, go to the ‘Access keys’ section and note down the connection string for later use.

• Alternatively, set up Azure Active Directory (Microsoft Entra ID) authentication for more secure access.

Azure Databricks

1. Create an Azure Databricks Workspace:

• In the Azure Portal, create a new Azure Databricks service.

• Once the deployment is complete, navigate to the Azure Databricks workspace and launch the workspace.

2. Create a Cluster in Azure Databricks:

• Inside the Azure Databricks workspace, go to ‘Compute’.

• Create a new cluster, ensuring it’s compatible with the libraries you plan to use.

3. Mount ADLS Gen2 to Azure Databricks (for direct access to data):

• In an Azure Databricks notebook, use the following code to mount your ADLS Gen2 storage for direct access:

Python

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<Application-ID>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<Directory-ID>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://<filesystem>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs
)

• Replace placeholders with your actual Microsoft Entra ID application registration details and ADLS Gen2 storage details.

Microsoft Fabric

1. Setting Up Microsoft Fabric:

• Currently, Microsoft Fabric is a conceptual integration point and may not have direct customer-facing set-up instructions available publicly. Typically, set-up would involve configuring your environment to catalogue and manage data assets across Azure Databricks and ADLS Gen2.

• For demonstration purposes, we’ll proceed to the next steps assuming Microsoft Fabric is used to govern and secure data access, ensuring compliance and data quality.

2. Data Processing with Azure Databricks

1. Load Data from ADLS Gen2:

• Use the mounted path to read data into an Azure Databricks DataFrame:

Python

df = spark.read.csv("/mnt/<mount-name>/path/to/data.csv")

2. Data Transformation:

• Perform necessary data transformations using Spark DataFrame operations.

3. Save Processed Data:

• Save the processed data back to ADLS Gen2 in an optimised format, such as Delta Lake:

Python

df.write.format("delta").save("/mnt/<mount-name>/path/to/processed/data/")

3. Machine Learning Model Training and Deployment

1. Model Training:

• Use Azure Databricks to train a machine learning model using the processed data.

• Example using Spark MLlib for logistic regression:

Python

from pyspark.ml.classification import LogisticRegression

# Assuming df is a DataFrame prepared for training
# Train a model
lr = LogisticRegression(featuresCol='features', labelCol='label')
model = lr.fit(df)

# Save the model
model.write().overwrite().save("/mnt/<mount-name>/path/to/model")

2. Model Deployment:

• Integrate with Azure Machine Learning for model deployment, monitoring and management. This part is complex and typically involves registering the model, creating a scoring script and deploying the model as a web service.

Note
Prompts and answers reflected here may not reflect your exact experience.

©2024 Microsoft Corporation. All rights reserved. This document is provided ‘as-is’. Information and views expressed in this document, including URLs and other
internet website references, may change without notice. You bear the risk of using it. This document does not provide you with any legal rights to any intellectual
property in any Microsoft product. You may copy and use this document for your internal reference purposes.
