
Data Science documentation in Microsoft Fabric
Learn more about the Data Science experience in Microsoft Fabric

About Data Science

OVERVIEW

What is Data Science?

Get started

GET STARTED

Introduction to Data science tutorials

Data science roles and permissions

Machine learning models

Tutorials

TUTORIAL

How to use end-to-end AI samples

Retail recommendation

Fraud detection

Forecasting

Discover relationships in a Power BI dataset using semantic link

Analyze functional dependencies in a Power BI dataset using semantic link

Preparing Data
HOW-TO GUIDE

How to accelerate data prep with Data Wrangler

How to read and write data with Pandas

Training Models

OVERVIEW

Training Overview

HOW-TO GUIDE

Train with Spark MLlib

Train models with Scikit-learn

Train models with PyTorch

Tracking models and experiments

HOW-TO GUIDE

Machine learning experiments

Fabric autologging

Using SynapseML

GET STARTED

Your First SynapseML Model

Install another version of SynapseML

HOW-TO GUIDE

Use OpenAI for big data with SynapseML

Use Cognitive Services with SynapseML


Deep learning with ONNX and SynapseML

Perform multivariate anomaly detection with SynapseML

Tune hyperparameters with SynapseML

Using R

OVERVIEW

R overview

HOW-TO GUIDE

How to use SparkR

How to use sparklyr

R library management

R visualization

Reference docs

REFERENCE

SemPy Python SDK

Using semantic link

OVERVIEW

What is semantic link?

HOW-TO GUIDE

Read from Power BI and write data consumable by Power BI

Detect, explore, and validate functional dependencies in your data

Explore and validate relationships in datasets


What is Data Science in Microsoft Fabric?
Article • 10/04/2023

Important

Microsoft Fabric is in preview.

Microsoft Fabric offers Data Science experiences to empower users to complete end-to-
end data science workflows for the purpose of data enrichment and business insights.
You can complete a wide range of activities across the entire data science process, all
the way from data exploration, preparation and cleansing to experimentation, modeling,
model scoring and serving of predictive insights to BI reports.

Microsoft Fabric users can access a Data Science Home page. From there, they can
discover and access various relevant resources. For example, they can create machine
learning Experiments, Models and Notebooks. They can also import existing Notebooks
on the Data Science Home page.

You might already know how a typical data science process works; as a well-known process, most machine learning projects follow it.

At a high level, the process involves these steps:


Problem formulation and ideation
Data discovery and pre-processing
Experimentation and modeling
Enrich and operationalize
Gain insights

This article describes the Microsoft Fabric Data Science capabilities from a data science
process perspective. For each step in the data science process, this article summarizes
the Microsoft Fabric capabilities that can help.

Problem formulation and ideation


Data Science users in Microsoft Fabric work on the same platform as business users and
analysts. Data sharing and collaboration become more seamless across different roles
as a result. Analysts can easily share Power BI reports and datasets with data science
practitioners. The ease of collaboration across roles in Microsoft Fabric makes hand-offs
during the problem formulation phase much easier.

Data discovery and pre-processing


Microsoft Fabric users can interact with data in OneLake using the Lakehouse item. A lakehouse easily attaches to a notebook, where users can browse and interact with the data.

Users can easily read data from a lakehouse directly into a pandas DataFrame, which makes seamless data reads from OneLake possible for exploration.
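
For example, here is a minimal sketch of reading a CSV file from the attached default lakehouse into pandas; the file path is a placeholder for illustration:

Python

import pandas as pd

# Read a CSV file from the Files section of the attached default lakehouse.
# The path below is a hypothetical example; substitute a file that exists in your lakehouse.
df = pd.read_csv("/lakehouse/default/Files/churn/raw/churn.csv")
df.head()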

A powerful set of tools is available for data ingestion and orchestration through data integration pipelines, a natively integrated part of Microsoft Fabric. Easy-to-build data pipelines can access and transform the data into a format that machine learning can consume.

Data exploration
An important part of the machine learning process is to understand data through
exploration and visualization.

Depending on the data storage location, Microsoft Fabric offers a set of different tools
to explore and prepare the data for analytics and machine learning. Notebooks become
one of the quickest ways to get started with data exploration.

Apache Spark and Python for data preparation


Microsoft Fabric offers capabilities to transform, prepare, and explore your data at scale.
With Spark, users can leverage PySpark/Python, Scala, and SparkR/SparklyR tools for
data pre-processing at scale. Powerful open-source visualization libraries can enhance
the data exploration experience to help better understand the data.
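
As an illustration, here is a minimal PySpark sketch of reading lakehouse data and applying a simple transformation at scale; the table and column names are assumptions for illustration:

Python

# Read a Delta table from the attached lakehouse into a Spark DataFrame and
# apply a simple transformation. The table and column names are hypothetical.
df = spark.read.format("delta").load("Tables/churn_raw")
df_prepared = df.dropna().filter(df["Balance"] > 0)
display(df_prepared)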

Data Wrangler for seamless data cleansing


The Microsoft Fabric notebook experience includes Data Wrangler, a tool that prepares data and generates Python code. It makes it easy to accelerate tedious and mundane tasks such as data cleansing, and to build repeatability and automation through generated code. Learn more about Data Wrangler in the Data Wrangler section of this document.

Experimentation and ML modeling


With tools like PySpark/Python and sparklyr/R, notebooks can handle machine learning model training.

ML algorithms and libraries can help train machine learning models, and library management tools can install these libraries and algorithms. Users therefore have the option to use a wide variety of popular machine learning libraries, such as scikit-learn, to complete their ML model training in Microsoft Fabric.

MLflow experiments and runs can track the ML model training. Microsoft Fabric offers a built-in MLflow experience that users can interact with to log experiments and models. Learn more about how to use MLflow to track experiments and manage models in Microsoft Fabric.
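
As a minimal sketch (assuming a prepared training set X_train, y_train and a placeholder experiment name), logging a scikit-learn training run with the built-in MLflow integration can look like this:

Python

import mlflow
from sklearn.linear_model import LogisticRegression

mlflow.set_experiment("sample-experiment")  # hypothetical experiment name
mlflow.autolog()  # automatically capture parameters and metrics

with mlflow.start_run():
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)  # X_train and y_train are assumed to exist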

SynapseML
The SynapseML open-source library (previously known as MMLSpark), which Microsoft owns and maintains, simplifies the creation of massively scalable machine learning pipelines. As a tool ecosystem, it expands the Apache Spark framework in several new directions.
SynapseML unifies several existing machine learning frameworks and new Microsoft
algorithms into a single, scalable API. The open-source SynapseML library includes a rich
ecosystem of ML tools for development of predictive models, as well as leveraging pre-
trained AI models from Azure AI services. Learn more about SynapseML .
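
For example, here is a minimal sketch of training a distributed model with SynapseML's LightGBM estimator on a Spark DataFrame; the DataFrame and column names are assumptions:

Python

from synapse.ml.lightgbm import LightGBMClassifier

# train_df is assumed to be a Spark DataFrame with a "features" vector column
# (for example, produced by pyspark.ml.feature.VectorAssembler) and a "label" column.
model = LightGBMClassifier(featuresCol="features", labelCol="label", numIterations=100)
trained_model = model.fit(train_df)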

Enrich and operationalize


Notebooks can handle machine learning model batch scoring with open-source libraries
for prediction, or the Microsoft Fabric scalable universal Spark Predict function, which
supports MLflow packaged models in the Microsoft Fabric model registry.
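
Part 4 of the tutorial series covers this in detail; as a condensed sketch, batch scoring a registered MLflow model with the MLFlowTransformer wrapper can look like this (the model name, version, and DataFrame are placeholders):

Python

from synapse.ml.predict import MLFlowTransformer

# df_test is assumed to be a Spark DataFrame containing the model's input columns.
model = MLFlowTransformer(
    inputCols=list(df_test.columns),
    outputCol="predictions",
    modelName="lgbm_sm",   # a registered model name used later in this documentation
    modelVersion=1,
)
display(model.transform(df_test))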

Gain insights
In Microsoft Fabric, predicted values can easily be written to OneLake and seamlessly consumed from Power BI reports with the Power BI Direct Lake mode. This makes it easy for data science practitioners to share results from their work with stakeholders, and it also simplifies operationalization.

Notebooks that contain batch scoring can be scheduled to run using the Notebook
scheduling capabilities. Batch scoring can also be scheduled as part of data pipeline
activities or Spark jobs. Power BI automatically gets the latest predictions without needing to load or refresh the data, thanks to Direct Lake mode in Microsoft Fabric.

Data exploration with Semantic Link


Data scientists and business analysts spend lots of time trying to understand, clean, and
transform data before they can start any meaningful analysis. Business analysts typically
work with Power BI datasets and encode their domain knowledge and business logic
into Power BI measures. On the other hand, data scientists can work with the same
datasets, but typically in a different code environment or language.
Semantic Link allows data scientists to establish a connection between Power BI datasets
and the Synapse Data Science in Microsoft Fabric experience via the SemPy Python
library. SemPy simplifies data analytics by capturing and leveraging data semantics as
users perform various transformations on their datasets. By leveraging semantic link,
data scientists can:

avoid the need to re-implement business logic and domain knowledge in their
code
easily access and use Power BI measures in their code
use semantics to power new experiences, such as semantic functions
explore and validate functional dependencies and relationships between data

Through the use of SemPy, organizations can expect to see:

increased productivity and faster collaboration across teams that operate on the
same datasets
increased cross-collaboration across business intelligence and AI teams
reduced ambiguity and an easier learning curve when onboarding onto a new
model or dataset

For more information on Semantic Link, see What is Semantic Link?.
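
As an example, here is a minimal sketch of reading a Power BI dataset table and evaluating a measure with SemPy; the dataset, table, and measure names are hypothetical placeholders:

Python

import sempy.fabric as fabric

# Dataset, table, and measure names below are hypothetical placeholders.
df_customers = fabric.read_table("Customer Profitability Sample", "Customer")
df_revenue = fabric.evaluate_measure("Customer Profitability Sample", "Total Revenue")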

Next steps
To get started with end-to-end data science samples, see Data Science Tutorials
To learn more about data preparation and cleansing with Data Wrangler, see Data Wrangler
To learn more about tracking experiments, see Machine learning experiment
To learn more about managing models, see Machine learning model
To learn more about batch scoring with PREDICT, see Score models with PREDICT
Serve predictions from Lakehouse to Power BI with Direct Lake mode



Data science end-to-end scenario: introduction and architecture
Article • 09/12/2023

This set of tutorials demonstrates a sample end-to-end scenario in the Fabric data
science experience. You'll implement each step from data ingestion, cleansing, and
preparation, to training machine learning models and generating insights, and then
consume those insights using visualization tools like Power BI.

Important

Microsoft Fabric is in preview.

If you're new to Microsoft Fabric, see What is Microsoft Fabric?.

Introduction
The lifecycle of a data science project typically includes (often iteratively) the following steps:

Business understanding
Data acquisition
Data exploration, cleansing, preparation, and visualization
Model training and experiment tracking
Model scoring and generating insights.

The goals and success criteria of each stage depend on collaboration, data sharing, and documentation. The Fabric data science experience consists of multiple natively built features that enable collaboration, data acquisition, sharing, and consumption in a seamless way.

In these tutorials, you take the role of a data scientist who has been given the task to
explore, clean, and transform a dataset containing taxicab trip data. You then build a
machine learning model to predict trip duration at scale on a large dataset.

You'll learn to perform the following activities:

1. Use the Fabric notebooks for data science scenarios.

2. Ingest data into a Fabric lakehouse using Apache Spark.


3. Load existing data from the lakehouse delta tables.

4. Clean and transform data using Apache Spark.

5. Create experiments and runs to train a machine learning model.

6. Register and track trained models using MLflow and the Fabric UI.

7. Run scoring at scale and save predictions and inference results to the lakehouse.

8. Visualize predictions in Power BI using DirectLake.

Architecture
In this tutorial series, we showcase a simplified end-to-end data science scenario that
involves:

1. Ingesting data from an external data source.


2. Data exploration and visualization.
3. Data cleansing, preparation, and feature engineering.
4. Model training and evaluation.
5. Model batch scoring and saving predictions for consumption.
6. Visualizing prediction results.

Different components of the data science scenario


Data sources - Fabric makes it quick and easy to connect to Azure Data Services, other cloud platforms, and on-premises data sources to ingest data from. Using Fabric notebooks, you can ingest data from the built-in Lakehouse, Data Warehouse, Power BI datasets, and various custom data sources supported by Apache Spark and Python. This tutorial series focuses on ingesting and loading data from a lakehouse.

Explore, clean, and prepare - The data science experience on Fabric supports data cleansing, transformation, exploration, and featurization by using built-in experiences on Spark as well as Python-based tools like Data Wrangler and the SemPy library. This tutorial series showcases data exploration using the Python library seaborn and data cleansing and preparation using Apache Spark.

Models and experiments - Fabric enables you to train, evaluate, and score machine
learning models by using built-in experiment and model items with seamless integration
with MLflow for experiment tracking and model registration/deployment. Fabric also
features capabilities for model prediction at scale (PREDICT) to gain and share business
insights.

Storage - Fabric standardizes on Delta Lake, which means all the engines of Fabric can interact with the same dataset stored in a lakehouse. This storage layer allows you to store both structured and unstructured data, in both file-based storage and tabular format. The datasets and files stored can be easily accessed via all Fabric experience items like notebooks and pipelines.

Expose analysis and insights - Data from a lakehouse can be consumed by Power BI, an industry-leading business intelligence tool, for reporting and visualization. Data
persisted in the lakehouse can also be visualized in notebooks using Spark or Python
native visualization libraries like matplotlib , seaborn , plotly , and more. Data can also
be visualized using the SemPy library that supports built-in rich, task-specific
visualizations for the semantic data model, for dependencies and their violations, and
for classification and regression use cases.

Next steps
Prepare your system for the data science tutorial

Prepare your system for data science tutorials
Article • 09/26/2023

Before you begin the data science end-to-end tutorial series, learn about prerequisites,
how to import notebooks, and how to attach a lakehouse to those notebooks.

Important

Microsoft Fabric is in preview.

Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview)
trial.

Sign in to Microsoft Fabric .

Switch to the Data Science experience by using the experience switcher icon on the
left side of your home page.

If you don't have a Microsoft Fabric lakehouse, create one by following the steps in
Create a lakehouse in Microsoft Fabric.

Import tutorial notebooks


We utilize the notebook item in the Data Science experience to demonstrate various
Fabric capabilities. The notebooks are available as Jupyter notebook files that can be
imported to your Fabric-enabled workspace.

1. Download your notebook(s). Make sure to download the files by using the "Raw"
file link in GitHub.

For the Get started notebooks, download the notebook(.ipynb) files from the
parent folder: data-science-tutorial .
For the Tutorials notebooks, download the notebooks(.ipynb) files from the
parent folder ai-samples .

2. On the Data science experience homepage, select Import notebook and upload
the notebook files that you downloaded in step 1.

3. Once the notebooks are imported, select Go to workspace in the import dialog
box.

4. The imported notebooks are now available in your workspace for use.

5. If the imported notebook includes output, select the Edit menu, then select Clear
all outputs.

Attach a lakehouse to the notebooks


To demonstrate Fabric lakehouse features, many of the tutorials require attaching a
default lakehouse to the notebooks. The following steps show how to add an existing
lakehouse to a notebook in a Fabric-enabled workspace.

1. Open the notebook in the workspace.

2. Select Add lakehouse in the left pane and select Existing lakehouse to open the
Data hub dialog box.

3. Select the workspace and the lakehouse you intend to use with these tutorials and
select Add.

4. Once a lakehouse is added, it's visible in the lakehouse pane in the notebook UI
where tables and files stored in the lakehouse can be viewed.

Note

Before executing each notebook, you need to perform these steps on that notebook.

Next steps
Part 1: Ingest data into Fabric lakehouse using Apache Spark



Tutorial Part 1: Ingest data into a Microsoft Fabric lakehouse using Apache Spark
Article • 10/20/2023

In this tutorial, you'll ingest data into Fabric lakehouses in Delta Lake format. Some important terms to understand:

Lakehouse - A lakehouse is a collection of files/folders/tables that represent a database over a data lake used by the Spark engine and
SQL engine for big data processing and that includes enhanced capabilities for ACID transactions when using the open-source Delta
formatted tables.

Delta Lake - Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata management, and batch and
streaming data processing to Apache Spark. A Delta Lake table is a data table format that extends Parquet data files with a file-based
transaction log for ACID transactions and scalable metadata management.

Azure Open Datasets are curated public datasets you can use to add scenario-specific features to machine learning solutions for more
accurate models. Open Datasets are in the cloud on Microsoft Azure Storage and can be accessed by various methods including
Apache Spark, REST API, Data factory, and other tools.

In this tutorial, you use Apache Spark to:

Read data from Azure Open Datasets containers.
Write data into a Fabric lakehouse delta table.

Important

Microsoft Fabric is in preview.

Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview) trial.

Sign in to Microsoft Fabric .

Switch to the Data Science experience by using the experience switcher icon on the left side of your home page.

Add a lakehouse to this notebook. You'll be downloading data from a public blob, then storing the data in the lakehouse.

Follow along in notebook


1-ingest-data.ipynb is the notebook that accompanies this tutorial.

If you want to open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science to import
the tutorial notebooks to your workspace.

Or, if you'd rather copy and paste the code from this page, you can create a new notebook.

Be sure to attach a lakehouse to the notebook before you start running code.

Bank churn data


The dataset contains churn status of 10,000 customers. It also includes attributes that could impact churn such as:
Credit score
Geographical location (Germany, France, Spain)
Gender (male, female)
Age
Tenure (years of being bank's customer)
Account balance
Estimated salary
Number of products that a customer has purchased through the bank
Credit card status (whether a customer has a credit card or not)
Active member status (whether the customer is an active member of the bank or not)

The dataset also includes columns such as row number, customer ID, and customer surname that should have no impact on a customer's decision to leave the bank.

The event that defines the customer's churn is the closing of the customer's bank account. The column Exited in the dataset refers to the customer's abandonment. There isn't much context available about these attributes, so you have to proceed without background information about the dataset. The aim is to understand how these attributes contribute to the Exited status.

Example rows from the dataset:

CustomerID  Surname   CreditScore  Geography  Gender  Age  Tenure  Balance   NumOfProducts  HasCrCard  IsActiveMember
15634602    Hargrave  619          France     Female  42   2       0.00      1              1          1
15647311    Hill      608          Spain      Female  41   1       83807.86  1              0          1

Download dataset and upload to lakehouse

 Tip

By defining the following parameters, you can use this notebook with different datasets easily.

Python

IS_CUSTOM_DATA = False # if TRUE, dataset has to be uploaded manually

DATA_ROOT = "/lakehouse/default"
DATA_FOLDER = "Files/churn" # folder with data files
DATA_FILE = "churn.csv" # data file name

This code downloads a publicly available version of the dataset and then stores it in a Fabric lakehouse.

Important

Make sure you add a lakehouse to the notebook before running it. Failure to do so will result in an error.

Python

import os, requests

if not IS_CUSTOM_DATA:
    # Download demo data files into lakehouse if not exist
    remote_url = "https://synapseaisolutionsa.blob.core.windows.net/public/bankcustomerchurn"
    file_list = [DATA_FILE]
    download_path = f"{DATA_ROOT}/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        raise FileNotFoundError(
            "Default lakehouse not found, please add a lakehouse and restart the session."
        )
    os.makedirs(download_path, exist_ok=True)
    for fname in file_list:
        if not os.path.exists(f"{download_path}/{fname}"):
            r = requests.get(f"{remote_url}/{fname}", timeout=30)
            with open(f"{download_path}/{fname}", "wb") as f:
                f.write(r.content)
    print("Downloaded demo data files into lakehouse.")
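
Optionally, you can confirm the download by listing the files in the lakehouse folder; this is a minimal check that assumes the default paths defined above:

Python

import os

# List the downloaded file(s) in the lakehouse raw folder defined above.
print(os.listdir(f"{DATA_ROOT}/{DATA_FOLDER}/raw"))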
Next steps
You'll use the data you just ingested in:

Part 2: Explore and visualize data using notebooks



Tutorial Part 2: Explore and visualize data using Microsoft Fabric notebooks
Article • 10/20/2023

In this tutorial, you'll learn how to conduct exploratory data analysis (EDA) to examine
and investigate the data while summarizing its key characteristics through the use of
data visualization techniques.

You'll use seaborn , a Python data visualization library that provides a high-level interface
for building visuals on dataframes and arrays. For more information about seaborn , see
Seaborn: Statistical Data Visualization .

You'll also use Data Wrangler, a notebook-based tool that provides you with an
immersive experience to conduct exploratory data analysis and cleaning.

Important

Microsoft Fabric is in preview.

The main steps in this tutorial are:

1. Read the data stored in a delta table in the lakehouse.
2. Convert the Spark DataFrame to a pandas DataFrame, which Python visualization libraries support.
3. Use Data Wrangler to perform initial data cleaning and transformation.
4. Perform exploratory data analysis using seaborn.

Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview)
trial.

Sign in to Microsoft Fabric .

Switch to the Data Science experience by using the experience switcher icon on the
left side of your home page.

This is part 2 of 5 in the tutorial series. To complete this tutorial, first complete:

Part 1: Ingest data into a Microsoft Fabric lakehouse using Apache Spark.

Follow along in notebook


2-explore-cleanse-data.ipynb is the notebook that accompanies this tutorial.

If you want to open the accompanying notebook for this tutorial, follow the instructions
in Prepare your system for data science to import the tutorial notebooks to your
workspace.

Or, if you'd rather copy and paste the code from this page, you can create a new
notebook.

Be sure to attach a lakehouse to the notebook before you start running code.

Important

Attach the same lakehouse you used in Part 1.

Read raw data from the lakehouse


Read raw data from the Files section of the lakehouse. You uploaded this data in the
previous notebook. Make sure you have attached the same lakehouse you used in Part 1
to this notebook before you run this code.

Python

df = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("Files/churn/raw/churn.csv")
    .cache()
)

Create a pandas DataFrame from the dataset


Convert the Spark DataFrame to a pandas DataFrame for easier processing and visualization.

Python

df = df.toPandas()

Display raw data


Explore the raw data with display, compute some basic statistics, and show chart views. Note that you first need to import the required libraries such as NumPy, pandas, seaborn, and Matplotlib for data analysis and visualization.

Python

import seaborn as sns
sns.set_theme(style="whitegrid", palette="tab10", rc={'figure.figsize': (9, 6)})
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from matplotlib import rc, rcParams
import numpy as np
import pandas as pd
import itertools

Python

display(df, summary=True)

Use Data Wrangler to perform initial data cleaning

To explore and transform any pandas DataFrame in your notebook, launch Data Wrangler directly from the notebook.
Note

Data Wrangler cannot be opened while the notebook kernel is busy. The cell execution must complete before launching Data Wrangler.

1. Under the notebook ribbon Data tab, select Launch Data Wrangler. You'll see a list
of activated pandas DataFrames available for editing.
2. Select the DataFrame you wish to open in Data Wrangler. Since this notebook only contains one DataFrame, df, select df.

Data Wrangler launches and generates a descriptive overview of your data. The table in
the middle shows each data column. The Summary panel next to the table shows
information about the DataFrame. When you select a column in the table, the summary
updates with information about the selected column. In some instances, the data
displayed and summarized will be a truncated view of your DataFrame. When this
happens, you'll see a warning image in the summary pane. Hover over this warning to
view text explaining the situation.


Each operation you do can be applied in a matter of clicks, updating the data display in
real time and generating code that you can save back to your notebook as a reusable
function.

The rest of this section walks you through the steps to perform data cleaning with Data
Wrangler.

Drop duplicate rows


On the left panel is a list of operations (such as Find and replace, Format, Formulas,
Numeric) you can perform on the dataset.

1. Expand Find and replace and select Drop duplicate rows.

2. A panel appears for you to select the list of columns you want to compare to
define a duplicate row. Select RowNumber and CustomerId.

In the middle panel is a preview of the results of this operation. Under the preview
is the code to perform the operation. In this instance, the data appears to be
unchanged. But since you're looking at a truncated view, it's a good idea to still
apply the operation.

3. Select Apply (either at the side or at the bottom) to go to the next step.

Drop rows with missing data


Use Data Wrangler to drop rows with missing data across all columns.

1. Select Drop missing values from Find and replace.

2. Select All columns from the drop-down list.

3. Select Apply to go on to the next step.


Drop columns
Use Data Wrangler to drop columns that you don't need.

1. Expand Schema and select Drop columns.

2. Select RowNumber, CustomerId, Surname. These columns appear in red in the


preview, to show they're changed by the code (in this case, dropped.)

3. Select Apply to go on to the next step.

Add code to notebook


Each time you select Apply, a new step is created in the Cleaning steps panel on the
bottom left. At the bottom of the panel, select Preview code for all steps to view a
combination of all the separate steps.

Select Add code to notebook at the top left to close Data Wrangler and add the code automatically. The Add code to notebook option wraps the code in a function, then calls the function.
 Tip

The code generated by Data Wrangler won't be applied until you manually run the
new cell.

If you didn't use Data Wrangler, you can instead use this next code cell.

This code is similar to the code produced by Data Wrangler, but adds in the argument
inplace=True to each of the generated steps. By setting inplace=True , pandas will

overwrite the original DataFrame instead of producing a new DataFrame as an output.

Python

# Modified version of code generated by Data Wrangler
# Modification is to add inplace=True to each step

# Define a new function that includes all the above Data Wrangler operations
def clean_data(df):
    # Drop rows with missing data across all columns
    df.dropna(inplace=True)
    # Drop duplicate rows in columns: 'RowNumber', 'CustomerId'
    df.drop_duplicates(subset=['RowNumber', 'CustomerId'], inplace=True)
    # Drop columns: 'RowNumber', 'CustomerId', 'Surname'
    df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
    return df

df_clean = clean_data(df.copy())
df_clean.head()

Explore the data


Display some summaries and visualizations of the cleaned data.

Determine categorical, numerical, and target attributes


Use this code to determine categorical, numerical, and target attributes.

Python

# Determine the dependent (target) attribute
dependent_variable_name = "Exited"
print(dependent_variable_name)

# Determine the categorical attributes
categorical_variables = [
    col for col in df_clean.columns
    if col in "O" or df_clean[col].nunique() <= 5 and col not in "Exited"
]
print(categorical_variables)

# Determine the numerical attributes
numeric_variables = [
    col for col in df_clean.columns
    if df_clean[col].dtype != "object" and df_clean[col].nunique() > 5
]
print(numeric_variables)

The five-number summary


Show the five-number summary (the minimum, first quartile, median, third quartile, and maximum) for the numerical attributes, using box plots.

Python

df_num_cols = df_clean[numeric_variables]
sns.set(font_scale=0.7)
fig, axes = plt.subplots(nrows=2, ncols=3, gridspec_kw=dict(hspace=0.3), figsize=(17, 8))
fig.tight_layout()
for ax, col in zip(axes.flatten(), df_num_cols.columns):
    sns.boxplot(x=df_num_cols[col], color='green', ax=ax)
fig.delaxes(axes[1, 2])

Distribution of exited and nonexited customers
Show the distribution of exited versus nonexited customers across the categorical
attributes.

Python

attr_list = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'NumOfProducts', 'Tenure']
fig, axarr = plt.subplots(2, 3, figsize=(15, 4))
for ind, item in enumerate(attr_list):
    sns.countplot(x=item, hue='Exited', data=df_clean, ax=axarr[ind % 2][ind // 2])
fig.subplots_adjust(hspace=0.7)

Distribution of numerical attributes


Show the frequency distribution of the numerical attributes using histograms.

Python
columns = df_num_cols.columns[: len(df_num_cols.columns)]
fig = plt.figure()
fig.set_size_inches(18, 8)
length = len(columns)
for i, j in itertools.zip_longest(columns, range(length)):
    plt.subplot((length // 2), 3, j + 1)
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    df_num_cols[i].hist(bins=20, edgecolor='black')
    plt.title(i)
plt.show()

Perform feature engineering


Perform feature engineering to generate new attributes based on current attributes:

Python

df_clean["NewTenure"] = df_clean["Tenure"] / df_clean["Age"]
df_clean["NewCreditsScore"] = pd.qcut(df_clean['CreditScore'], 6, labels=[1, 2, 3, 4, 5, 6])
df_clean["NewAgeScore"] = pd.qcut(df_clean['Age'], 8, labels=[1, 2, 3, 4, 5, 6, 7, 8])
df_clean["NewBalanceScore"] = pd.qcut(df_clean['Balance'].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
df_clean["NewEstSalaryScore"] = pd.qcut(df_clean['EstimatedSalary'], 10, labels=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Use Data Wrangler to perform one-hot encoding

Data Wrangler can also be used to perform one-hot encoding. To do so, re-open Data Wrangler. This time, select the df_clean data.

1. Expand Formulas and select One-hot encode.


2. A panel appears for you to select the list of columns you want to perform one-hot
encoding on. Select Geography and Gender.

You could copy the generated code, close Data Wrangler to return to the notebook,
then paste into a new cell. Or, select Add code to notebook at the top left to close Data
Wrangler and add the code automatically.

If you didn't use Data Wrangler, you can instead use this next code cell:

Python

# This is the same code that Data Wrangler will generate

import pandas as pd

def clean_data(df_clean):
    # One-hot encode columns: 'Geography', 'Gender'
    df_clean = pd.get_dummies(df_clean, columns=['Geography', 'Gender'])
    return df_clean

df_clean_1 = clean_data(df_clean.copy())
df_clean_1.head()

Summary of observations from the exploratory data analysis

Most of the customers are from France compared to Spain and Germany, while Spain has the lowest churn rate compared to France and Germany.
Most of the customers have credit cards.
There are customers whose age and credit score are above 60 and below 400, respectively, but they can't be considered outliers.
Very few customers have more than two of the bank's products.
Customers who aren't active have a higher churn rate.
Gender and tenure years don't seem to have an impact on a customer's decision to close their bank account.

Create a delta table for the cleaned data


You'll use this data in the next notebook of this series.
Python

table_name = "df_clean"
# Create Spark DataFrame from pandas
sparkDF=spark.createDataFrame(df_clean_1)
sparkDF.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")

Next step
Train and register machine learning models with this data:

Part 3: Train and register machine learning models .



Tutorial Part 3: Train and register a machine learning model
Article • 10/20/2023

In this tutorial, you'll learn to train multiple machine learning models to select the best
one in order to predict which bank customers are likely to leave.

Important

Microsoft Fabric is in preview.

In this tutorial, you'll:

Train Random Forest and LightGBM models.
Use Microsoft Fabric's native integration with the MLflow framework to log the trained machine learning models, the hyperparameters used, and evaluation metrics.
Register the trained machine learning models.
Assess the performance of the trained machine learning models on the validation dataset.

MLflow is an open source platform for managing the machine learning lifecycle with
features like Tracking, Models, and Model Registry. MLflow is natively integrated with
the Fabric Data Science experience.

Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview)
trial.

Sign in to Microsoft Fabric .

Switch to the Data Science experience by using the experience switcher icon on the
left side of your home page.

This is part 3 of 5 in the tutorial series. To complete this tutorial, first complete:

Part 1: Ingest data into a Microsoft Fabric lakehouse using Apache Spark.
Part 2: Explore and visualize data using Microsoft Fabric notebooks to learn more
about the data.

Follow along in notebook


3-train-evaluate.ipynb is the notebook that accompanies this tutorial.

If you want to open the accompanying notebook for this tutorial, follow the instructions
in Prepare your system for data science to import the tutorial notebooks to your
workspace.

Or, if you'd rather copy and paste the code from this page, you can create a new
notebook.

Be sure to attach a lakehouse to the notebook before you start running code.

Important

Attach the same lakehouse you used in part 1 and part 2.

Install custom libraries


For this notebook, you'll install imbalanced-learn (imported as imblearn ) using %pip
install . Imbalanced-learn is a library for Synthetic Minority Oversampling Technique

(SMOTE) which is used when dealing with imbalanced datasets. The PySpark kernel will
be restarted after %pip install , so you'll need to install the library before you run any
other cells.
You'll access SMOTE using the imblearn library. Install it now using the in-line
installation capabilities (e.g., %pip , %conda ).

Python

# Install imblearn for SMOTE using pip


%pip install imblearn

 Tip

When you install a library in a notebook, it is only available for the duration of the
notebook session and not in the workspace. If you restart the notebook, you'll need
to install the library again. If you have a library you often use, you could instead
install it in your workspace to make it available to all notebooks in your workspace
without further installs.

Load the data


Prior to training any machine learning model, you need to load the delta table from the
lakehouse in order to read the cleaned data you created in the previous notebook.

Python

import pandas as pd
SEED = 12345
df_clean = spark.read.format("delta").load("Tables/df_clean").toPandas()

Generate experiment for tracking and logging the model using MLflow

This section demonstrates how to generate an experiment, specify the machine learning model and training parameters as well as scoring metrics, train the machine learning models, log them, and save the trained models for later use.

Python

import mlflow
# Setup experiment name
EXPERIMENT_NAME = "bank-churn-experiment" # MLflow experiment name

Microsoft Fabric extends the MLflow autologging capabilities: autologging works by automatically capturing the values of input parameters and output metrics of a machine learning model as it is being trained. This information is then logged to your workspace, where it can be accessed and visualized using the MLflow APIs or the corresponding experiment in your workspace.

All the experiments with their respective names are logged and you'll be able to track
their parameters and performance metrics. To learn more about autologging, see
Autologging in Microsoft Fabric .

Set experiment and autologging specifications


Python

mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(exclusive=False)

Import scikit-learn and LightGBM


With your data in place, you can now define the machine learning models. You'll apply Random Forest and LightGBM models in this notebook. Use scikit-learn and lightgbm to implement the models within a few lines of code.

Python

# Import the required libraries for model training
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, confusion_matrix, recall_score, roc_auc_score, classification_report

Prepare training, validation and test datasets


Use the train_test_split function from scikit-learn to split the data into training,
validation, and test sets.

Python

y = df_clean["Exited"]
X = df_clean.drop("Exited", axis=1)
# Split the dataset into 60%, 20%, 20% for training, validation, and test datasets
# Train-Test separation: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=SEED)
# Train-Validation separation: 25% of the remaining 80% equals 20% of the full dataset
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=SEED)

Save test data to a delta table


Save the test data to the delta table for use in the next notebook.

Python

table_name = "df_test"
# Create PySpark DataFrame from Pandas
df_test=spark.createDataFrame(X_test)
df_test.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark test DataFrame saved to delta table: {table_name}")

Apply SMOTE to the training data to synthesize new samples for the minority class

The data exploration in part 2 showed that out of the 10,000 data points corresponding to 10,000 customers, only 2,037 customers (around 20%) have left the bank. This indicates that the dataset is highly imbalanced. The problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary. SMOTE is the most widely used approach to synthesize new samples for the minority class. Learn more about SMOTE here and here.

 Tip

Note that SMOTE should only be applied to the training dataset. You must leave
the test dataset in its original imbalanced distribution in order to get a valid
approximation of how the machine learning model will perform on the original data, which represents the situation in production.

Python

from collections import Counter


from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=SEED)
X_res, y_res = sm.fit_resample(X_train, y_train)
new_train = pd.concat([X_res, y_res], axis=1)

Model training
Train the model using Random Forest with maximum depth of 4 and 4 features

Python

mlflow.sklearn.autolog(registered_model_name='rfc1_sm')  # Register the trained model with autologging
rfc1_sm = RandomForestClassifier(max_depth=4, max_features=4, min_samples_split=3, random_state=1)  # Pass hyperparameters
with mlflow.start_run(run_name="rfc1_sm") as run:
    rfc1_sm_run_id = run.info.run_id  # Capture run_id for model prediction later
    print("run_id: {}; status: {}".format(rfc1_sm_run_id, run.info.status))
    # rfc1.fit(X_train, y_train)  # Imbalanced training data
    rfc1_sm.fit(X_res, y_res.ravel())  # Balanced training data
    rfc1_sm.score(X_val, y_val)
    y_pred = rfc1_sm.predict(X_val)
    cr_rfc1_sm = classification_report(y_val, y_pred)
    cm_rfc1_sm = confusion_matrix(y_val, y_pred)
    roc_auc_rfc1_sm = roc_auc_score(y_res, rfc1_sm.predict_proba(X_res)[:, 1])

Train the model using Random Forest with maximum depth of 8 and 6 features

Python

mlflow.sklearn.autolog(registered_model_name='rfc2_sm')  # Register the trained model with autologging
rfc2_sm = RandomForestClassifier(max_depth=8, max_features=6, min_samples_split=3, random_state=1)  # Pass hyperparameters
with mlflow.start_run(run_name="rfc2_sm") as run:
    rfc2_sm_run_id = run.info.run_id  # Capture run_id for model prediction later
    print("run_id: {}; status: {}".format(rfc2_sm_run_id, run.info.status))
    # rfc2.fit(X_train, y_train)  # Imbalanced training data
    rfc2_sm.fit(X_res, y_res.ravel())  # Balanced training data
    rfc2_sm.score(X_val, y_val)
    y_pred = rfc2_sm.predict(X_val)
    cr_rfc2_sm = classification_report(y_val, y_pred)
    cm_rfc2_sm = confusion_matrix(y_val, y_pred)
    roc_auc_rfc2_sm = roc_auc_score(y_res, rfc2_sm.predict_proba(X_res)[:, 1])

Train the model using LightGBM


Python

# lgbm_model
mlflow.lightgbm.autolog(registered_model_name='lgbm_sm')  # Register the trained model with autologging
lgbm_sm_model = LGBMClassifier(learning_rate=0.07,
                               max_delta_step=2,
                               n_estimators=100,
                               max_depth=10,
                               eval_metric="logloss",
                               objective='binary',
                               random_state=42)

with mlflow.start_run(run_name="lgbm_sm") as run:
    lgbm1_sm_run_id = run.info.run_id  # Capture run_id for model prediction later
    # lgbm_sm_model.fit(X_train, y_train)  # Imbalanced training data
    lgbm_sm_model.fit(X_res, y_res.ravel())  # Balanced training data
    y_pred = lgbm_sm_model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    cr_lgbm_sm = classification_report(y_val, y_pred)
    cm_lgbm_sm = confusion_matrix(y_val, y_pred)
    roc_auc_lgbm_sm = roc_auc_score(y_res, lgbm_sm_model.predict_proba(X_res)[:, 1])

Experiment artifact for tracking model performance

The experiment runs are automatically saved in the experiment artifact that can be found in the workspace. They're named based on the name used for setting the experiment. All of the trained machine learning models, their runs, performance metrics, and model parameters are logged.

To view your experiments:

1. On the left panel, select your workspace.


2. Find and select the experiment name, in this case bank-churn-experiment. If you
don't see the experiment in your workspace, refresh your browser.

Assess the performance of the trained models on the validation dataset

Once done with machine learning model training, you can assess the performance of the trained models in two ways.

Open the saved experiment from the workspace, load the machine learning
models, and then assess the performance of the loaded models on the validation
dataset.

Python

# Define run_uri to fetch the model
# mlflow client: mlflow.model.url, list model
load_model_rfc1_sm = mlflow.sklearn.load_model(f"runs:/{rfc1_sm_run_id}/model")
load_model_rfc2_sm = mlflow.sklearn.load_model(f"runs:/{rfc2_sm_run_id}/model")
load_model_lgbm1_sm = mlflow.lightgbm.load_model(f"runs:/{lgbm1_sm_run_id}/model")
# Assess the performance of the loaded models on the validation dataset
ypred_rfc1_sm_v1 = load_model_rfc1_sm.predict(X_val)   # Random Forest with max depth of 4 and 4 features
ypred_rfc2_sm_v1 = load_model_rfc2_sm.predict(X_val)   # Random Forest with max depth of 8 and 6 features
ypred_lgbm1_sm_v1 = load_model_lgbm1_sm.predict(X_val) # LightGBM

Directly assess the performance of the trained machine learning models on the
validation dataset.

Python

ypred_rfc1_sm_v2 = rfc1_sm.predict(X_val)        # Random Forest with max depth of 4 and 4 features
ypred_rfc2_sm_v2 = rfc2_sm.predict(X_val)        # Random Forest with max depth of 8 and 6 features
ypred_lgbm1_sm_v2 = lgbm_sm_model.predict(X_val) # LightGBM

Depending on your preference, either approach is fine and should offer identical performance. In this notebook, you'll choose the first approach in order to better demonstrate the MLflow autologging capabilities in Microsoft Fabric.

Show True/False Positives/Negatives using the confusion matrix

Next, you'll develop a script to plot the confusion matrix in order to evaluate the accuracy of the classification using the validation dataset. The confusion matrix can be plotted using SynapseML tools as well, as shown in the Fraud Detection sample that is available here.

Python

import seaborn as sns
sns.set_theme(style="whitegrid", palette="tab10", rc={'figure.figsize': (9, 6)})
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from matplotlib import rc, rcParams
import numpy as np
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    print(cm)
    plt.figure(figsize=(4, 4))
    plt.rcParams.update({'font.size': 10})
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, color="blue")
    plt.yticks(tick_marks, classes, color="blue")
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="red" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Confusion Matrix for Random Forest Classifier with maximum depth of 4 and 4
features

Python

cfm = confusion_matrix(y_val, y_pred=ypred_rfc1_sm_v1)
plot_confusion_matrix(cfm, classes=['Non Churn', 'Churn'],
                      title='Random Forest with max depth of 4')
tn, fp, fn, tp = cfm.ravel()

Confusion Matrix for Random Forest Classifier with maximum depth of 8 and 6
features

Python

cfm = confusion_matrix(y_val, y_pred=ypred_rfc2_sm_v1)
plot_confusion_matrix(cfm, classes=['Non Churn', 'Churn'],
                      title='Random Forest with max depth of 8')
tn, fp, fn, tp = cfm.ravel()

Confusion Matrix for LightGBM

Python

cfm = confusion_matrix(y_val, y_pred=ypred_lgbm1_sm_v1)
plot_confusion_matrix(cfm, classes=['Non Churn', 'Churn'],
                      title='LightGBM')
tn, fp, fn, tp = cfm.ravel()
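
As a follow-up, the tn, fp, fn, and tp values unpacked from the confusion matrix can be turned into summary metrics directly; here is a minimal sketch using the values from the previous cell:

Python

# Derive summary metrics from the most recently computed confusion matrix values.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}, accuracy={accuracy:.3f}")
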
Next step
Part 4: Perform batch scoring and save predictions to a lakehouse



Tutorial Part 4: Perform batch scoring and save predictions to a lakehouse
Article • 10/20/2023

In this tutorial, you'll learn to import the registered LightGBMClassifier model that was
trained in part 3 using the Microsoft Fabric MLflow model registry, and perform batch
predictions on a test dataset loaded from a lakehouse.

Important

Microsoft Fabric is in preview.

Microsoft Fabric allows you to operationalize machine learning models with a scalable
function called PREDICT, which supports batch scoring in any compute engine. You can
generate batch predictions directly from a Microsoft Fabric notebook or from a given
model's item page. Learn about PREDICT .

To generate batch predictions on the test dataset, you'll use version 1 of the trained
LightGBM model that demonstrated the best performance among all trained machine
learning models. You'll load the test dataset into a spark DataFrame and create an
MLFlowTransformer object to generate batch predictions. You can then invoke the
PREDICT function using one of the following three ways:

Transformer API from SynapseML
Spark SQL API
PySpark user-defined function (UDF)

Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview)
trial.

Sign in to Microsoft Fabric .

Switch to the Data Science experience by using the experience switcher icon on the
left side of your home page.

This is part 4 of 5 in the tutorial series. To complete this tutorial, first complete:

Part 1: Ingest data into a Microsoft Fabric lakehouse using Apache Spark.
Part 2: Explore and visualize data using Microsoft Fabric notebooks to learn more
about the data.
Part 3: Train and register machine learning models.

Follow along in notebook


4-predict.ipynb is the notebook that accompanies this tutorial.

If you want to open the accompanying notebook for this tutorial, follow the instructions
in Prepare your system for data science to import the tutorial notebooks to your
workspace.

Or, if you'd rather copy and paste the code from this page, you can create a new
notebook.

Be sure to attach a lakehouse to the notebook before you start running code.

Important

Attach the same lakehouse you used in the other parts of this series.

Load the test data


Load the test data that you saved in Part 3.

Python
df_test = spark.read.format("delta").load("Tables/df_test")
display(df_test)

PREDICT with the Transformer API


To use the Transformer API from SynapseML, you'll need to first create an
MLFlowTransformer object.

Instantiate MLFlowTransformer object


The MLFlowTransformer object is a wrapper around the MLFlow model that you
registered in Part 3. It allows you to generate batch predictions on a given DataFrame.
To instantiate the MLFlowTransformer object, you'll need to provide the following
parameters:

The columns from the test DataFrame that you need as input to the model (in this
case, you would need all of them).
A name for the new output column (in this case, predictions).
The correct model name and model version to generate the predictions (in this
case, lgbm_sm and version 1).

Python

from synapse.ml.predict import MLFlowTransformer

model = MLFlowTransformer(
    inputCols=list(df_test.columns),
    outputCol='predictions',
    modelName='lgbm_sm',
    modelVersion=1
)

Now that you have the MLFlowTransformer object, you can use it to generate batch
predictions.

Python

import pandas

predictions = model.transform(df_test)
display(predictions)
PREDICT with the Spark SQL API
The following code invokes the PREDICT function with the Spark SQL API.

Python

from pyspark.ml.feature import SQLTransformer

# Substitute "model_name", "model_version", and "features" below with values
# for your own model name, model version, and feature columns
model_name = 'lgbm_sm'
model_version = 1
features = df_test.columns

sqlt = SQLTransformer().setStatement(
    f"SELECT PREDICT('{model_name}/{model_version}', {','.join(features)}) as predictions FROM __THIS__")

# Substitute "df_test" below with your own test dataset
display(sqlt.transform(df_test))

PREDICT with a user-defined function (UDF)


The following code invokes the PREDICT function with a PySpark UDF.

Python

from pyspark.sql.functions import col, pandas_udf, udf, lit

# Substitute "model" and "features" below with values for your own model name and feature columns
my_udf = model.to_udf()
features = df_test.columns

display(df_test.withColumn("predictions", my_udf(*[col(f) for f in features])))

Note that you can also generate PREDICT code from a model's item page. Learn about
PREDICT .

Write model prediction results to the lakehouse


Once you have generated batch predictions, write the model prediction results back to
the lakehouse.

Python
# Save predictions to lakehouse to be used for generating a Power BI report
table_name = "customer_churn_test_predictions"
predictions.write.format('delta').mode("overwrite").save(f"Tables/{table_name}")
print(f"Spark DataFrame saved to delta table: {table_name}")

Next step
Continue on to:

Part 5: Create a Power BI report to visualize predictions



Tutorial Part 5: Visualize predictions with a Power BI report
Article • 10/20/2023

In this tutorial, you'll create a Power BI report from the predictions data that was
generated in Part 4: Perform batch scoring and save predictions to a lakehouse.

Important

Microsoft Fabric is in preview.

You'll learn how to:

Create a dataset for Power BI from the predictions data.
Add new measures to the data from Power BI.
Create a Power BI report.
Add visualizations to the report.

Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview)
trial.

Sign in to Microsoft Fabric .

Switch to the Data Science experience by using the experience switcher icon on the
left side of your home page.

This is part 5 of 5 in the tutorial series. To complete this tutorial, first complete:
Part 1: Ingest data into a Microsoft Fabric lakehouse using Apache Spark.
Part 2: Explore and visualize data using Microsoft Fabric notebooks to learn more
about the data.
Part 3: Train and register machine learning models.
Part 4: Perform batch scoring and save predictions to a lakehouse.

Create a Power BI dataset


Create a new Power BI dataset linked to the predictions data you produced in part 4:

1. On the left, select your workspace.

2. On the top left, select Lakehouse as a filter.

3. Select the lakehouse that you used in the previous parts of the tutorial series.

4. Select New Power BI dataset on the top ribbon.

5. Give the dataset a name, such as "bank churn predictions." Then select the
customer_churn_test_predictions dataset.
6. Select Confirm.

Add new measures


Now add a few measures to the dataset:

1. Add a new measure for the churn rate.

a. Select New measure in the top ribbon. This action adds a new item named
Measure to the customer_churn_test_predictions dataset, and opens a formula
bar above the table.
b. To determine the average predicted churn rate, replace Measure = in the
formula bar with:

DAX

Churn Rate = AVERAGE(customer_churn_test_predictions[predictions])

c. To apply the formula, select the check mark in the formula bar. The new
measure appears in the data table. The calculator icon shows it was created as a
measure.

d. Change the format from General to Percentage in the Properties panel.

e. Scroll down in the Properties panel to change the Decimal places to 1.


2. Add a new measure that counts the total number of bank customers. You'll need it
for the rest of the new measures.

a. Select New measure in the top ribbon to add a new item named Measure to
the customer_churn_test_predictions dataset. This action also opens a formula
bar above the table.

b. Each prediction represents one customer. To determine the total number of


customers, replace Measure = in the formula bar with:

DAX

Customers = COUNT(customer_churn_test_predictions[predictions])

c. Select the check mark in the formula bar to apply the formula.

3. Add the churn rate for Germany.

a. Select New measure in the top ribbon to add a new item named Measure to
the customer_churn_test_predictions dataset. This action also opens a formula
bar above the table.

b. To determine the churn rate for Germany, replace Measure = in the formula bar
with:

DAX

Germany Churn = CALCULATE(customer_churn_test_predictions[Churn Rate], customer_churn_test_predictions[Geography_Germany] = 1)

This filters the rows down to the ones with Germany as their geography
(Geography_Germany equals one).

c. To apply the formula, select the check mark in the formula bar.

4. Repeat the above step to add the churn rates for France and Spain.

Spain's churn rate:

DAX

Spain Churn = CALCULATE(customer_churn_test_predictions[Churn Rate], customer_churn_test_predictions[Geography_Spain] = 1)

France's churn rate:

DAX

France Churn = CALCULATE(customer_churn_test_predictions[Churn Rate], customer_churn_test_predictions[Geography_France] = 1)

Create new report


Once you're done with all operations, move on to the Power BI report authoring page
by selecting Create report on the top ribbon.

Once the report page appears, add these visuals:


1. Select the text box on the top ribbon and enter a title for the report, such as "Bank
Customer Churn". Change the font size and background color in the Format panel.
Adjust the font size and color by selecting the text and using the format bar.

2. In the Visualizations panel, select the Card icon. From the Data pane, select Churn
Rate. Change the font size and background color in the Format panel. Drag this
visualization to the top right of the report.

3. In the Visualizations panel, select the Line and stacked column chart icon. Select
age for the x-axis, Churn Rate for column y-axis, and Customers for the line y-axis.

4. In the Visualizations panel, select the Line and stacked column chart icon. Select
NumOfProducts for x-axis, Churn Rate for column y-axis, and Customers for the
line y-axis.

5. In the Visualizations panel, select the Stacked column chart icon. Select
NewCreditsScore for x-axis and Churn Rate for y-axis.

Change the title "NewCreditsScore" to "Credit Score" in the Format panel.


6. In the Visualizations panel, select the Clustered column chart card. Select
Germany Churn, Spain Churn, France Churn in that order for the y-axis.

7 Note

This report is an illustrative example of how you might analyze the saved prediction results in Power BI. For a real customer churn use case, you may need to ideate more thoroughly about which visualizations to create, based on your subject matter expertise and on the metrics that your firm and business analytics team have standardized.

The Power BI report shows:

Customers who use more than two of the bank products have a higher churn rate, although few customers had more than two products. The bank should collect more data, but also investigate other features correlated with more products (see the plot in the bottom left panel).
Bank customers in Germany have a higher churn rate than in France and Spain (see the plot in the bottom right panel), which suggests that an investigation into what has encouraged customers to leave could be beneficial.
Most customers are middle aged (between 25 and 45), and customers between 45 and 60 tend to exit more.
Finally, customers with lower credit scores would most likely leave the bank for other financial institutions. The bank should look into ways to encourage customers with lower credit scores and account balances to stay with the bank.

Next step
This completes the five part tutorial series. See other end-to-end sample tutorials:

How to use end-to-end AI samples in Microsoft Fabric



How to use Microsoft Fabric notebooks
Article • 09/27/2023

The Microsoft Fabric notebook is a primary code item for developing Apache Spark jobs and machine learning experiments. It's a web-based interactive surface used by data scientists and data engineers to write code, benefiting from rich visualizations and Markdown text. Data engineers write code for data ingestion, data preparation, and data transformation. Data scientists also use notebooks to build machine learning solutions, including creating experiments and models, model tracking, and deployment.

) Important

Microsoft Fabric is in preview.

With a Microsoft Fabric notebook, you can:

Get started with zero setup effort.


Easily explore and process data with an intuitive low-code experience.
Keep data secure with built-in enterprise security features.
Analyze data across raw formats (CSV, TXT, JSON, etc.) and processed file formats (Parquet, Delta Lake, etc.), leveraging powerful Spark capabilities.
Be productive with enhanced authoring capabilities and built-in data visualization.

This article describes how to use notebooks in data science and data engineering
experiences.

Create notebooks
You can either create a new notebook or import an existing notebook.

Create a new notebook


As with other standard Microsoft Fabric items, you can easily create a new notebook from the Microsoft Fabric Data Engineering homepage, the workspace New button, or the Create Hub.

Import existing notebooks


You can import one or more existing notebooks from your local computer to a Microsoft
Fabric workspace from the Data Engineering or the Data Science homepage. Microsoft
Fabric notebooks can recognize the standard Jupyter Notebook .ipynb files, and source
files like .py, .scala, and .sql, and create new notebook items accordingly.

Export a notebook
You can export your notebook to other standard formats. Notebooks can be exported to:

Standard Notebook file (.ipynb), the format commonly used for Jupyter notebooks.
HTML file (.html) that can be opened from a browser directly.
Python file (.py).
LaTeX file (.tex).

Save a notebook
In Microsoft Fabric, a notebook saves automatically by default after you open and edit it, so you don't need to worry about losing code changes. You can also use Save a copy to clone a copy into the current workspace or another workspace.

If you prefer to save a notebook manually, switch to Manual save mode (Edit -> Save options -> Manual). This gives you a "local branch" of your notebook item; save your changes with the Save button or the Ctrl+S keybinding.

Connect lakehouses and notebooks


Microsoft Fabric notebooks now support close interaction with lakehouses; you can easily add a new or existing lakehouse from the lakehouse explorer.

You can navigate to different lakehouses in the lakehouse explorer and set one
lakehouse as the default by pinning it. It will then be mounted to the runtime working
directory and you can read or write to the default lakehouse using a local path.
7 Note

You need to restart the session after pinning a new lakehouse or renaming the
default lakehouse.
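
For example, here's a minimal sketch of reading from the pinned default lakehouse with a relative path; the file location Files/churn/raw/churn.csv is an assumption reused from the tutorial earlier in this document.

Python

# A minimal sketch: because the default lakehouse is mounted, a relative path under
# "Files/" resolves against it. The file path below is an assumption from the earlier tutorial.
df = spark.read.option("header", True).csv("Files/churn/raw/churn.csv")
display(df)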

Add or remove a lakehouse


Selecting the X icon beside a lakehouse name removes it from the notebook tab, but
the lakehouse item still exists in the workspace.

Select Add lakehouse to add more lakehouses to the notebook, either by adding an
existing one or creating a new lakehouse.

Explore a lakehouse file


The subfolder and files under the Tables and Files section of the Lake view appear in a
content area between the lakehouse list and the notebook content. Select different
folders in the Tables and Files section to refresh the content area.

Folder and File operations


If you right-click a file (.csv, .parquet, .txt, .jpg, .png, etc.), you can load the data with either the Spark or the Pandas API. A new code cell is generated and inserted below the focused cell.

You can also copy the path of the selected file or folder in different formats and use the corresponding path in your code.
Notebook resources
The notebook resource explorer provides a Unix-like file system to help you manage your folders and files. It offers a writeable file system space where you can store small files, such as code modules, datasets, and images. You can easily access them with code in the notebook as if you were working with your local file system.
The Built-in folder is a system predefined folder for each notebook instance. It preserves up to 500 MB of storage for the dependencies of the current notebook. These are the key capabilities of notebook resources:

You can use common operations such as create/delete, upload/download,


drag/drop, rename, duplicate, and search through the UI.
You can use relative paths like builtin/YourData.txt for quick exploration. The
mssparkutils.nbResPath method helps you compose the full path.

You can easily move your validated data to a Lakehouse via the Write to
Lakehouse option. We have embedded rich code snippets for common file types
to help you quickly get started.
These resources are also available for use in the Reference Notebook run case via
mssparkutils.notebook.run() .

7 Note

Currently, uploading through the UI supports these file types: .py, .txt, .json, .yml, .xml, .csv, .html, .png, .jpg, and .xlsx. You can write file types that aren't in the list to the built-in folder via code; however, Fabric notebooks don't generate code snippets when operating on unsupported file types.
Each file must be smaller than 50 MB, and the Built-in folder allows up to 100 file/folder instances in total.
When using mssparkutils.notebook.run() , we recommend using the mssparkutils.nbResPath command to access the target notebook resource. The relative path "builtin/" always points to the root notebook's built-in folder.
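
As a quick illustration, here's a minimal sketch of reading a resource file from the built-in folder; the file name YourData.txt is the placeholder used above and must exist in your notebook's resources for the code to run.

Python

# A minimal sketch: read a small text file uploaded to the notebook's built-in resources folder.
# "YourData.txt" is a placeholder file name.
from notebookutils import mssparkutils

# Relative paths under "builtin/" resolve against the notebook resource folder
with open("builtin/YourData.txt") as f:
    print(f.read())

# mssparkutils.nbResPath helps compose the full path of the resource folder
print(mssparkutils.nbResPath)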

Collaborate in a notebook
The Microsoft Fabric notebook is a collaborative item that supports multiple users
editing the same notebook.

When you open a notebook, you enter the co-editing mode by default, and every
notebook edit will be auto-saved. If your colleagues open the same notebook at the
same time, you see their profile, run output, cursor indicator, selection indicator and
editing trace. By leveraging the collaborating features, you can easily accomplish pair
programming, remote debugging, and tutoring scenarios.

Share a notebook
Sharing a notebook is a convenient way for you to collaborate with team members.
Authorized workspace roles can view or edit/run notebooks by default. You can share a
notebook with specified permissions granted.

1. Select the Share button on the notebook toolbar.

2. Select the corresponding category of people who can view this notebook. You can
check Share, Edit, or Run to grant the permissions to the recipients.

3. After you "Apply" the selection, you can either send the notebook directly or copy
the link to others, and then the recipients can open the notebook with the
corresponding view granted by the permission.
4. To further manage your notebook permissions, you can find the "Manage
permissions" entry in the Workspace item list > More options to update the
existing notebook access and permission.
Comment a code cell
Commenting is another useful feature during collaborative scenarios. Currently, we
support adding cell-level comments.

1. Select the Comments button on the notebook toolbar or cell comment indicator
to open the Comments pane.

2. Select code in the code cell, select New in the Comments pane, add comments,
and then select the post comment button to save.
3. You can Edit comment, Resolve thread, or Delete thread by selecting the More button beside your comment.

Switch Notebook mode


Fabric notebooks support two modes for different scenarios; you can easily switch between Editing mode and Viewing mode.

Editing mode: You can edit and run the cells and collaborate with others on the notebook.
Viewing mode: You can only view the cell content, output, and comments of the notebook; all operations that would change the notebook are disabled.

Next steps
Author and execute notebooks



Develop, execute, and manage Microsoft
Fabric notebooks
Article • 10/09/2023

The Microsoft Fabric notebook is a primary code item for developing Apache Spark jobs and machine learning experiments. It's a web-based interactive surface used by data scientists and data engineers to write code, benefiting from rich visualizations and Markdown text. This article explains how to develop notebooks with code cell operations and run them.

) Important

Microsoft Fabric is in preview.

Develop notebooks
Notebooks consist of cells, which are individual blocks of code or text that can be run
independently or as a group.

We provide rich operations to develop notebooks:

Add a cell
Set a primary language
Use multiple languages
IDE-style IntelliSense
Code snippets
Drag and drop to insert snippets
Drag and drop to insert images
Format text cell with toolbar buttons
Undo or redo cell operation
Move a cell
Delete a cell
Collapse a cell input
Collapse a cell output
Lock or freeze a cell
Notebook contents
Markdown folding
Find and replace

Add a cell
There are multiple ways to add a new cell to your notebook.

1. Hover over the space between two cells and select Code or Markdown.

2. Use Shortcut keys under command mode. Press A to insert a cell above the current cell.
Press B to insert a cell below the current cell.

Set a primary language

Microsoft Fabric notebooks currently support four Apache Spark languages:

PySpark (Python)
Spark (Scala)
Spark SQL
SparkR

You can set the primary language for new added cells from the dropdown list in the top
command bar.

Use multiple languages


You can use multiple languages in a notebook by specifying the language magic command at the beginning of a cell. You can also switch the cell language from the language picker. The following table lists the magic commands for switching cell languages.

Magic command | Language | Description
%%pyspark | Python | Execute a Python query against the Spark context.
%%spark | Scala | Execute a Scala query against the Spark context.
%%sql | SparkSQL | Execute a SparkSQL query against the Spark context.
%%html | HTML | Render HTML content in the cell output.
%%sparkr | R | Execute an R query against the Spark context.

IDE-style IntelliSense
Microsoft Fabric notebooks are integrated with the Monaco editor to bring IDE-style
IntelliSense to the cell editor. Syntax highlight, error marker, and automatic code
completions help you to write code and identify issues quicker.

The IntelliSense features are at different levels of maturity for different languages. The
following table shows what's supported:

Languages | Syntax Highlight | Syntax Error Marker | Syntax Code Completion | Variable Code Completion | System Function Code Completion | User Function Code Completion | Smart Indent | Code Folding
PySpark (Python) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Spark (Scala) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
SparkSQL | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes
SparkR | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes

7 Note

An active Spark session is required to make use of the IntelliSense code completion.

Code snippets
Microsoft Fabric notebooks provide code snippets that help you write commonly used code
patterns easily, like:

Reading data as a Spark DataFrame, or


Drawing charts with Matplotlib.

Snippets appear in the IDE-style IntelliSense suggestions, mixed with other suggestions. The code snippet contents align with the code cell language. You can see the available snippets by typing Snippet or any keyword that appears in a snippet title in the code cell editor. For example, by typing read you can see the list of snippets for reading data from various data sources.
Drag and drop to insert snippets
You can conveniently use drag and drop to read data from the Lakehouse explorer. Multiple file types are supported; you can operate on text files, tables, images, etc. You can drop onto either an existing cell or a new cell. The notebook generates the code snippet accordingly to preview the data.

Drag and drop to insert images


You can use drag and drop to insert images from your browser or local computer to a
markdown cell conveniently.
Format text cell with toolbar buttons
You can use the format buttons in the text cells toolbar to do common markdown actions.

Undo or redo cell operation


Select the Undo or Redo button, or press Z or Shift+Z to revoke the most recent cell
operations. You can undo or redo up to the latest 10 historical cell operations.

Supported undo cell operations:

Insert or delete cell: You can revoke a delete operation by selecting Undo; the text content is kept along with the cell.
Reorder cell.
Toggle parameter.
Convert between code cell and Markdown cell.

7 Note

In-cell text operations and code cell commenting operations can't be undone. You can
undo or redo up to the latest 10 historical cell operations.

Move a cell
You can drag from the empty part of a cell and drop it to the desired position.

You can also move the selected cell using Move up and Move down on the ribbon.

Delete a cell
To delete a cell, select the delete button on the right side of the cell.
You can also use shortcut keys under command mode. Press Shift+D to delete the current
cell.

Collapse a cell input


Select the More commands ellipses (...) on the cell toolbar and Hide input to collapse the current cell's input. To expand it, select Show input while the cell is collapsed.

Collapse a cell output


Select the More commands ellipses (...) on the cell toolbar and Hide output to collapse the current cell's output. To expand it, select Show output while the cell's output is hidden.

Lock or freeze a cell


Lock and freeze cell operations allow you to make cells read-only or stop code cells from
being run on an individual cell basis.

Notebook contents
The Outlines or Table of Contents presents the first markdown header of any markdown cell
in a sidebar window for quick navigation. The Outlines sidebar is resizable and collapsible to
fit the screen in the best ways possible. You can select the Contents button on the notebook
command bar to open or hide the sidebar.

Markdown folding
The markdown folding allows you to hide cells under a markdown cell that contains a
heading. The markdown cell and its hidden cells are treated the same as a set of contiguous
multi-selected cells when performing cell operations.

Find and replace


Find and replace can help you easily match and locate the keywords or expression within
your notebook content, and you can replace the target string with a new string.

Run notebooks
You can run the code cells in your notebook individually or all at once. The status and
progress of each cell is represented in the notebook.

Run a cell
There are several ways to run the code in a cell.
1. Hover on the cell you want to run and select the Run Cell button or press Ctrl+Enter.

2. Use Shortcut keys under command mode. Press Shift+Enter to run the current cell and
select the next cell. Press Alt+Enter to run the current cell and insert a new cell.

Run all cells


Select the Run All button to run all the cells in the current notebook in sequence.

Run all cells above or below


Expand the dropdown list from Run all button, then select Run cells above to run all the cells
above the current in sequence. Select Run cells below to run the current cell and all the cells
below the current in sequence.

Cancel all running cells


Select the Cancel All button to cancel the running cells or cells waiting in the queue.

Stop session
Stop session cancels the running and waiting cells and stops the current session. You can
restart a brand new session by selecting the run button again.

Notebook reference run


Besides the mssparkutils reference run API, you can also use the %run <notebook name> magic command to reference another notebook within the current notebook's context. All the variables defined in the referenced notebook are available in the current notebook. The %run magic command supports nested calls but doesn't support recursive calls. You receive an exception if the statement depth is larger than five.

Example: %run Notebook1 { "parameterInt": 1, "parameterFloat": 2.5, "parameterBool":


true, "parameterString": "abc" } .

Notebook reference works in both interactive mode and pipeline.
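
For comparison, here's a minimal sketch of the mssparkutils reference-run API mentioned above; the notebook name Notebook1, the timeout, and the parameter values are hypothetical.

Python

# A minimal sketch: run another notebook by name with a timeout (in seconds) and a parameter map.
# "Notebook1" and the parameters are hypothetical placeholders.
from notebookutils import mssparkutils

exit_value = mssparkutils.notebook.run("Notebook1", 90, {"parameterInt": 1, "parameterString": "abc"})
print(exit_value)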

7 Note

The %run command currently only supports referencing notebooks in the same workspace as the current notebook.

The %run command currently only supports four parameter value types: int , float , bool , and string . Variable replacement operations aren't supported.

The %run command doesn't support nested references with a depth larger than five.

Variable explorer
Microsoft Fabric notebooks provide a built-in variable explorer that shows the name, type, length, and value of variables in the current Spark session for PySpark (Python) cells. More variables show up automatically as they're defined in the code cells. Clicking each column header sorts the variables in the table.

You can select the Variables button on the notebook ribbon “View” tab to open or hide the
variable explorer.


7 Note

The variable explorer only supports Python.

Cell status indicator


A step-by-step cell execution status is displayed beneath the cell to help you see its current
progress. Once the cell run is complete, an execution summary with the total duration and
end time is shown and stored there for future reference.

Inline spark job indicator


The Microsoft Fabric notebook is Spark based. Code cells are executed remotely on the Spark cluster. A Spark job progress indicator with a real-time progress bar helps you understand the job execution status. The number of tasks per job or stage helps you identify the parallelism level of your Spark job. You can also drill deeper into the Spark UI of a specific job (or stage) by selecting the link on the job (or stage) name.

You can also find the Cell level real-time log next to the progress indicator, and Diagnostics
can provide you with useful suggestions to help refine and debug the code.

In More actions, you can easily navigate to the Spark application details page and Spark
web UI page.


Secret redaction
To prevent credentials from being accidentally leaked when running notebooks, Fabric notebooks support secret redaction, which replaces secret values displayed in the cell output with [REDACTED] . Secret redaction is applicable to Python, Scala, and R.

Magic commands in notebook

Built-in magics
You can use familiar IPython magic commands in Fabric notebooks. Review the following list of currently available magic commands.

7 Note

Only the following magic commands are supported in a Fabric pipeline: %%pyspark, %%spark, %%csharp, %%sql.

Available line magics: %lsmagic , %time , %timeit , %history , %run, %load , %alias,
%alias_magic, %autoawait, %autocall, %automagic, %bookmark, %cd, %colors, %dhist, %dirs,
%doctest_mode, %killbgscripts, %load_ext, %logoff, %logon, %logstart, %logstate, %logstop,
%magic, %matplotlib, %page, %pastebin, %pdef, %pfile, %pinfo, %pinfo2, %popd, %pprint,
%precision, %prun, %psearch, %psource, %pushd, %pwd, %pycat, %quickref, %rehashx,
%reload_ext, %reset, %reset_selective, %sx, %system, %tb, %unalias, %unload_ext, %who,
%who_ls, %whos, %xdel, %xmode.

Fabric notebook also supports improved library management commands %pip, %conda,
check Manage Apache Spark libraries in Microsoft Fabric for the usage.

Available cell magics: %%time , %%timeit , %%capture , %%writefile , %%sql,


%%pyspark, %%spark, %%csharp, %%html , %%bash , %%markdown , %%perl ,
%%script , %%sh .
Custom magics
You can also build out custom magic commands to meet your specific needs, as the following example shows.

1. Create a notebook named "MyLakehouseModule".

2. In another notebook, reference "MyLakehouseModule" and its magic commands. This way you can conveniently organize your project with notebooks that use different languages.
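
Under the hood, custom magics rely on the standard IPython mechanism. Here's a minimal sketch of defining a custom cell magic with IPython's public API; the magic name shout and its behavior are hypothetical examples rather than part of the Fabric product.

Python

# A minimal sketch: register a custom cell magic with IPython's public API.
# The magic name "shout" is a hypothetical example.
from IPython.core.magic import register_cell_magic

@register_cell_magic
def shout(line, cell):
    # Echo the cell content in upper case; replace with your own project-specific logic
    print(cell.upper())

After this cell runs, a cell that starts with %%shout echoes its content in upper case.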

IPython Widgets
IPython Widgets are eventful Python objects that have a representation in the browser. You can use IPython Widgets as low-code controls (for example, slider, text box) in your notebook just like in a Jupyter notebook; currently they only work in a Python context.

To use IPython Widget


1. You need to import ipywidgets module first to use the Jupyter Widget framework.
Python

import ipywidgets as widgets

2. You can use the top-level display function to render a widget, or leave an expression of widget type on the last line of the code cell.

Python

slider = widgets.IntSlider()
display(slider)

3. Run the cell, the widget displays in the output area.

Python

slider = widgets.IntSlider()
display(slider)

4. You can use multiple display() calls to render the same widget instance multiple times,
but they remain in sync with each other.

Python

slider = widgets.IntSlider()
display(slider)
display(slider)

5. To render two widgets independent of each other, create two widget instances:

Python

slider1 = widgets.IntSlider()
slider2 = widgets.IntSlider()
display(slider1)
display(slider2)

Supported widgets

Widgets type | Widgets
Numeric widgets | IntSlider, FloatSlider, FloatLogSlider, IntRangeSlider, FloatRangeSlider, IntProgress, FloatProgress, BoundedIntText, BoundedFloatText, IntText, FloatText
Boolean widgets | ToggleButton, Checkbox, Valid
Selection widgets | Dropdown, RadioButtons, Select, SelectionSlider, SelectionRangeSlider, ToggleButtons, SelectMultiple
String widgets | Text, Text area, Combobox, Password, Label, HTML, HTML Math, Image, Button
Play (Animation) widgets | Date picker, Color picker, Controller
Container or Layout widgets | Box, HBox, VBox, GridBox, Accordion, Tabs, Stacked

Known limitations
1. The following widgets aren't supported yet; you can follow the corresponding workarounds (see the sketch after this list for the widgets.link() workaround):

Functionality | Workaround
Output widget | You can use the print() function instead to write text to stdout.
widgets.jslink() | You can use the widgets.link() function to link two similar widgets.
FileUpload widget | Not supported yet.

2. The global display function provided by Fabric doesn't support displaying multiple widgets in one call (for example, display(a, b) ), which is different from the IPython display function.

3. If you close a notebook that contains an IPython Widget, you can't see or interact with it until you execute the corresponding cell again.
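
Here's a minimal sketch of the widgets.link() workaround noted in the list above; the slider names are hypothetical, and each widget is rendered with its own display() call because of the limitation in item 2.

Python

# A minimal sketch: keep two sliders in sync with widgets.link() instead of widgets.jslink().
import ipywidgets as widgets

slider_a = widgets.IntSlider(description="a")
slider_b = widgets.IntSlider(description="b")
link = widgets.link((slider_a, "value"), (slider_b, "value"))

# Render each widget with its own display() call; display(slider_a, slider_b) isn't supported
display(slider_a)
display(slider_b)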

Python logging in Notebook


You can find Python logs and set different log levels and formats, as the following sample code shows:

Python

import logging

# Customize the logging format for all loggers


FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
formatter = logging.Formatter(fmt=FORMAT)
for handler in logging.getLogger().handlers:
handler.setFormatter(formatter)

# Customize log level for all loggers


logging.getLogger().setLevel(logging.INFO)

# Customize the log level for a specific logger


customizedLogger = logging.getLogger('customized')
customizedLogger.setLevel(logging.WARNING)

# logger that use the default global log level


defaultLogger = logging.getLogger('default')
defaultLogger.debug("default debug message")
defaultLogger.info("default info message")
defaultLogger.warning("default warning message")
defaultLogger.error("default error message")
defaultLogger.critical("default critical message")

# logger that use the customized log level


customizedLogger.debug("customized debug message")
customizedLogger.info("customized info message")
customizedLogger.warning("customized warning message")
customizedLogger.error("customized error message")
customizedLogger.critical("customized critical message")

Integrate a notebook

Designate a parameters cell


To parameterize your notebook, select the ellipses (...) to access the More commands at the
cell toolbar. Then select Toggle parameter cell to designate the cell as the parameters cell.

The parameter cell is useful for integrating a notebook in a pipeline. Pipeline activity looks
for the parameters cell and treats this cell as the default for the parameters passed in at
execution time. The execution engine adds a new cell beneath the parameters cell with input
parameters in order to overwrite the default values.
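
For example, here's a minimal sketch of what a parameters cell might contain; the parameter names and default values are hypothetical and would be overridden by values passed in from the pipeline at execution time.

Python

# A minimal sketch of a cell designated as the parameters cell.
# The names and defaults below are hypothetical; a pipeline run can override them.
input_folder = "Files/churn/raw"  # default input location
sample_rows = 1000                # default number of rows to process
is_debug = False                  # default debug flag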

Shortcut keys
Similar to Jupyter Notebooks, Fabric notebooks have a modal user interface. The keyboard
does different things depending on which mode the notebook cell is in. Fabric notebooks
support the following two modes for a given code cell: Command mode and Edit mode.

1. A cell is in Command mode when there's no text cursor prompting you to type. When a
cell is in Command mode, you can edit the notebook as a whole but not type into
individual cells. Enter Command mode by pressing ESC or using the mouse to select
outside of a cell's editor area.

2. Edit mode is indicated by a text cursor prompting you to type in the editor area. When a cell is in Edit mode, you can type into the cell. Enter Edit mode by pressing Enter or using the mouse to select a cell's editor area.

Shortcut keys under command mode

Action Notebook shortcuts

Run the current cell and select below Shift+Enter

Run the current cell and insert below Alt+Enter


Run current cell Ctrl+Enter

Select cell above Up

Select cell below Down

Select previous cell K

Select next cell J

Insert cell above A

Insert cell below B

Delete selected cells Shift + D

Switch to edit mode Enter

Shortcut keys under edit mode


Using the following keystroke shortcuts, you can easily navigate and run code in Fabric
notebooks when in Edit mode.

Action Notebook shortcuts

Move cursor up Up

Move cursor down Down

Undo Ctrl + Z

Redo Ctrl + Y

Comment or Uncomment Ctrl + / (Comment: Ctrl + K + C; Uncomment: Ctrl + K + U)

Delete word before Ctrl + Backspace

Delete word after Ctrl + Delete

Go to cell start Ctrl + Home

Go to cell end Ctrl + End

Go one word left Ctrl + Left

Go one word right Ctrl + Right

Select all Ctrl + A


Indent Ctrl + ]

Dedent Ctrl + [

Switch to command mode Esc

You can easily find all shortcut keys from the notebook ribbon View -> Keybindings.

Next steps
Notebook visualization
Introduction of Fabric MSSparkUtils

Feedback
Was this page helpful?  Yes  No

Provide product feedback | Ask the community


Notebook visualization in Microsoft
Fabric
Article • 05/31/2023

Microsoft Fabric is an integrated analytics service that accelerates time to insight, across
data warehouses and big data analytics systems. Data visualization in notebook is a key
component in being able to gain insight into your data. It helps make big and small data
easier for humans to understand. It also makes it easier to detect patterns, trends, and
outliers in groups of data.

) Important

Microsoft Fabric is in preview.

When you use Apache Spark in Microsoft Fabric, there are various built-in options to
help you visualize your data, including Microsoft Fabric notebook chart options, and
access to popular open-source libraries.

Notebook chart options


When using a Microsoft Fabric notebook, you can turn your tabular results view into a
customized chart using chart options. Here, you can visualize your data without having
to write any code.

display(df) function
The display function allows you to turn SQL query results and Apache Spark DataFrames into rich data visualizations. The display function can be used on DataFrames created in PySpark and Scala.

To access the chart options:

1. The output of %%sql magic commands appears in the rendered table view by default. You can also call display(df) on Spark DataFrames or Resilient Distributed Datasets (RDDs) to produce the rendered table view.

2. Once you have a rendered table view, switch to the Chart view.
3. You can now customize your visualization by specifying the following values:

Configuration | Description
Chart Type | The display function supports a wide range of chart types, including bar charts, scatter plots, line graphs, and more.
Key | Specify the range of values for the x-axis.
Value | Specify the range of values for the y-axis.
Series Group | Used to determine the groups for the aggregation.
Aggregation | Method to aggregate data in your visualization.

7 Note

By default the display(df) function will only take the first 1000 rows of the data
to render the charts. Select Aggregation over all results and then select
Apply to apply the chart generation from the whole dataset. A Spark job will
be triggered when the chart setting changes. Please note that it may take
several minutes to complete the calculation and render the chart.

4. Once done, you can view and interact with your final visualization!
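
As a quick end-to-end illustration, here's a minimal sketch that builds a small Spark DataFrame and renders it with display; after running it, switch the output to the Chart view to customize the visual. The sample values are made up for illustration only.

Python

# A minimal sketch, assuming a Spark session is available in the notebook.
# The sample values below are made up for illustration only.
data = [("Germany", 0.32), ("France", 0.16), ("Spain", 0.17)]
df_churn = spark.createDataFrame(data, ["Geography", "ChurnRate"])
display(df_churn)  # switch the rendered output to the Chart view to customize the visual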

display(df) summary view


You can use display(df, summary=True) to check the statistics summary of a given Apache Spark DataFrame, which includes the column name, column type, unique values,
and missing values for each column. You can also select a specific column to see its
minimum value, maximum value, mean value, and standard deviation.

displayHTML() option
Microsoft Fabric notebooks support HTML graphics using the displayHTML function.

The following image is an example of creating visualizations using D3.js.

Run the following code to create this visualization.

Python
displayHTML("""<!DOCTYPE html>
<meta charset="utf-8">

<!-- Load d3.js -->


<script src="https://d3js.org/d3.v4.js"></script>

<!-- Create a div where the graph will take place -->
<div id="my_dataviz"></div>
<script>

// set the dimensions and margins of the graph


var margin = {top: 10, right: 30, bottom: 30, left: 40},
width = 400 - margin.left - margin.right,
height = 400 - margin.top - margin.bottom;

// append the svg object to the body of the page


var svg = d3.select("#my_dataviz")
.append("svg")
.attr("width", width + margin.left + margin.right)
.attr("height", height + margin.top + margin.bottom)
.append("g")
.attr("transform",
"translate(" + margin.left + "," + margin.top + ")");

// Create Data
var data = [12,19,11,13,12,22,13,4,15,16,18,19,20,12,11,9]

// Compute summary statistics used for the box:


var data_sorted = data.sort(d3.ascending)
var q1 = d3.quantile(data_sorted, .25)
var median = d3.quantile(data_sorted, .5)
var q3 = d3.quantile(data_sorted, .75)
var interQuantileRange = q3 - q1
var min = q1 - 1.5 * interQuantileRange
var max = q1 + 1.5 * interQuantileRange

// Show the Y scale


var y = d3.scaleLinear()
.domain([0,24])
.range([height, 0]);
svg.call(d3.axisLeft(y))

// a few features for the box


var center = 200
var width = 100

// Show the main vertical line


svg
.append("line")
.attr("x1", center)
.attr("x2", center)
.attr("y1", y(min) )
.attr("y2", y(max) )
.attr("stroke", "black")
// Show the box
svg
.append("rect")
.attr("x", center - width/2)
.attr("y", y(q3) )
.attr("height", (y(q1)-y(q3)) )
.attr("width", width )
.attr("stroke", "black")
.style("fill", "#69b3a2")

// show median, min and max horizontal lines


svg
.selectAll("toto")
.data([min, median, max])
.enter()
.append("line")
.attr("x1", center-width/2)
.attr("x2", center+width/2)
.attr("y1", function(d){ return(y(d))} )
.attr("y2", function(d){ return(y(d))} )
.attr("stroke", "black")
</script>

"""
)

Popular libraries
When it comes to data visualization, Python offers multiple graphing libraries that come
packed with many different features. By default, every Apache Spark pool in Microsoft
Fabric contains a set of curated and popular open-source libraries.

Matplotlib
You can render standard plotting libraries, like Matplotlib, using the built-in rendering
functions for each library.

The following image is an example of creating a bar chart using Matplotlib.


Run the following sample code to draw this bar chart.

Python

# Bar chart

import matplotlib.pyplot as plt

x1 = [1, 3, 4, 5, 6, 7, 9]
y1 = [4, 7, 2, 4, 7, 8, 3]

x2 = [2, 4, 6, 8, 10]
y2 = [5, 6, 2, 6, 2]

plt.bar(x1, y1, label="Blue Bar", color='b')


plt.bar(x2, y2, label="Green Bar", color='g')
plt.plot()

plt.xlabel("bar number")
plt.ylabel("bar height")
plt.title("Bar Chart Example")
plt.legend()
plt.show()

Bokeh
You can render HTML or interactive libraries, like Bokeh, using displayHTML().

The following image is an example of plotting glyphs over a map using bokeh.

Run the following sample code to draw this image.

Python

from bokeh.plotting import figure, output_file


from bokeh.tile_providers import get_provider, Vendors
from bokeh.embed import file_html
from bokeh.resources import CDN
from bokeh.models import ColumnDataSource

tile_provider = get_provider(Vendors.CARTODBPOSITRON)

# range bounds supplied in web mercator coordinates


p = figure(x_range=(-9000000,-8000000), y_range=(4000000,5000000),
x_axis_type="mercator", y_axis_type="mercator")
p.add_tile(tile_provider)

# plot datapoints on the map


source = ColumnDataSource(
data=dict(x=[ -8800000, -8500000 , -8800000],
y=[4200000, 4500000, 4900000])
)

p.circle(x="x", y="y", size=15, fill_color="blue", fill_alpha=0.8,


source=source)

# create an html document that embeds the Bokeh plot


html = file_html(p, CDN, "my plot1")

# display this html


displayHTML(html)

Plotly
You can render HTML or interactive libraries like Plotly, using the displayHTML().

Run the following sample code to draw this image:

Python

from urllib.request import urlopen
import json

with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv",
                 dtype={"fips": str})

import plotly
import plotly.express as px

fig = px.choropleth(df, geojson=counties, locations='fips', color='unemp',
                    color_continuous_scale="Viridis",
                    range_color=(0, 12),
                    scope="usa",
                    labels={'unemp': 'unemployment rate'}
                    )
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})

# create an html document that embeds the Plotly plot
h = plotly.offline.plot(fig, output_type='div')

# display this html
displayHTML(h)

Pandas
You can view the HTML output of a pandas DataFrame as the default output; the notebook automatically shows the styled HTML content.

Python

import pandas as pd
import numpy as np

df = pd.DataFrame([[38.0, 2.0, 18.0, 22.0, 21, np.nan], [19, 439, 6, 452, 226, 232]],
                  index=pd.Index(['Tumour (Positive)', 'Non-Tumour (Negative)'], name='Actual Label:'),
                  columns=pd.MultiIndex.from_product([['Decision Tree', 'Regression', 'Random'],
                                                      ['Tumour', 'Non-Tumour']],
                                                     names=['Model:', 'Predicted:']))

df

Next steps
Use a notebook with lakehouse to explore your data
Explore the data in your lakehouse with
a notebook
Article • 06/08/2023

In this tutorial, learn how to explore the data in your lakehouse with a notebook.

) Important

Microsoft Fabric is in preview.

Prerequisites
To get started, you need the following prerequisites:

A Microsoft Fabric tenant account with an active subscription. Create an account


for free.
Read the Lakehouse overview.

Open or create a notebook from a lakehouse


To explore your lakehouse data, you can add the lakehouse to an existing notebook or
create a new notebook from the lakehouse.

Open a lakehouse from an existing notebook


Select the notebook from the notebook list and then select Add. The notebook opens
with your current lakehouse added to the notebook.

Open a lakehouse from a new notebook


You can create a new notebook in the same workspace and the current lakehouse
appears in that notebook.
Switch lakehouses and set a default
You can add multiple lakehouses to the same notebook. By switching the available
lakehouse in the left panel, you can explore the structure and the data from different
lakehouses.

In the lakehouse list, the pin icon next to the name of a lakehouse indicates that it's the
default lakehouse in your current notebook. In the notebook code, if only a relative path
is provided to access the data from the Microsoft Fabric OneLake, then the default
lakehouse is served as the root folder at run time.

To switch to a different default lakehouse, move the pin icon.

7 Note

The default lakehouse decides which Hive metastore to use when running the notebook with Spark SQL. If multiple lakehouses are added to the notebook, make sure that when Spark SQL is used, the target lakehouse and the current default lakehouse are from the same workspace.
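
For example, here's a minimal sketch of querying a table in the default lakehouse with Spark SQL from a PySpark cell; the table name customer_churn_test_predictions is an assumption reused from the tutorial earlier in this document.

Python

# A minimal sketch: the default lakehouse's metastore resolves the table name below.
# "customer_churn_test_predictions" is an assumption reused from the earlier tutorial.
df = spark.sql("SELECT COUNT(*) AS row_count FROM customer_churn_test_predictions")
display(df)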

Add or remove a lakehouse


Selecting the X icon next to a lakehouse name removes it from the notebook, but the
lakehouse item still exists in the workspace.

To remove all the lakehouses from the notebook, click "Remove all Lakehouses" in the
lakehouse list.
Select Add lakehouse to add more lakehouses to the notebook. You can either add an
existing one or create a new one.

Explore the lakehouse data


The structure of the lakehouse shown in the notebook is the same as the one in the Lakehouse view. For details, see the Lakehouse overview. When you select a file or folder, the content area shows the details of the selected item.
7 Note

The notebook will be created under your current workspace.

Next steps
How to use a notebook to load data into your lakehouse
Use a notebook to load data into your
Lakehouse
Article • 05/23/2023

In this tutorial, learn how to read and write data in your lakehouse with a notebook. The Spark API and the Pandas API are supported to achieve this goal.

) Important

Microsoft Fabric is in preview.

Load data with an Apache Spark API


In the code cell of the notebook, use the following code example to read data from the
source and load it into Files, Tables, or both sections of your lakehouse.

To specify the location to read from, you can use the relative path if the data is from the default lakehouse of the current notebook, or you can use the absolute ABFS path if the data is from another lakehouse. You can copy this path from the context menu of the data:

Copy ABFS path : returns the absolute path of the file.

Copy relative path for Spark : returns the relative path of the file in the default lakehouse.

Python

df = spark.read.parquet("location to read from")

# Keep it if you want to save dataframe as CSV files to Files section of the default lakehouse
df.write.mode("overwrite").format("csv").save("Files/" + csv_table_name)

# Keep it if you want to save dataframe as Parquet files to Files section of the default lakehouse
df.write.mode("overwrite").format("parquet").save("Files/" + parquet_table_name)

# Keep it if you want to save dataframe as a delta lake, parquet table to Tables section of the default lakehouse
df.write.mode("overwrite").format("delta").saveAsTable(delta_table_name)

# Keep it if you want to save the dataframe as a delta lake, appending the data to an existing table
df.write.mode("append").format("delta").saveAsTable(delta_table_name)

Load data with a Pandas API


To support Pandas API, the default Lakehouse will be automatically mounted to the
notebook. The mount point is '/lakehouse/default/'. You can use this mount point to
read/write data from/to the default lakehouse. The "Copy File API Path" option from the
context menu will return the File API path from that mount point. The path returned
from the option Copy ABFS path also works for Pandas API.
Copy File API path : returns the path under the mount point of the default lakehouse.

Python

# Keep it if you want to read parquet file with Pandas from the default lakehouse mount point
import pandas as pd
df = pd.read_parquet("/lakehouse/default/Files/sample.parquet")

# Keep it if you want to read parquet file with Pandas from the absolute abfss path
import pandas as pd
df = pd.read_parquet("abfss://DevExpBuildDemo@msit-onelake.dfs.fabric.microsoft.com/Marketing_LH.Lakehouse/Files/sample.parquet")

 Tip

For Spark API, please use the option of Copy ABFS path or Copy relative path for
Spark to get the path of the file. For Pandas API, please use the option of Copy
ABFS path or Copy File API path to get the path of the file.

The quickest way to get working code for the Spark API or the Pandas API is to use the Load data option and select the API you want to use. The code is automatically generated in a new code cell of the notebook.

Next steps
Explore the data in your lakehouse with a notebook
How-to use end-to-end AI samples in
Microsoft Fabric
Article • 09/15/2023

The Synapse Data Science experience in Microsoft Fabric SaaS aims to enable ML professionals to easily and frictionlessly build, deploy, and operationalize their machine learning models in a single analytics platform, while collaborating with other key roles. Begin here to understand the various capabilities the Synapse Data Science experience offers, and explore examples of how ML models can address your common business problems.

) Important

Microsoft Fabric is in preview.

Install Python libraries


Some of the end-to-end AI samples require use of additional libraries when developing
machine learning models or doing ad-hoc data analysis. You can quickly install these
libraries for your Apache Spark session in one of two ways:

Use the in-line installation capabilities (such as %pip or %conda ) in your notebook
Install libraries directly in your current workspace

Install with in-line installation capabilities

You can use the in-line installation capabilities (for example, %pip or %conda ) within your
notebook to install new libraries. This installation option would install the libraries only
in the current notebook and not in the workspace.

To install a library, use the following code, replacing <library name> with the name of
your desired library, such as imblearn , wordcloud , etc.

Python

# Use pip to install libraries


%pip install <library name>
# Use conda to install libraries
%conda install <library name>

For more information on in-line installation capabilities, see In-line installation.

Install directly in your workspace


Alternatively, you can install libraries in your workspace so that they're available for use
in any notebooks that are in the workspace.

) Important

Only your Workspace admin has access to update the Workspace-level settings.

For more information on installing libraries in your workspace, see Install workspace
libraries.

Bank customer churn


Build a model to predict whether bank customers will churn or not. The churn rate, also known as the rate of attrition, refers to the rate at which bank customers stop doing business with the bank.

Follow along in the Customer churn prediction tutorial.

Recommender
An online bookstore is looking to increase sales by providing customized
recommendations. Using customer book rating data in this sample you'll see how to
clean, explore the data leading to developing and deploying a recommendation to
provide predictions.

Follow along in the Train a retail recommendation model tutorial.

Fraud detection
As unauthorized transactions increase, detecting credit card fraud in real time helps financial institutions provide their customers a faster turnaround time on resolution. This end-to-end sample includes preprocessing, training, model storage, and inferencing. The training section reviews implementing multiple models and methods that address challenges like imbalanced examples and trade-offs between false positives and false negatives.

Follow along in the Fraud detection tutorial.

Forecasting
Using historical New York City Property Sales data and Facebook Prophet in this sample,
we'll build a time series model with the trend, seasonality and holiday information to
forecast what sales will look like in future cycles.

Follow along in the Forecasting tutorial.

Text classification
In this sample, we'll predict whether a book in the British Library is fiction or non-fiction
based on book metadata. This will be accomplished by applying text classification with
word2vec and linear-regression model on Spark.

Follow along in the Text classification tutorial.

Uplift model
In this sample, we'll estimate the causal impact of certain treatments on an individual's
behavior by using an Uplift model. We'll walk through step by step how to create, train
and evaluate the model touching on four core learnings:

Data-processing module: extracts features, treatments, and labels.


Training module: targets to predict the difference between an individual's behavior
when there's a treatment and when there's no treatment, using a classical machine
learning model like lightGBM.
Prediction module: calls the uplift model to predict on test data.
Evaluation module: evaluates the effect of the uplift model on test data.

Follow along in the Healthcare causal impact of treatments tutorial.

Predictive maintenance
In this tutorial, you proactively predict mechanical failures. This is accomplished by
training multiple models on historical data such as temperature and rotational speed,
then determining which model is the best fit for predicting future failures.
Follow along in the Predictive maintenance tutorial.

Next steps
How to use Microsoft Fabric notebooks
Machine learning model in Microsoft Fabric



Tutorial: Create, evaluate, and score a churn prediction model
Article • 09/15/2023

In this tutorial, you'll see a Microsoft Fabric data science workflow with an end-to-end example. The scenario is to build a model to predict whether bank customers will churn or not. The churn rate, also known as the rate of attrition, refers to the rate at which bank customers stop doing business with the bank.

) Important

Microsoft Fabric is in preview.

The main steps in this tutorial are

" Install custom libraries


" Load the data
" Understand and process the data through exploratory data analysis and demonstrate the use of Fabric Data Wrangler feature
" Train machine learning models using Scikit-Learn and LightGBM , and track experiments using MLflow and Fabric Autologging feature
" Evaluate and save the final machine learning model
" Demonstrate the model performance via visualizations in Power BI

Prerequisites
A Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview) trial.

Sign in to Microsoft Fabric .

Switch to the Data Science experience using the experience switcher icon at the left corner of your homepage.

If you don't have a Microsoft Fabric lakehouse, create one by following the steps in Create a lakehouse in Microsoft Fabric.

Follow along in notebook


AIsample - Bank Customer Churn.ipynb is the notebook that accompanies this tutorial.

If you'd like to open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science to import
the tutorial notebooks to your workspace.

Or, create a new notebook if you'd rather copy/paste the code from this page.

Be sure to attach a lakehouse to the notebook before you start running code.

Step 1: Install custom libraries


When developing a machine learning model or doing ad-hoc data analysis, you may need to quickly install a custom library for your
Apache Spark session. To do so, use %pip install or %conda install . Alternatively, you could install the required libraries into the
workspace, by navigating into the workspace setting to find Library management.

Here, you'll use %pip install to install imblearn .

7 Note
The PySpark kernel will restart after %pip install . Install libraries before you run any other cells.

Python

# Use pip to install libraries


%pip install imblearn

Step 2: Load the data


The dataset contains churn status of 10,000 customers along with 14 attributes that include credit score, geographical location (Germany,
France, Spain), gender (male, female), age, tenure (years of being bank's customer), account balance, estimated salary, number of products
that a customer has purchased through the bank, credit card status (whether a customer has a credit card or not), and active member
status (whether an active bank's customer or not).

The dataset also includes columns such as row number, customer ID, and customer surname that should have no impact on the customer's decision to leave the bank. The event that defines the customer's churn is the closing of the customer's bank account; therefore, the Exited column in the dataset refers to the customer's abandonment. Since you don't have much context about these attributes, you'll proceed without having background information about the dataset. Your aim is to understand how these attributes contribute to the Exited status.

Out of the 10,000 customers, only 2037 customers (around 20%) have left the bank. Therefore, given the class imbalance ratio, we
recommend generating synthetic data. Moreover, confusion matrix accuracy may not be meaningful for imbalanced classification and it
might be better to also measure the accuracy using the Area Under the Precision-Recall Curve (AUPRC).
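
For reference, here's a minimal sketch of computing AUPRC with scikit-learn; y_true (actual labels) and y_scores (predicted probabilities) are hypothetical names for arrays you would prepare during evaluation.

Python

# A minimal sketch: AUPRC via scikit-learn's average_precision_score.
# y_true and y_scores are hypothetical placeholders for labels and predicted probabilities.
from sklearn.metrics import average_precision_score

auprc = average_precision_score(y_true, y_scores)
print(f"AUPRC: {auprc:.3f}")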

churn.csv

"CustomerID" | "Surname" | "CreditScore" | "Geography" | "Gender" | "Age" | "Tenure" | "Balance" | "NumOfProducts" | "HasCrCard" | "IsActiveMember"
15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1
15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1

Introduction to SMOTE
The problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the
decision boundary. Synthetic Minority Oversampling Technique (SMOTE) is the most widely used approach to synthesize new samples for
the minority class. Learn more about SMOTE here and here .

You will be able to access SMOTE using the imblearn library that you installed in Step 1.
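
Here's a minimal sketch of how SMOTE is typically applied with imblearn; X_train and y_train are hypothetical names for the training features and labels you would prepare in the later modeling steps.

Python

# A minimal sketch of oversampling the minority class with SMOTE from imblearn.
# X_train and y_train are hypothetical placeholders prepared in later steps.
from imblearn.over_sampling import SMOTE
import pandas as pd

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
print(pd.Series(y_res).value_counts())  # the minority class is oversampled to balance the classes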

Download dataset and upload to lakehouse

 Tip

By defining the following parameters, you can use this notebook with different datasets easily.

Python

IS_CUSTOM_DATA = False # if TRUE, dataset has to be uploaded manually

IS_SAMPLE = False # if TRUE, use only SAMPLE_ROWS of data for training, otherwise use all data
SAMPLE_ROWS = 5000 # if IS_SAMPLE is True, use only this number of rows for training

DATA_ROOT = "/lakehouse/default"
DATA_FOLDER = "Files/churn" # folder with data files
DATA_FILE = "churn.csv" # data file name

This code downloads a publicly available version of the dataset and then stores it in a Fabric lakehouse.

) Important

Make sure you add a lakehouse to the notebook before running it. Failure to do so will result in an error.

Python
import os, requests
if not IS_CUSTOM_DATA:
    # Using synapse blob, this can be done in one line

    # Download demo data files into lakehouse if not exist
    remote_url = "https://synapseaisolutionsa.blob.core.windows.net/public/bankcustomerchurn"
    file_list = ["churn.csv"]
    download_path = "/lakehouse/default/Files/churn/raw"

    if not os.path.exists("/lakehouse/default"):
        raise FileNotFoundError(
            "Default lakehouse not found, please add a lakehouse and restart the session."
        )
    os.makedirs(download_path, exist_ok=True)
    for fname in file_list:
        if not os.path.exists(f"{download_path}/{fname}"):
            r = requests.get(f"{remote_url}/{fname}", timeout=30)
            with open(f"{download_path}/{fname}", "wb") as f:
                f.write(r.content)
    print("Downloaded demo data files into lakehouse.")

Start recording the time it takes to run this notebook.

Python

# Record the notebook running time


import time

ts = time.time()

Read raw data from the lakehouse


Read the raw data from the Files section of the lakehouse.

Python

df = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv("Files/churn/raw/churn.csv")
.cache()
)

Create a pandas dataframe from the dataset


This code converts the Spark DataFrame to a pandas DataFrame for easier processing and visualization.

Python

df = df.toPandas()

Step 3: Exploratory Data Analysis

Display raw data


Explore the raw data with display , do some basic statistics, and show chart views. You first need to import the required libraries for data visualization, such as seaborn , a Python data visualization library that provides a high-level interface for building visuals on dataframes and arrays. Learn more about seaborn .

Python

import seaborn as sns


sns.set_theme(style="whitegrid", palette="tab10", rc = {'figure.figsize':(9,6)})
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from matplotlib import rc, rcParams
import numpy as np
import pandas as pd
import itertools

Python

display(df, summary=True)

Use Data Wrangler to perform initial data cleansing


Launch Data Wrangler directly from the notebook to explore and transform any pandas dataframe. Under the notebook ribbon Data tab,
you can use the Data Wrangler dropdown prompt to browse the activated pandas DataFrames available for editing and select the one you
wish to open in Data Wrangler.

7 Note

Data Wrangler can't be opened while the notebook kernel is busy. The cell execution must complete prior to launching Data
Wrangler. Learn more about Data Wrangler .

Once Data Wrangler launches, a descriptive overview of the selected DataFrame is displayed, as shown in the following images. It
includes information about the DataFrame's dimensions, missing values, and more. You can then use Data Wrangler to generate a script that
drops the rows with missing values, the duplicate rows, and the columns with specific names, and then copy the script into a cell. The next
cell shows that copied script.
Python
def clean_data(df):
    # Drop rows with missing data across all columns
    df.dropna(inplace=True)
    # Drop duplicate rows in columns: 'RowNumber', 'CustomerId'
    df.drop_duplicates(subset=['RowNumber', 'CustomerId'], inplace=True)
    # Drop columns: 'RowNumber', 'CustomerId', 'Surname'
    df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
    return df

df_clean = clean_data(df.copy())

Determine attributes
Use this code to determine categorical, numerical, and target attributes.

Python

# Determine the dependent (target) attribute
dependent_variable_name = "Exited"
print(dependent_variable_name)

# Determine the categorical attributes
categorical_variables = [
    col
    for col in df_clean.columns
    if (df_clean[col].dtype == "object" or df_clean[col].nunique() <= 5)
    and col != "Exited"
]
print(categorical_variables)

# Determine the numerical attributes
numeric_variables = [
    col
    for col in df_clean.columns
    if df_clean[col].dtype != "object" and df_clean[col].nunique() > 5
]
print(numeric_variables)

The five-number summary


Show the five-number summary (the minimum score, first quartile, median, third quartile, and maximum score) for the numerical attributes,
using box plots.

Python

df_num_cols = df_clean[numeric_variables]
sns.set(font_scale=0.7)
fig, axes = plt.subplots(nrows=2, ncols=3, gridspec_kw=dict(hspace=0.3), figsize=(17, 8))
fig.tight_layout()
for ax, col in zip(axes.flatten(), df_num_cols.columns):
    sns.boxplot(x=df_num_cols[col], color='green', ax=ax)
# fig.suptitle('visualize and compare the distribution and central tendency of numerical attributes', color='k', fontsize=12)
fig.delaxes(axes[1, 2])
Distribution of exited and non-exited customers
Show the distribution of exited versus non-exited customers across the categorical attributes.

Python

attr_list = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'NumOfProducts', 'Tenure']

fig, axarr = plt.subplots(2, 3, figsize=(15, 4))
for ind, item in enumerate(attr_list):
    sns.countplot(x=item, hue='Exited', data=df_clean, ax=axarr[ind % 2][ind // 2])
fig.subplots_adjust(hspace=0.7)

Distribution of numerical attributes


Show the frequency distribution of the numerical attributes using histograms.

Python

columns = df_num_cols.columns[: len(df_num_cols.columns)]

fig = plt.figure()
fig.set_size_inches(18, 8)
length = len(columns)
for i, j in itertools.zip_longest(columns, range(length)):
    plt.subplot((length // 2), 3, j + 1)
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    df_num_cols[i].hist(bins=20, edgecolor='black')
    plt.title(i)
# fig = fig.suptitle('distribution of numerical attributes', color='r', fontsize=14)
plt.show()

Perform feature engineering


The following feature engineering generates new attributes based on current attributes.
Python

df_clean["NewTenure"] = df_clean["Tenure"]/df_clean["Age"]
df_clean["NewCreditsScore"] = pd.qcut(df_clean['CreditScore'], 6, labels = [1, 2, 3, 4, 5, 6])
df_clean["NewAgeScore"] = pd.qcut(df_clean['Age'], 8, labels = [1, 2, 3, 4, 5, 6, 7, 8])
df_clean["NewBalanceScore"] = pd.qcut(df_clean['Balance'].rank(method="first"), 5, labels = [1, 2, 3, 4, 5])
df_clean["NewEstSalaryScore"] = pd.qcut(df_clean['EstimatedSalary'], 10, labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Use Data Wrangler to perform one-hot encoding


Following the same instructions discussed earlier to launch Data Wrangler, use Data Wrangler to perform one-hot encoding on the
Geography and Gender columns. The next cell shows the generated script that you copy for one-hot encoding.
Python

df_clean = pd.get_dummies(df_clean, columns=['Geography', 'Gender'])


Create a delta table to generate the Power BI report
Python

table_name = "df_clean"
# Create PySpark DataFrame from Pandas
sparkDF=spark.createDataFrame(df_clean)
sparkDF.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")

Summary of observations from the exploratory data analysis


Most of the customers are from France compared to Spain and Germany, while Spain has a lower churn rate than France and
Germany.
Most of the customers have credit cards.
There are customers whose age is above 60 and whose credit score is below 400, but they can't be considered outliers.
Very few customers have more than two of the bank's products.
Customers who aren't active have a higher churn rate.
Gender and tenure years don't seem to have an impact on a customer's decision to close the bank account.

Step 4: Model training and tracking


With your data in place, you can now define the model. You'll apply Random Forest and LightGBM models in this notebook.

Use scikit-learn and lightgbm to implement the models within a few lines of code. Also use MLflow and Fabric autologging to track the
experiments.

Here, you load the delta table from the lakehouse. You can use other delta tables that use the lakehouse as the source.

Python

SEED = 12345
df_clean = spark.read.format("delta").load("Tables/df_clean").toPandas()

Generate experiment for tracking and logging the models using MLflow
This section demonstrates how to generate an experiment, specify model and training parameters as well as scoring metrics, train the
models, log them, and save the trained models for later use.

Python

import mlflow

# Set up experiment name


EXPERIMENT_NAME = "sample-bank-churn-experiment" # Mlflow experiment name

Autologging extends the MLflow autologging capabilities and works by automatically capturing the values of input parameters and output
metrics of a machine learning model as it is being trained. This information is then logged to your workspace, where it can be accessed and
visualized by using the MLflow APIs or the corresponding experiment in your workspace.

When complete, your experiment will look like this image. All the experiments with their respective names are logged and you'll be able to
track their parameters and performance metrics. To learn more about autologging, see Autologging in Microsoft Fabric .
Set experiment and autologging specifications
Python

mlflow.set_experiment(EXPERIMENT_NAME)  # Set the MLflow experiment name


mlflow.autolog(exclusive=False)

Import scikit-learn and LightGBM


Python

# Import the required libraries for model training


from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, confusion_matrix,
    recall_score, roc_auc_score, classification_report,
)

Prepare training and test datasets


Python

y = df_clean["Exited"]
X = df_clean.drop("Exited",axis=1)
# Train-Test Separation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=SEED)

Apply SMOTE to the training data to synthesize new samples for the minority class
SMOTE should only be applied to the training dataset. You must leave the test dataset in its original imbalanced distribution in order to get
a valid approximation of how the model will perform on the original data, which represents the situation in production.

Python

from collections import Counter


from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=SEED)
X_res, y_res = sm.fit_resample(X_train, y_train)
new_train = pd.concat([X_res, y_res], axis=1)
Model training
Train the model by using Random Forest with a maximum depth of four and four features.

Python

mlflow.sklearn.autolog(registered_model_name='rfc1_sm')  # Register the trained model with autologging
rfc1_sm = RandomForestClassifier(max_depth=4, max_features=4, min_samples_split=3, random_state=1)  # Pass hyperparameters

with mlflow.start_run(run_name="rfc1_sm") as run:
    rfc1_sm_run_id = run.info.run_id  # Capture run_id for model prediction later
    print("run_id: {}; status: {}".format(rfc1_sm_run_id, run.info.status))
    # rfc1.fit(X_train, y_train) # Imbalanced training data
    rfc1_sm.fit(X_res, y_res.ravel())  # Balanced training data
    rfc1_sm.score(X_test, y_test)
    y_pred = rfc1_sm.predict(X_test)
    cr_rfc1_sm = classification_report(y_test, y_pred)
    cm_rfc1_sm = confusion_matrix(y_test, y_pred)
    roc_auc_rfc1_sm = roc_auc_score(y_res, rfc1_sm.predict_proba(X_res)[:, 1])

Train the model by using Random Forest with a maximum depth of eight and six features.

Python

mlflow.sklearn.autolog(registered_model_name='rfc2_sm')  # Register the trained model with autologging
rfc2_sm = RandomForestClassifier(max_depth=8, max_features=6, min_samples_split=3, random_state=1)  # Pass hyperparameters

with mlflow.start_run(run_name="rfc2_sm") as run:
    rfc2_sm_run_id = run.info.run_id  # Capture run_id for model prediction later
    print("run_id: {}; status: {}".format(rfc2_sm_run_id, run.info.status))
    # rfc2.fit(X_train, y_train) # Imbalanced training data
    rfc2_sm.fit(X_res, y_res.ravel())  # Balanced training data
    rfc2_sm.score(X_test, y_test)
    y_pred = rfc2_sm.predict(X_test)
    cr_rfc2_sm = classification_report(y_test, y_pred)
    cm_rfc2_sm = confusion_matrix(y_test, y_pred)
    roc_auc_rfc2_sm = roc_auc_score(y_res, rfc2_sm.predict_proba(X_res)[:, 1])

Train the model using LightGBM.

Python

# lgbm_model
mlflow.lightgbm.autolog(registered_model_name='lgbm_sm')  # Register the trained model with autologging
lgbm_sm_model = LGBMClassifier(
    learning_rate=0.07,
    max_delta_step=2,
    n_estimators=100,
    max_depth=10,
    eval_metric="logloss",
    objective='binary',
    random_state=42,
)

with mlflow.start_run(run_name="lgbm_sm") as run:
    lgbm1_sm_run_id = run.info.run_id  # Capture run_id for model prediction later
    # lgbm_sm_model.fit(X_train, y_train) # Imbalanced training data
    lgbm_sm_model.fit(X_res, y_res.ravel())  # Balanced training data
    y_pred = lgbm_sm_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    cr_lgbm_sm = classification_report(y_test, y_pred)
    cm_lgbm_sm = confusion_matrix(y_test, y_pred)
    roc_auc_lgbm_sm = roc_auc_score(y_res, lgbm_sm_model.predict_proba(X_res)[:, 1])

Experiments artifact for tracking model performance


The experiment runs are automatically saved in the experiment artifact, which you can find in the workspace. They're named based on the
name used to set the experiment. All of the trained models, their runs, performance metrics, and model parameters are logged on the
experiment page, as shown in the image below. You can also list the runs programmatically, as shown after the following steps.

To view your experiments:

1. On the left panel, select your workspace.


2. Find and select the experiment name, in this case sample-bank-churn-experiment.
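
If you'd rather inspect the logged runs programmatically than through the UI, a minimal sketch like the following (assuming the EXPERIMENT_NAME variable set earlier in this notebook) retrieves the runs and their autologged metrics through the MLflow API:

Python

import mlflow

# Retrieve all runs that were logged to this notebook's experiment
runs = mlflow.search_runs(experiment_names=[EXPERIMENT_NAME])

# Show the run names together with the autologged metric columns, if present
cols = [c for c in runs.columns if c.startswith("metrics.") or c == "tags.mlflow.runName"]
print(runs[cols].head())
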
Step 5: Evaluate and save the final machine learning model
Open the saved experiment from the workspace to select and save the best model.

Python

# Define run_uri to fetch the model


# mlflow client: mlflow.model.url, list model
load_model_rfc1_sm = mlflow.sklearn.load_model(f"runs:/{rfc1_sm_run_id}/model")
load_model_rfc2_sm = mlflow.sklearn.load_model(f"runs:/{rfc2_sm_run_id}/model")
load_model_lgbm1_sm = mlflow.lightgbm.load_model(f"runs:/{lgbm1_sm_run_id}/model")

Assess the performances of the saved models on test dataset


Python

ypred_rfc1_sm = load_model_rfc1_sm.predict(X_test) # Random Forest with max depth of 4 and 4 features


ypred_rfc2_sm = load_model_rfc2_sm.predict(X_test) # Random Forest with max depth of 8 and 6 features
ypred_lgbm1_sm = load_model_lgbm1_sm.predict(X_test) # LightGBM

Show True/False Positives/Negatives using the confusion matrix


Develop a script to plot the confusion matrix in order to evaluate the accuracy of the classification. You can also plot a confusion matrix
using SynapseML tools, which is shown in the Fraud Detection sample .

Python

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    print(cm)
    plt.figure(figsize=(4, 4))
    plt.rcParams.update({'font.size': 10})
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, color="blue")
    plt.yticks(tick_marks, classes, color="blue")

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="red" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Create a confusion matrix for Random Forest Classifier with maximum depth of four, with four features.

Python

cfm = confusion_matrix(y_test, y_pred=ypred_rfc1_sm)


plot_confusion_matrix(cfm, classes=['Non Churn','Churn'],
title='Random Forest with max depth of 4')
tn, fp, fn, tp = cfm.ravel()

Create a confusion matrix for Random Forest Classifier with maximum depth of eight, with six features.

Python

cfm = confusion_matrix(y_test, y_pred=ypred_rfc2_sm)


plot_confusion_matrix(cfm, classes=['Non Churn','Churn'],
title='Random Forest with max depth of 8')
tn, fp, fn, tp = cfm.ravel()

Create the confusion matrix for LightGBM.

Python

cfm = confusion_matrix(y_test, y_pred=ypred_lgbm1_sm)


plot_confusion_matrix(cfm, classes=['Non Churn','Churn'],
title='LightGBM')
tn, fp, fn, tp = cfm.ravel()

Save results for Power BI


Save the model prediction results as a delta table in the lakehouse so that you can visualize them in Power BI.

Python

df_pred = X_test.copy()
df_pred['y_test'] = y_test
df_pred['ypred_rfc1_sm'] = ypred_rfc1_sm
df_pred['ypred_rfc2_sm'] =ypred_rfc2_sm
df_pred['ypred_lgbm1_sm'] = ypred_lgbm1_sm
table_name = "df_pred_results"
sparkDF=spark.createDataFrame(df_pred)
sparkDF.write.mode("overwrite").format("delta").option("overwriteSchema", "true").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")

Step 6: Business Intelligence via Visualizations in Power BI


Use these steps to access your saved table in Power BI.

1. On the left, select OneLake data hub.


2. Select the lakehouse that you added to this notebook.
3. On the top right, select Open under the section titled Open this Lakehouse.
4. Select New Power BI dataset on the top ribbon and select df_pred_results , then select Continue to create a new Power BI dataset
linked to the predictions.
5. On the tools at the top of the dataset page, select New report to open the Power BI report authoring page.

Some example visualizations are shown here. The data panel shows the delta tables and columns from the table to select. Upon selecting
appropriate x and y axes, you can pick the filters and functions, for example, sum or average of the table column.

7 Note

This is an illustrated example of how you might analyze the saved prediction results in Power BI. For a real customer churn use case, you
may need to ideate more thoroughly about which visualizations to create, based on subject matter expertise and on the metrics that your
firm and business analytics team have standardized.
The Power BI report shows that customers who use more than two of the bank's products have a higher churn rate, although few customers
had more than two products. The bank should collect more data, but should also investigate other features that are correlated with the
number of products (see the plot in the bottom-left panel). Bank customers in Germany have a higher churn rate than those in France and
Spain (see the plot in the bottom-right panel), which suggests that an investigation into what has encouraged customers to leave could be
beneficial. There are more middle-aged customers (between 25 and 45), and customers between 45 and 60 tend to exit more. Finally,
customers with lower credit scores would most likely leave the bank for other financial institutions. The bank should look into ways to
encourage customers with lower credit scores and account balances to stay with the bank.

Python

# Determine the entire runtime


print(f"Full run cost {int(time.time() - ts)} seconds.")

Next steps
Machine learning model in Microsoft Fabric
Train machine learning models
Machine learning experiments in Microsoft Fabric



Create, evaluate, and score a recommendation system in
Microsoft Fabric
Article • 08/25/2023

In this tutorial, you walk through the data engineering and data science workflow with an end-to-end example. The scenario is to build a
recommender for online book recommendations. The steps you'll take are:

" Upload the data into a Lakehouse


" Perform exploratory data analysis on the data
" Train a model and log it with MLflow
" Load the model and make predictions

) Important

Microsoft Fabric is in preview.

There are different types of recommendation algorithms. This tutorial uses a model based collaborative filtering algorithm named
Alternating Least Squares (ALS) matrix factorization.

ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, U and V, U * Vt = R. Typically these
approximations are called 'factor' matrices.

The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least
squares. The newly solved factor matrix is then held constant while solving for the other factor matrix.
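
To make the alternating updates concrete, here's a minimal, self-contained sketch (not the Spark ALS implementation used later in this tutorial) that factorizes a tiny dense ratings matrix with regularized least squares. The toy matrix, rank, and regularization value are illustrative only.

Python

import numpy as np

# Toy ratings matrix R (users x items). Real ALS implementations work on sparse data
# and only fit the observed entries; this sketch treats every cell as observed.
rng = np.random.default_rng(0)
R = np.array([[5., 3., 0.], [4., 0., 1.]])
k = 2                                   # rank of the factorization
U = rng.normal(size=(R.shape[0], k))    # user factor matrix
V = rng.normal(size=(R.shape[1], k))    # item factor matrix
reg = 0.1                               # regularization strength

for _ in range(10):
    # Hold V fixed and solve for U with regularized least squares
    U = R @ V @ np.linalg.inv(V.T @ V + reg * np.eye(k))
    # Hold U fixed and solve for V
    V = R.T @ U @ np.linalg.inv(U.T @ U + reg * np.eye(k))

print(np.round(U @ V.T, 2))  # approximation of R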

Prerequisites
A Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview) trial.

Sign in to Microsoft Fabric .


Switch to the Data Science experience using the experience switcher icon at the left corner of your homepage.

If you don't have a Microsoft Fabric lakehouse, create one by following the steps in Create a lakehouse in Microsoft Fabric.

Follow along in notebook


AIsample - Book Recommendation.ipynb is the notebook that accompanies this tutorial.

If you'd like to open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science to import
the tutorial notebooks to your workspace.

Or, create a new notebook if you'd rather copy/paste the code from this page.

Be sure to attach a lakehouse to the notebook before you start running code.

Step 1: Load the data


The book recommendation dataset in this scenario consists of three separate datasets, Books.csv, Ratings.csv, and Users.csv.

Books.csv - Each book is identified with an International Standard Book Number (ISBN), with invalid dates already removed. Additional
information, such as the title, author, and publisher, has also been added. If a book has multiple authors, only the first is listed. URLs
point to Amazon for cover images in three different sizes.

(Only the first columns of each record are shown.)

ISBN        Book-Title           Book-Author           Year-Of-Publication  Publisher                Image-URL-S

0195153448  Classical Mythology  Mark P. O. Morford    2002                 Oxford University Press  http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg

0002005018  Clara Callan         Richard Bruce Wright  2001                 HarperFlamingo Canada    http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg

Ratings.csv - Ratings for each book are either explicit (provided by users on a scale of 1-10) or implicit (observed without user input,
and indicated by 0).

User-ID ISBN Book-Rating

276725 034545104X 0

276726 0155061224 5

Users.csv - User IDs, which have been anonymized and mapped to integers. Demographic data such as location and age, are provided
if available. If unavailable, the value is null.

User-ID Location Age

1 "nyc new york usa"

2 "stockton california usa" 18.0

By defining the following parameters, you can more easily apply the code in this tutorial to different datasets.

Python
IS_CUSTOM_DATA = False # if True, dataset has to be uploaded manually

USER_ID_COL = "User-ID" # must not be '_user_id' for this notebook to run successfully
ITEM_ID_COL = "ISBN" # must not be '_item_id' for this notebook to run successfully
ITEM_INFO_COL = (
"Book-Title" # must not be '_item_info' for this notebook to run successfully
)
RATING_COL = (
"Book-Rating" # must not be '_rating' for this notebook to run successfully
)
IS_SAMPLE = True # if True, use only <SAMPLE_ROWS> rows of data for training; otherwise use all data
SAMPLE_ROWS = 5000 # if IS_SAMPLE is True, use only this number of rows for training

DATA_FOLDER = "Files/book-recommendation/" # folder containing the datasets


ITEMS_FILE = "Books.csv" # file containing the items information
USERS_FILE = "Users.csv" # file containing the users information
RATINGS_FILE = "Ratings.csv" # file containing the ratings information

EXPERIMENT_NAME = "aisample-recommendation" # mlflow experiment name

Download dataset and upload to Lakehouse


The following code downloads the dataset and then stores it in the Lakehouse.

) Important

Add a Lakehouse to the notebook before running this code.

Python

if not IS_CUSTOM_DATA:
    # Download data files into lakehouse if they do not exist
    import os, requests

    remote_url = "https://synapseaisolutionsa.blob.core.windows.net/public/Book-Recommendation-Dataset"
    file_list = ["Books.csv", "Ratings.csv", "Users.csv"]
    download_path = f"/lakehouse/default/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        raise FileNotFoundError(
            "Default lakehouse not found, please add a lakehouse and restart the session."
        )
    os.makedirs(download_path, exist_ok=True)
    for fname in file_list:
        if not os.path.exists(f"{download_path}/{fname}"):
            r = requests.get(f"{remote_url}/{fname}", timeout=30)
            with open(f"{download_path}/{fname}", "wb") as f:
                f.write(r.content)
    print("Downloaded demo data files into lakehouse.")

Set up the MLflow experiment tracking


Use the following code to set up the MLflow experiment tracking. Autologging is disabled for this example. For more information, see the
Autologging article.

Python

# Setup mlflow for experiment tracking


import mlflow

mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(disable=True) # disable mlflow autologging

Read data from Lakehouse


Once the right data is in the Lakehouse, you can read the three separate datasets into separate Spark DataFrames in the notebook. The file
paths in the following code use the parameters that you defined earlier.

Python
df_items = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv(f"{DATA_FOLDER}/raw/{ITEMS_FILE}")
.cache()
)

df_ratings = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv(f"{DATA_FOLDER}/raw/{RATINGS_FILE}")
.cache()
)

df_users = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv(f"{DATA_FOLDER}/raw/{USERS_FILE}")
.cache()
)

Step 2. Perform exploratory data analysis

Display raw data


You can explore each of the DataFrames by using the display command. This command allows you to view high-level statistics of the
DataFrames and understand how the different columns in the datasets relate to each other. Before exploring the datasets, use the
following code to import the required libraries.

Python

import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette() # adjusting plotting style
import pandas as pd # dataframes

Use the following code to look at the DataFrame that contains the book data:

Python

display(df_items, summary=True)

Add an _item_id column for later usage. The _item_id must be an integer for recommendation models. The following code uses
StringIndexer to transform ITEM_ID_COL to indices.

Python

df_items = (
StringIndexer(inputCol=ITEM_ID_COL, outputCol="_item_id")
.setHandleInvalid("skip")
.fit(df_items)
.transform(df_items)
.withColumn("_item_id", F.col("_item_id").cast("int"))
)

Display and check if the _item_id increases monotonically and successively as expected.

Python

display(df_items.sort(F.col("_item_id").desc()))

Use the following code to plot the top 10 authors with the maximum number of books. Agatha Christie is the leading author with over 600
books, followed by William Shakespeare.

Python
df_books = df_items.toPandas() # Create a pandas dataframe from the spark dataframe for visualization
plt.figure(figsize=(8,5))
sns.countplot(y="Book-Author",palette = 'Paired', data=df_books,order=df_books['Book-Author'].value_counts().index[0:10])
plt.title("Top 10 authors with maximum number of books")

Next, display the DataFrame that contains the user data:

Python

display(df_users, summary=True)

If there's a missing value in User-ID , drop the row with the missing value. It doesn't matter if your customized dataset has no missing values.

Python

df_users = df_users.dropna(subset=(USER_ID_COL))
display(df_users, summary=True)

Add a _user_id column for later usage. The _user_id must be an integer for recommendation models. In the following code, you use
StringIndexer to transform USER_ID_COL to indices.

The dataset already contains a User-ID column, which is an integer. However, adding a _user_id column for compatibility with different
datasets makes this example more robust. To add the _user_id column, use the following code:

Python

df_users = (
StringIndexer(inputCol=USER_ID_COL, outputCol="_user_id")
.setHandleInvalid("skip")
.fit(df_users)
.transform(df_users)
.withColumn("_user_id", F.col("_user_id").cast("int"))
)

Python

display(df_users.sort(F.col("_user_id").desc()))

To view the rating data, use the following code:

Python

display(df_ratings, summary=True)

Get the distinct ratings and save them to a list named ratings for later use.
Python

ratings = [i[0] for i in df_ratings.select(RATING_COL).distinct().collect()]


print(ratings)

To display the top 10 books with the highest ratings, use the following code:

Python

plt.figure(figsize=(8,5))
sns.countplot(y="Book-Title",palette = 'Paired',data= df_books, order=df_books['Book-Title'].value_counts().index[0:10])
plt.title("Top 10 books per number of ratings")

"Selected Poems" is most favorable among users according to ratings. The books "Adventures of Huckleberry Finn", "The Secret Garden",
and "Dracula", have the same rating.

Merge data
Merge the three DataFrames into one DataFrame for a more comprehensive analysis.

Python

df_all = df_ratings.join(df_users, USER_ID_COL, "inner").join(


df_items, ITEM_ID_COL, "inner"
)
df_all_columns = [
c for c in df_all.columns if c not in ["_user_id", "_item_id", RATING_COL]
]

# Reorders the columns to ensure that _user_id, _item_id, and Book-Rating are the first three columns
df_all = (
df_all.select(["_user_id", "_item_id", RATING_COL] + df_all_columns)
.withColumn("id", F.monotonically_increasing_id())
.cache()
)

display(df_all)

To display a count of the total distinct users, books, and interactions, use the following code:

Python

print(f"Total Users: {df_users.select('_user_id').distinct().count()}")


print(f"Total Items: {df_items.select('_item_id').distinct().count()}")
print(f"Total User-Item Interactions: {df_all.count()}")

Compute and plot most popular items


To compute the most popular books, use the following code. It displays the top 10 most popular books:

Python

# Compute top popular products


df_top_items = (
df_all.groupby(["_item_id"])
.count()
.join(df_items, "_item_id", "inner")
.sort(["count"], ascending=[0])
)

# Find top <topn> popular items


topn = 10
pd_top_items = df_top_items.limit(topn).toPandas()
pd_top_items.head(10)

 Tip

The <topn> can be used for Popular or "Top purchased" recommendation sections.

Python

# Plot top <topn> items


f, ax = plt.subplots(figsize=(10, 5))
plt.xticks(rotation="vertical")
sns.barplot(y=ITEM_INFO_COL, x="count", data=pd_top_items)
ax.tick_params(axis='x', rotation=45)
plt.xlabel("Number of Ratings for the Item")
plt.show()

Prepare training and testing datasets


Before training, you need to perform some data preparation steps for the ALS recommender. Use the following code to prepare the data.
The code performs the following actions:

Casts the rating column into the correct type.


Samples the training data with user ratings.
Split the data into training and test datasets.

Python

if IS_SAMPLE:
    # Must sort by '_user_id' before performing limit to ensure that ALS works normally
    # Note that if train and test datasets have no common _user_id, ALS will fail
    df_all = df_all.sort("_user_id").limit(SAMPLE_ROWS)

# Cast the rating column into the correct type
df_all = df_all.withColumn(RATING_COL, F.col(RATING_COL).cast("float"))

# Using a fraction between 0 and 1 returns the approximate size of the dataset, i.e., 0.8 means 80% of the dataset
# Rating = 0 means the user didn't rate the item, so it can't be used for training
# Use 80% of the dataset with rating > 0 as the training dataset
fractions_train = {0: 0}
fractions_test = {0: 0}
for i in ratings:
    if i == 0:
        continue
    fractions_train[i] = 0.8
    fractions_test[i] = 1
# Training dataset
train = df_all.sampleBy(RATING_COL, fractions=fractions_train)

# Joining with leftanti selects all rows from df_all with rating > 0 that are not in the train dataset, i.e., the remaining 20% of the dataset
# Test dataset
test = df_all.join(train, on="id", how="leftanti").sampleBy(
    RATING_COL, fractions=fractions_test
)

To gain a better understanding of the data and the problem at hand, use the following code to compute the sparsity of the dataset.
Sparsity refers to the situation in which feedback data is sparse and not sufficient to identify similarities in users' interests.

Python

# Compute the sparsity of the dataset


def get_mat_sparsity(ratings):
    # Count the total number of ratings in the dataset - used as the numerator
    count_nonzero = ratings.select(RATING_COL).count()
    print(f"Number of rows: {count_nonzero}")

    # Count the total number of distinct user_id and distinct product_id - used as the denominator
    total_elements = (
        ratings.select("_user_id").distinct().count()
        * ratings.select("_item_id").distinct().count()
    )

    # Calculate the sparsity by dividing the numerator by the denominator
    sparsity = (1.0 - (count_nonzero * 1.0) / total_elements) * 100
    print("The ratings dataframe is ", "%.4f" % sparsity + "% sparse.")

get_mat_sparsity(df_all)

Python

# Check the ID range
# ALS only supports values in the Integer range
print(f"max user_id: {df_all.agg({'_user_id': 'max'}).collect()[0][0]}")
print(f"max item_id: {df_all.agg({'_item_id': 'max'}).collect()[0][0]}")

Step 3. Develop and train the Model


You've explored the dataset, added unique IDs to users and items, and plotted top items. Next, train an Alternating Least Squares (ALS)
recommender to give users personalized recommendations.

Define the model


With the data prepared, you can now define the recommendation model. You train an Alternating Least Squares (ALS) recommender.

Spark ML provides a convenient API for building the ALS model. However, ALS on its own doesn't handle problems like data sparsity and
cold start (making recommendations when users or items are new) well. To improve the model's performance, you can combine
cross-validation and automatic hyperparameter tuning.

To import libraries required for training and evaluating the model, use the following code:

Python

# Import Spark required libraries


from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, TrainValidationSplit

# Specify the training parameters


num_epochs = 1 # number of epochs, here we use 1 to reduce the training time
rank_size_list = [64] # the values of rank in ALS for tuning
reg_param_list = [0.01, 0.1] # the values of regParam in ALS for tuning
model_tuning_method = "TrainValidationSplit" # TrainValidationSplit or CrossValidator
# Build the recommendation model using ALS on the training data
# Note that we set the cold start strategy to 'drop' to ensure that we don't get NaN evaluation metrics
als = ALS(
maxIter=num_epochs,
userCol="_user_id",
itemCol="_item_id",
ratingCol=RATING_COL,
coldStartStrategy="drop",
implicitPrefs=False,
nonnegative=True,
)

Train model and perform hyperparameter tuning


To search over the hyperparameters, use the following code to construct a grid of parameters. It also creates a regression evaluator that
uses the root mean square error as the evaluation metric.

Python

# Construct a grid search to select the best values for the training parameters
param_grid = (
ParamGridBuilder()
.addGrid(als.rank, rank_size_list)
.addGrid(als.regParam, reg_param_list)
.build()
)

print("Number of models to be tested: ", len(param_grid))

# Define the evaluator and set the loss function to root mean squared error (RMSE)
evaluator = RegressionEvaluator(
metricName="rmse", labelCol=RATING_COL, predictionCol="prediction"
)

Use the following code to initiate different model tuning methods based on the preconfigured parameters. For more information on model
tuning, see https://spark.apache.org/docs/latest/ml-tuning.html .

Python

# Build cross-validation using CrossValidator and TrainValidationSplit
if model_tuning_method == "CrossValidator":
    tuner = CrossValidator(
        estimator=als,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        numFolds=5,
        collectSubModels=True,
    )
elif model_tuning_method == "TrainValidationSplit":
    tuner = TrainValidationSplit(
        estimator=als,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        # 80% of the training data will be used for training, 20% for validation
        trainRatio=0.8,
        collectSubModels=True,
    )
else:
    raise ValueError(f"Unknown model_tuning_method: {model_tuning_method}")

Evaluate model
Models should be evaluated against the test data. If a model is well trained, it should have high metrics on that data.

If the model is overfitted, you may need to increase the size of the training data or reduce some of the redundant features. You may also
need to change the model's architecture or fine-tune its hyperparameters.

To define an evaluation function, use the following code:

7 Note
If the R-squared metric value is negative, it indicates that the trained model performs worse than a horizontal straight line, suggesting
that the data isn't explained by the trained model.

Python

def evaluate(model, data, verbose=0):
    """
    Evaluate the model by computing rmse, mae, r2 and variance over the data.
    """

    predictions = model.transform(data).withColumn(
        "prediction", F.col("prediction").cast("double")
    )

    if verbose > 1:
        # Show 10 predictions
        predictions.select("_user_id", "_item_id", RATING_COL, "prediction").limit(
            10
        ).show()

    # Initialize the regression evaluator
    evaluator = RegressionEvaluator(predictionCol="prediction", labelCol=RATING_COL)

    _evaluator = lambda metric: evaluator.setMetricName(metric).evaluate(predictions)

    rmse = _evaluator("rmse")
    mae = _evaluator("mae")
    r2 = _evaluator("r2")
    var = _evaluator("var")

    if verbose > 0:
        print(f"RMSE score = {rmse}")
        print(f"MAE score = {mae}")
        print(f"R2 score = {r2}")
        print(f"Explained variance = {var}")

    return predictions, (rmse, mae, r2, var)

Perform experiment tracking with MLflow


MLflow is used to track all the experiments and log parameters, metrics, and the models. To start training and evaluating models, use the
following code:

Python

from mlflow.models.signature import infer_signature

with mlflow.start_run(run_name="als"):
    # Train models
    models = tuner.fit(train)
    best_metrics = {"RMSE": 10e6, "MAE": 10e6, "R2": 0, "Explained variance": 0}
    best_index = 0
    # Evaluate models
    # Log model, metrics and parameters
    for idx, model in enumerate(models.subModels):
        with mlflow.start_run(nested=True, run_name=f"als_{idx}") as run:
            print("\nEvaluating on testing data:")
            print(f"subModel No. {idx + 1}")
            predictions, (rmse, mae, r2, var) = evaluate(model, test, verbose=1)

            signature = infer_signature(
                train.select(["_user_id", "_item_id"]),
                predictions.select(["_user_id", "_item_id", "prediction"]),
            )
            print("log model:")
            mlflow.spark.log_model(
                model,
                f"{EXPERIMENT_NAME}-alsmodel",
                signature=signature,
                registered_model_name=f"{EXPERIMENT_NAME}-alsmodel",
                dfs_tmpdir="Files/spark",
            )
            print("log metrics:")
            current_metric = {
                "RMSE": rmse,
                "MAE": mae,
                "R2": r2,
                "Explained variance": var,
            }
            mlflow.log_metrics(current_metric)
            if rmse < best_metrics["RMSE"]:
                best_metrics = current_metric
                best_index = idx

            print("log parameters:")
            mlflow.log_params(
                {
                    "subModel_idx": idx,
                    "num_epochs": num_epochs,
                    "rank_size_list": rank_size_list,
                    "reg_param_list": reg_param_list,
                    "model_tuning_method": model_tuning_method,
                    "DATA_FOLDER": DATA_FOLDER,
                }
            )
    # Log the best model and related metrics and parameters to the parent run
    mlflow.spark.log_model(
        models.subModels[best_index],
        f"{EXPERIMENT_NAME}-alsmodel",
        signature=signature,
        registered_model_name=f"{EXPERIMENT_NAME}-alsmodel",
        dfs_tmpdir="Files/spark",
    )
    mlflow.log_metrics(best_metrics)
    mlflow.log_params(
        {
            "subModel_idx": idx,
            "num_epochs": num_epochs,
            "rank_size_list": rank_size_list,
            "reg_param_list": reg_param_list,
            "model_tuning_method": model_tuning_method,
            "DATA_FOLDER": DATA_FOLDER,
        }
    )

To view the logged information for the training run, select the experiment named aisample-recommendation from your workspace. If you
changed the experiment name, select the experiment with the name you specified. The logged information appears similar to the following
image:

Step 4: Load the final model for scoring and make predictions
Once the training has completed and the best model is selected, load the model for scoring (sometimes called inferencing). The following
code loads the model and uses predictions to recommend the top 10 books for each user:

Python

# Load the best model


# Note that mlflow uses the PipelineModel to wrap the original model, thus we extract the original ALSModel from the stages
model_uri = f"models:/{EXPERIMENT_NAME}-alsmodel/1"
loaded_model = mlflow.spark.load_model(model_uri, dfs_tmpdir="Files/spark").stages[-1]

# Generate top 10 book recommendations for each user


userRecs = loaded_model.recommendForAllUsers(10)

# Represent the recommendations in an interpretable format


userRecs = (
userRecs.withColumn("rec_exp", F.explode("recommendations"))
.select("_user_id", F.col("rec_exp._item_id"), F.col("rec_exp.rating"))
.join(df_items.select(["_item_id", "Book-Title"]), on="_item_id")
)
userRecs.limit(10).show()

The output appears similar to the following table:

_item_id _user_id rating Book-Title

44865 7 7.9996786 Lasher: Lives of ...

786 7 6.2255826 The Piano Man's D...

45330 7 4.980466 State of Mind

38960 7 4.980466 All He Ever Wanted

125415 7 4.505084 Harry Potter and ...

44939 7 4.3579073 Taltos: Lives of ...

175247 7 4.3579073 The Bonesetter's ...

170183 7 4.228735 Living the Simple...

88503 7 4.221206 Island of the Blu...

32894 7 3.9031885 Winter Solstice

Save the predictions to the Lakehouse


To write the recommendations back to the Lakehouse, use the following code:

Python

# Code to save the userRecs into lakehouse


userRecs.write.format("delta").mode("overwrite").save(
f"{DATA_FOLDER}/predictions/userRecs"
)

Next steps
Training and evaluating a text classification model
Machine learning model in Microsoft Fabric
Train machine learning models
Machine learning experiments in Microsoft Fabric
Tutorial: Create, evaluate, and score a fraud detection model
Article • 09/21/2023

In this tutorial, you walk through the Synapse Data Science in Microsoft Fabric workflow with an end-to-end example. The scenario is to
build a fraud detection model by using machine learning algorithms trained on historical data, and then use the model to detect future
fraudulent transactions. The steps that you take are:

" Install custom libraries.


" Load the data.
" Understand and process the data through exploratory data analysis.
" Train a machine learning model by using scikit-learn, and track experiments by using MLflow and the Fabric autologging feature.
" Save and register the best-performing machine learning model.
" Load the machine learning model for scoring and making predictions.

) Important

Microsoft Fabric is in preview.

Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview) trial.

Sign in to Microsoft Fabric .

Switch to the Data Science experience by using the experience switcher icon on the left side of your home page.

If you don't have a Microsoft Fabric lakehouse, create one by following the steps in Create a lakehouse in Microsoft Fabric.

Follow along in the notebook


AIsample - Fraud Detection.ipynb is the notebook that accompanies this tutorial.

If you want to open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science to import
the tutorial notebooks to your workspace.

Or, if you'd rather copy and paste the code from this page, you can create a new notebook.

Be sure to attach a lakehouse to the notebook before you start running code.

Step 1: Install custom libraries


When you're developing a machine learning model or doing ad hoc data analysis, you might need to quickly install a custom library (such
as imblearn ) for your Apache Spark session. You can install libraries in one of two ways:

Use the inline installation capabilities (such as %pip or %conda ) of your notebook to install libraries in your current notebook only.
Install libraries directly in your workspace, so that the libraries are available for use by all notebooks in your workspace.

For more information on installing libraries, see Install Python libraries.

For this tutorial, you install the imblearn library in your notebook by using %pip install . When you run %pip install , the PySpark kernel
restarts. So you should install the library before you run any other cells in the notebook:
Python

# Use pip to install imblearn


%pip install imblearn

Step 2: Load the data


The fraud detection dataset contains credit card transactions that European cardholders made in September 2013 over the course of two
days. The dataset contains only numerical features, which is the result of a Principal Component Analysis (PCA) transformation that was
applied to the original features. The only features that haven't been transformed with PCA are Time and Amount . To protect confidentiality,
we can't provide the original features or more background information about the dataset.

Here are some details about the dataset:

The features V1 , V2 , V3 , …, V28 are the principal components obtained with PCA.
The feature Time contains the elapsed seconds between a transaction and the first transaction in the dataset.
The feature Amount is the transaction amount. You can use this feature for example-dependent, cost-sensitive learning.
The column Class is the response (target) variable and takes the value 1 for fraud and 0 otherwise.

Out of the 284,807 transactions, only 492 are fraudulent. The minority class (fraud) accounts for only about 0.172% of the data, so the
dataset is highly imbalanced.

The following table shows a preview of the creditcard.csv data:

Time V1 V2 V3 V4 V5 V6 V7

0 -1.3598071336738 -0.0727811733098497 2.53634673796914 1.37815522427443 -0.338320769942518 0.462387777762292 0.23959855406125

0 1.19185711131486 0.26615071205963 0.16648011335321 0.448154078460911 0.0600176492822243 -0.0823608088155687 -0.07880298333231

Introduction to SMOTE
The imblearn (imbalanced learn) library uses the Synthetic Minority Oversampling Technique (SMOTE) approach to address the problem of
imbalanced classification. Imbalanced classification happens when there are too few examples of the minority class for a model to
effectively learn the decision boundary.

SMOTE is the most widely used approach to synthesize new samples for the minority class. To learn more about SMOTE, see the scikit-learn
reference page for the SMOTE method and the scikit-learn user guide on oversampling .

 Tip

You can apply the SMOTE approach by using the imblearn library that you installed in Step 1.

Download the dataset and upload to the lakehouse


By defining the following parameters, you can easily apply your notebook on different datasets:

Python

IS_CUSTOM_DATA = False # If True, the dataset has to be uploaded manually

TARGET_COL = "Class" # Target column name


IS_SAMPLE = False # If True, use only <SAMPLE_ROWS> rows of data for training; otherwise, use all data
SAMPLE_ROWS = 5000 # If IS_SAMPLE is True, use only this number of rows for training

DATA_FOLDER = "Files/fraud-detection/" # Folder with data files


DATA_FILE = "creditcard.csv" # Data file name

EXPERIMENT_NAME = "aisample-fraud" # MLflow experiment name

The following code downloads a publicly available version of the dataset and then stores it in a Fabric lakehouse.

) Important
Be sure to add a lakehouse to the notebook before running it. If you don't, you'll get an error.

Python

if not IS_CUSTOM_DATA:
    # Download data files into the lakehouse if they're not already there
    import os, requests

    remote_url = "https://synapseaisolutionsa.blob.core.windows.net/public/Credit_Card_Fraud_Detection"
    fname = "creditcard.csv"
    download_path = f"/lakehouse/default/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        raise FileNotFoundError("Default lakehouse not found, please add a lakehouse and restart the session.")
    os.makedirs(download_path, exist_ok=True)
    if not os.path.exists(f"{download_path}/{fname}"):
        r = requests.get(f"{remote_url}/{fname}", timeout=30)
        with open(f"{download_path}/{fname}", "wb") as f:
            f.write(r.content)
    print("Downloaded demo data files into lakehouse.")

Set up MLflow experiment tracking


Experiment tracking is the process of saving all relevant experiment-related information for every experiment that you run. Sometimes, it's
easy to observe that there's no way to get better results when you're running a particular experiment. In such a situation, you're better off
simply stopping the experiment and trying a new one.

The Data Science experience in Microsoft Fabric includes an autologging feature. This feature reduces the amount of code required to
automatically log the parameters, metrics, and items of a machine learning model during training. The feature extends MLflow's
autologging capabilities and is deeply integrated into the Data Science experience.

By using autologging, you can easily track and compare the performance of different models and experiments without the need for manual
tracking. For more information, see Autologging in Microsoft Fabric .

You can disable Microsoft Fabric autologging in a notebook session by calling mlflow.autolog() and setting disable=True :

Python

# Set up MLflow for experiment tracking


import mlflow

mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(disable=True) # Disable MLflow autologging

Read raw data from the lakehouse


This code reads raw data from the lakehouse:

Python

df = (
spark.read.format("csv")
.option("header", "true")
.option("inferSchema", True)
.load(f"{DATA_FOLDER}/raw/{DATA_FILE}")
.cache()
)

Step 3: Perform exploratory data analysis


In this section, you begin by exploring the raw data and high-level statistics. You then transform the data by casting the columns into the
correct types and converting from a Spark DataFrame into a pandas DataFrame for easier visualization. Finally, you explore and visualize the
class distributions in the data.

Display the raw data


1. Explore the raw data and view high-level statistics by using the display command. For more information on data visualization, see
Notebook visualization in Microsoft Fabric .

Python

display(df)

2. Print some basic information about the dataset:

Python

# Print dataset basic information


print("records read: " + str(df.count()))
print("Schema: ")
df.printSchema()

Transform the data


1. Cast the dataset's columns into the correct types:

Python

import pyspark.sql.functions as F

df_columns = df.columns
df_columns.remove(TARGET_COL)

# Ensure that TARGET_COL is the last column
df = df.select(df_columns + [TARGET_COL]).withColumn(TARGET_COL, F.col(TARGET_COL).cast("int"))

if IS_SAMPLE:
    df = df.limit(SAMPLE_ROWS)

2. Convert Spark DataFrame to pandas DataFrame for easier visualization and processing:

Python

df_pd = df.toPandas()

Explore the class distribution in the dataset


1. Display the class distribution in the dataset:

Python

# The distribution of classes in the dataset


print('No Frauds', round(df_pd['Class'].value_counts()[0]/len(df_pd) * 100,2), '% of the dataset')
print('Frauds', round(df_pd['Class'].value_counts()[1]/len(df_pd) * 100,2), '% of the dataset')

The code returns the following class distribution of the dataset: 99.83% No Frauds and 0.17% Frauds . This class distribution shows
that most of the transactions are nonfraudulent. Therefore, data preprocessing is required before model training, to avoid overfitting.

2. Use a plot to show the class imbalance in the dataset, by viewing the distribution of fraudulent versus nonfraudulent transactions:

Python

import seaborn as sns


import matplotlib.pyplot as plt

colors = ["#0101DF", "#DF0101"]


sns.countplot(x='Class', data=df_pd, palette=colors)
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)', fontsize=10)

3. Show the five-number summary (minimum score, first quartile, median, third quartile, and maximum score) for the transaction
amount, by using box plots:

Python
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 5))
s = sns.boxplot(ax=ax1, x="Class", y="Amount", hue="Class", data=df_pd, palette="PRGn", showfliers=True)   # Keep outliers in the plot
s = sns.boxplot(ax=ax2, x="Class", y="Amount", hue="Class", data=df_pd, palette="PRGn", showfliers=False)  # Remove outliers from the plot
plt.show()

When the data is highly imbalanced, these box plots might not demonstrate accurate insights. Alternatively, you can address the class
imbalance problem first and then create the same plots for more accurate insights.
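
For example, a minimal sketch of that alternative (assuming the df_pd DataFrame from the previous cells; the tutorial itself applies SMOTE only to the training split in Step 4) would rebalance the data first and then redraw the box plot:

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE

# Rebalance the full pandas DataFrame (illustrative only)
X_bal, y_bal = SMOTE(random_state=42).fit_resample(
    df_pd.drop(columns=["Class"]), df_pd["Class"]
)
df_bal = pd.concat([X_bal, y_bal], axis=1)

# Redraw the Amount box plot on the balanced data
sns.boxplot(x="Class", y="Amount", hue="Class", data=df_bal, palette="PRGn", showfliers=False)
plt.show()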

Step 4: Train and evaluate models


In this section, you train a LightGBM model to classify the fraud transactions. You train a LightGBM model on both the imbalanced dataset
and the balanced dataset (via SMOTE). Then, you compare the performance of both models.

Prepare training and test datasets


Before training, split the data into the training and test datasets:

Python

# Split the dataset into training and test sets


from sklearn.model_selection import train_test_split

train, test = train_test_split(df_pd, test_size=0.15)


feature_cols = [c for c in df_pd.columns.tolist() if c not in [TARGET_COL]]

Apply SMOTE to the training data to synthesize new samples for the minority class
Apply SMOTE only to the training dataset, and not to the test dataset. When you score the model with the test data, you want an
approximation of the model's performance on unseen data in production. For this approximation to be valid, your test data needs to
represent production data as closely as possible by having the original imbalanced distribution.

Python

# Apply SMOTE to the training data


import pandas as pd
from collections import Counter
from imblearn.over_sampling import SMOTE

X = train[feature_cols]
y = train[TARGET_COL]
print("Original dataset shape %s" % Counter(y))

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print("Resampled dataset shape %s" % Counter(y_res))

new_train = pd.concat([X_res, y_res], axis=1)

Train machine learning models and run experiments


Apache Spark in Microsoft Fabric enables machine learning with big data. By using Apache Spark, you can get valuable insights from large
amounts of structured, unstructured, and fast-moving data.

There are several options for training machine learning models by using Apache Spark in Microsoft Fabric: Apache Spark MLlib, SynapseML,
and other open-source libraries. For more information, see Train machine learning models in Microsoft Fabric .

A machine learning experiment is the primary unit of organization and control for all related machine learning runs. A run corresponds to a
single execution of model code. Machine learning experiment tracking refers to the process of managing all the experiments and their
components, such as parameters, metrics, models, and other artifacts.

When you use experiment tracking, you can organize all the required components of a specific machine learning experiment. You can also
easily reproduce past results, by using saved experiments. For more information on machine learning experiments, see Machine learning
experiments in Microsoft Fabric .
1. Update the MLflow autologging configuration to track more metrics, parameters, and files, by setting exclusive=False :

Python

mlflow.autolog(exclusive=False)

2. Train two models by using LightGBM: one model on the imbalanced dataset and the other on the balanced dataset (via SMOTE). Then
compare the performance of the two models.

Python

import lightgbm as lgb

model = lgb.LGBMClassifier(objective="binary") # Imbalanced dataset


smote_model = lgb.LGBMClassifier(objective="binary") # Balanced dataset

Python

# Train LightGBM for both imbalanced and balanced datasets and define the evaluation metrics
print("Start training with imbalanced data:\n")
with mlflow.start_run(run_name="raw_data") as raw_run:
    model = model.fit(
        train[feature_cols],
        train[TARGET_COL],
        eval_set=[(test[feature_cols], test[TARGET_COL])],
        eval_metric="auc",
        callbacks=[
            lgb.log_evaluation(10),
        ],
    )

print(f"\n\nStart training with balanced data:\n")
with mlflow.start_run(run_name="smote_data") as smote_run:
    smote_model = smote_model.fit(
        new_train[feature_cols],
        new_train[TARGET_COL],
        eval_set=[(test[feature_cols], test[TARGET_COL])],
        eval_metric="auc",
        callbacks=[
            lgb.log_evaluation(10),
        ],
    )

Determine feature importance for training


1. Determine feature importance for the model that you trained on the imbalanced dataset:

Python

with mlflow.start_run(run_id=raw_run.info.run_id):
    importance = lgb.plot_importance(
        model, title="Feature importance for imbalanced data"
    )
    importance.figure.savefig("feauture_importance.png")
    mlflow.log_figure(importance.figure, "feature_importance.png")

2. Determine feature importance for the model that you trained on balanced data (generated via SMOTE):

Python

with mlflow.start_run(run_id=smote_run.info.run_id):
    smote_importance = lgb.plot_importance(
        smote_model, title="Feature importance for balanced (via SMOTE) data"
    )
    smote_importance.figure.savefig("feauture_importance_smote.png")
    mlflow.log_figure(smote_importance.figure, "feauture_importance_smote.png")

The important features are drastically different when you train a model with the imbalanced dataset versus the balanced dataset.

Evaluate the models


In this section, you evaluate the two trained models:

model trained on raw, imbalanced data

smote_model trained on balanced data

Compute model metrics


1. Define a prediction_to_spark function that performs predictions and converts the prediction results into a Spark DataFrame. You can
later compute model statistics on the prediction results by using SynapseML .

Python

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, DoubleType

def prediction_to_spark(model, test):
    predictions = model.predict(test[feature_cols], num_iteration=model.best_iteration_)
    predictions = tuple(zip(test[TARGET_COL].tolist(), predictions.tolist()))
    dataColumns = [TARGET_COL, "prediction"]
    predictions = (
        spark.createDataFrame(data=predictions, schema=dataColumns)
        .withColumn(TARGET_COL, col(TARGET_COL).cast(IntegerType()))
        .withColumn("prediction", col("prediction").cast(DoubleType()))
    )

    return predictions

2. Use the prediction_to_spark function to perform predictions with the two models, model and smote_model :

Python

predictions = prediction_to_spark(model, test)
smote_predictions = prediction_to_spark(smote_model, test)
predictions.limit(10).toPandas()

3. Compute metrics for the two models:

Python

from synapse.ml.train import ComputeModelStatistics

metrics = ComputeModelStatistics(
evaluationMetric="classification", labelCol=TARGET_COL, scoredLabelsCol="prediction"
).transform(predictions)

smote_metrics = ComputeModelStatistics(
evaluationMetric="classification", labelCol=TARGET_COL, scoredLabelsCol="prediction"
).transform(smote_predictions)
display(metrics)

Evaluate model performance by using a confusion matrix

A confusion matrix displays the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) that a model
produces when it's scored with test data. For binary classification, you get a 2x2 confusion matrix. For multiclass classification, you get an
nxn confusion matrix, where n is the number of classes.
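
The cells of the matrix can also be converted into familiar rates. The following is a minimal sketch; the [[TN, FP], [FN, TP]] layout is an assumption that you should verify against the SynapseML confusion matrix output before relying on the numbers:

Python

import numpy as np

def summarize_confusion_matrix(cm):
    """Derive common rates from a 2x2 confusion matrix.

    Assumes the layout [[TN, FP], [FN, TP]]; verify the orientation
    of your library's confusion matrix before using these values.
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # positive predictive value
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # true positive rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0        # false positive rate
    return {"precision": precision, "recall": recall, "false_positive_rate": fpr}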

1. Use a confusion matrix to summarize the performances of the trained machine learning models on the test data:

Python

# Collect confusion matrix values
cm = metrics.select("confusion_matrix").collect()[0][0].toArray()
smote_cm = smote_metrics.select("confusion_matrix").collect()[0][0].toArray()
print(cm)

2. Plot the confusion matrix for the predictions of smote_model (trained on balanced data):

Python
# Plot the confusion matrix
import seaborn as sns

def plot(cm):
"""
Plot the confusion matrix.
"""
sns.set(rc={"figure.figsize": (5, 3.5)})
ax = sns.heatmap(cm, annot=True, fmt=".20g")
ax.set_title("Confusion Matrix")
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
return ax

with mlflow.start_run(run_id=smote_run.info.run_id):
ax = plot(smote_cm)
mlflow.log_figure(ax.figure, "ConfusionMatrix.png")

3. Plot the confusion matrix for the predictions of model (trained on raw, imbalanced data):

Python

with mlflow.start_run(run_id=raw_run.info.run_id):
ax = plot(cm)
mlflow.log_figure(ax.figure, "ConfusionMatrix.png")

Evaluate model performance by using AUC-ROC and AUPRC measures

The Area Under the Curve Receiver Operating Characteristic (AUC-ROC) measure is widely used to assess the performance of binary
classifiers. AUC-ROC summarizes the ROC curve, which visualizes the trade-off between the true positive rate (TPR) and the false positive
rate (FPR).

In some cases, it's more appropriate to evaluate your classifier based on the Area Under the Precision-Recall Curve (AUPRC) measure.
AUPRC summarizes the precision-recall curve, which combines these rates:

The precision, also called the positive predictive value (PPV)


The recall, also called TPR
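
If you also want to plot the precision-recall trade-off itself rather than only its summary value, the following is a minimal sketch that uses scikit-learn on the pandas test set. It assumes the model , test , feature_cols , and TARGET_COL variables from the earlier steps:

Python

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Positive-class probabilities from the LightGBM model trained on imbalanced data
probabilities = model.predict_proba(test[feature_cols])[:, 1]
precision, recall, _ = precision_recall_curve(test[TARGET_COL], probabilities)

plt.plot(recall, precision)
plt.xlabel("Recall (TPR)")
plt.ylabel("Precision (PPV)")
plt.title("Precision-recall curve")
plt.show()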

To evaluate performance by using the AUC-ROC and AUPRC measures:

1. Define a function that returns the AUC-ROC and AUPRC measures:

Python

from pyspark.ml.evaluation import BinaryClassificationEvaluator

def evaluate(predictions):
    """
    Evaluate the model by computing AUROC and AUPRC with the predictions.
    """
    # Initialize the binary evaluator
    evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol=TARGET_COL)
    _evaluator = lambda metric: evaluator.setMetricName(metric).evaluate(predictions)

    # Calculate AUROC, baseline 0.5
    auroc = _evaluator("areaUnderROC")
    print(f"The AUROC is: {auroc:.4f}")

    # Calculate AUPRC, baseline positive rate (0.172% in the data)
    auprc = _evaluator("areaUnderPR")
    print(f"The AUPRC is: {auprc:.4f}")

    return auroc, auprc

2. Log the AUC-ROC and AUPRC metrics for the model trained on imbalanced data:

Python

with mlflow.start_run(run_id=raw_run.info.run_id):
auroc, auprc = evaluate(predictions)
mlflow.log_metrics({"AUPRC": auprc, "AUROC": auroc})
mlflow.log_params({"Data_Enhancement": "None", "DATA_FILE": DATA_FILE})

3. Log the AUC-ROC and AUPRC metrics for the model trained on balanced data:

Python

with mlflow.start_run(run_id=smote_run.info.run_id):
auroc, auprc = evaluate(smote_predictions)
mlflow.log_metrics({"AUPRC": auprc, "AUROC": auroc})
mlflow.log_params({"Data_Enhancement": "SMOTE", "DATA_FILE": DATA_FILE})

The model trained on balanced data returns higher AUC-ROC and AUPRC values compared to the model trained on imbalanced data.
Based on these measures, SMOTE appears to be an effective technique for enhancing model performance when you're working with highly
imbalanced data.

As shown in the following image, each run is logged along with its respective name. You can track the parameters and performance metrics
of each run in your workspace.

The following image also shows the performance metrics for the model trained on the balanced dataset (Version 2). You can select Version 1
to see the metrics for the model trained on the imbalanced dataset. When you compare the metrics, you notice that AUROC is higher for
the model trained on the balanced dataset. These results indicate that this model is better at correctly predicting 0 classes as 0 and
predicting 1 classes as 1 .
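
You can also retrieve the same comparison programmatically. The following is a minimal sketch that uses the MLflow client; it assumes the runs were logged to the currently active experiment, and the column names follow the pandas DataFrame that mlflow.search_runs returns together with the metric keys logged earlier:

Python

import mlflow

# Retrieve all runs in the active experiment as a pandas DataFrame and compare key metrics
runs = mlflow.search_runs()
print(runs[["tags.mlflow.runName", "metrics.AUROC", "metrics.AUPRC"]])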

You might also like