Fabric Data Science
Microsoft Fabric offers Data Science experiences to empower users to complete end-to-
end data science workflows for the purpose of data enrichment and business insights.
You can complete a wide range of activities across the entire data science process, all
the way from data exploration, preparation and cleansing to experimentation, modeling,
model scoring and serving of predictive insights to BI reports.
Microsoft Fabric users can access a Data Science Home page. From there, they can
discover and access various relevant resources. For example, they can create machine
learning Experiments, Models and Notebooks. They can also import existing Notebooks
on the Data Science Home page.
You might already know how a typical data science process works: it's a well-known, iterative process that most machine learning projects follow.
This article describes the Microsoft Fabric Data Science capabilities from a data science
process perspective. For each step in the data science process, this article summarizes
the Microsoft Fabric capabilities that can help.
Users can easily read data from a lakehouse directly into a pandas DataFrame, which makes seamless data reads from OneLake possible for exploration.
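For example, a minimal read of a lakehouse Delta table into a pandas DataFrame might look like the following sketch; the table name churn_data is an assumption:

Python

# Read a Delta table from the attached lakehouse into pandas
# ("churn_data" is an illustrative table name)
df = spark.read.format("delta").load("Tables/churn_data").toPandas()
df.head()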
A powerful set of tools is available for data ingestion and data orchestration pipelines
with data integration pipelines - a natively integrated part of Microsoft Fabric. Easy-to-
build data pipelines can access and transform the data into a format that machine
learning can consume.
Data exploration
An important part of the machine learning process is to understand data through
exploration and visualization.
Depending on the data storage location, Microsoft Fabric offers a set of different tools
to explore and prepare the data for analytics and machine learning. Notebooks become
one of the quickest ways to get started with data exploration.
ML algorithms and libraries can help train machine learning models, and library management tools can install these libraries and algorithms. Users therefore have the option to leverage a large variety of popular machine learning libraries to complete their ML model training in Microsoft Fabric. Additionally, popular libraries like scikit-learn can also be used to develop models.
MLflow experiments and runs can track the ML model training. Microsoft Fabric offers a
built-in MLflow experience with which users can interact, to log experiments and
models. Learn more about how to use MLflow to track experiments and manage models
in Microsoft Fabric.
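As a minimal sketch, logging a run to a Fabric experiment with MLflow might look like the following; the experiment name, parameter, and metric values are illustrative:

Python

import mlflow

mlflow.set_experiment("sample-experiment")  # illustrative experiment name
mlflow.autolog()

with mlflow.start_run(run_name="sample-run"):
    mlflow.log_param("learning_rate", 0.07)
    mlflow.log_metric("accuracy", 0.91)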
SynapseML
The SynapseML (previously known as MMLSpark) open-source library, that Microsoft
owns and maintains, simplifies massively scalable machine learning pipeline creation. As
a tool ecosystem, it expands the Apache Spark framework in several new directions.
SynapseML unifies several existing machine learning frameworks and new Microsoft
algorithms into a single, scalable API. The open-source SynapseML library includes a rich
ecosystem of ML tools for development of predictive models, as well as leveraging pre-
trained AI models from Azure AI services. Learn more about SynapseML .
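As a minimal sketch, SynapseML estimators plug into standard Spark ML workflows; the DataFrame and column names here are assumptions:

Python

from synapse.ml.lightgbm import LightGBMClassifier

# Train a distributed LightGBM model on a Spark DataFrame
# (assumes train_df has a "features" vector column and a "label" column)
model = LightGBMClassifier(objective="binary", featuresCol="features", labelCol="label").fit(train_df)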
Gain insights
In Microsoft Fabric, predicted values can easily be written to OneLake and seamlessly consumed from Power BI reports with Power BI Direct Lake mode. This makes it easy for data science practitioners to share results from their work with stakeholders, and it also simplifies operationalization.
Notebooks that contain batch scoring can be scheduled to run using the Notebook
scheduling capabilities. Batch scoring can also be scheduled as part of data pipeline
activities or Spark jobs. Power BI automatically gets the latest predictions without need
for loading or refresh of the data, thanks to the Direct lake mode in Microsoft Fabric.
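For example, batch predictions might be persisted to a lakehouse Delta table that Power BI then reads through Direct Lake; the DataFrame and table name below are assumptions:

Python

# Write scored results to a Delta table in the lakehouse for Power BI consumption
# ("predictions" and "customer_predictions" are illustrative names)
predictions.write.format("delta").mode("overwrite").save("Tables/customer_predictions")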
With semantic link (SemPy), data scientists can also work with Power BI semantic models directly from their notebooks. Among other benefits, they can:
avoid the need to re-implement business logic and domain knowledge in their code
easily access and use Power BI measures in their code
use semantics to power new experiences, such as semantic functions
explore and validate functional dependencies and relationships between data
increased productivity and faster collaboration across teams that operate on the
same datasets
increased cross-collaboration across business intelligence and AI teams
reduced ambiguity and an easier learning curve when onboarding onto a new
model or dataset
Next steps
Get started with end-to-end data science samples, see Data Science Tutorials
Learn more about data preparation and cleansing with Data Wrangler, see Data
Wrangler
Learn more about tracking experiments, see Machine learning experiment
Learn more about managing models, see Machine learning model
Learn more about batch scoring with Predict, see Score models with PREDICT
Serve predictions from Lakehouse to Power BI with Direct lake Mode
This set of tutorials demonstrates a sample end-to-end scenario in the Fabric data
science experience. You'll implement each step from data ingestion, cleansing, and
preparation, to training machine learning models and generating insights, and then
consume those insights using visualization tools like Power BI.
Introduction
The lifecycle of a Data science project typically includes (often, iteratively) the following
steps:
Business understanding
Data acquisition
Data exploration, cleansing, preparation, and visualization
Model training and experiment tracking
Model scoring and generating insights.
The goals and success criteria of each stage depend on collaboration, data sharing and
documentation. The Fabric data science experience consists of multiple natively built features that enable collaboration, data acquisition, sharing, and consumption in a seamless way.
In these tutorials, you take the role of a data scientist who has been given the task to
explore, clean, and transform a dataset containing taxicab trip data. You then build a
machine learning model to predict trip duration at scale on a large dataset.
6. Register and track trained models using MLflow and the Fabric UI.
7. Run scoring at scale and save predictions and inference results to the lakehouse.
Architecture
In this tutorial series, we showcase a simplified end-to-end data science scenario that
involves:
Explore, clean, and prepare - The data science experience on Fabric supports data cleansing, transformation, exploration, and featurization by using built-in experiences on Spark, as well as Python-based tools like Data Wrangler and the SemPy library. This tutorial showcases data exploration using the Python library seaborn, and data cleansing and preparation using Apache Spark.
Models and experiments - Fabric enables you to train, evaluate, and score machine
learning models by using built-in experiment and model items with seamless integration
with MLflow for experiment tracking and model registration/deployment. Fabric also
features capabilities for model prediction at scale (PREDICT) to gain and share business
insights.
Storage - Fabric standardizes on Delta Lake , which means all the engines of Fabric can
interact with the same dataset stored in a lakehouse. This storage layer allows you to
store both structured and unstructured data that support both file-based storage and
tabular format. The datasets and files stored can be easily accessed via all Fabric
experience items like notebooks and pipelines.
Expose analysis and insights - Data from a lakehouse can be consumed by Power BI, an industry-leading business intelligence tool, for reporting and visualization. Data
persisted in the lakehouse can also be visualized in notebooks using Spark or Python
native visualization libraries like matplotlib , seaborn , plotly , and more. Data can also
be visualized using the SemPy library that supports built-in rich, task-specific
visualizations for the semantic data model, for dependencies and their violations, and
for classification and regression use cases.
Next steps
Prepare your system for the data science tutorial
Prepare your system for data science
tutorials
Article • 09/26/2023
Before you begin the data science end-to-end tutorial series, learn about prerequisites,
how to import notebooks, and how to attach a lakehouse to those notebooks.
Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview)
trial.
Switch to the Data Science experience by using the experience switcher icon on the
left side of your home page.
If you don't have a Microsoft Fabric lakehouse, create one by following the steps in
Create a lakehouse in Microsoft Fabric.
1. Download your notebook(s). Make sure to download the files by using the "Raw"
file link in GitHub.
For the Get started notebooks, download the notebook(.ipynb) files from the
parent folder: data-science-tutorial .
For the Tutorials notebooks, download the notebooks(.ipynb) files from the
parent folder ai-samples .
2. On the Data science experience homepage, select Import notebook and upload
the notebook files that you downloaded in step 1.
3. Once the notebooks are imported, select Go to workspace in the import dialog
box.
4. The imported notebooks are now available in your workspace for use.
5. If the imported notebook includes output, select the Edit menu, then select Clear
all outputs.
2. Select Add lakehouse in the left pane and select Existing lakehouse to open the
Data hub dialog box.
3. Select the workspace and the lakehouse you intend to use with these tutorials and
select Add.
4. Once a lakehouse is added, it's visible in the lakehouse pane in the notebook UI
where tables and files stored in the lakehouse can be viewed.
Note
Before executing each notebook, you need to perform these steps on that notebook.
Next steps
Part 1: Ingest data into Fabric lakehouse using Apache Spark
In this tutorial, you'll ingest data into Fabric lakehouses in delta lake format. Some important terms to understand:
Lakehouse - A lakehouse is a collection of files/folders/tables that represent a database over a data lake used by the Spark engine and
SQL engine for big data processing and that includes enhanced capabilities for ACID transactions when using the open-source Delta
formatted tables.
Delta Lake - Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata management, and batch and
streaming data processing to Apache Spark. A Delta Lake table is a data table format that extends Parquet data files with a file-based
transaction log for ACID transactions and scalable metadata management.
Azure Open Datasets are curated public datasets you can use to add scenario-specific features to machine learning solutions for more
accurate models. Open Datasets are in the cloud on Microsoft Azure Storage and can be accessed by various methods including
Apache Spark, REST API, Data factory, and other tools.
Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview) trial.
Switch to the Data Science experience by using the experience switcher icon on the left side of your home page.
Add a lakehouse to this notebook. You'll be downloading data from a public blob, then storing the data in the lakehouse.
If you want to open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science to import
the tutorial notebooks to your workspace.
Or, if you'd rather copy and paste the code from this page, you can create a new notebook.
Be sure to attach a lakehouse to the notebook before you start running code.
The dataset also includes columns such as row number, customer ID, and customer surname that should have no impact on a customer's decision to leave the bank.
The event that defines the customer's churn is the closing of the customer's bank account. The column Exited in the dataset refers to the customer's abandonment. There isn't much context available about these attributes, so you have to proceed without background information about the dataset. The aim is to understand how these attributes contribute to the Exited status.
"CustomerID" "Surname" "CreditScore" "Geography" "Gender" "Age" "Tenure" "Balance" "NumOfProducts" "HasCrCard" "IsActiveMemb
Tip
By defining the following parameters, you can use this notebook with different datasets easily.
Python
DATA_ROOT = "/lakehouse/default"
DATA_FOLDER = "Files/churn" # folder with data files
DATA_FILE = "churn.csv" # data file name
This code downloads a publicly available version of the dataset and then stores it in a Fabric lakehouse.
Important
Make sure you add a lakehouse to the notebook before running it. Failure to do so will result in an error.
Python
import os
import requests

# remote_url, file_list, and download_path must be defined earlier in the notebook (not shown in this excerpt)
if not os.path.exists("/lakehouse/default"):
    raise FileNotFoundError(
        "Default lakehouse not found, please add a lakehouse and restart the session."
    )
os.makedirs(download_path, exist_ok=True)
for fname in file_list:
    if not os.path.exists(f"{download_path}/{fname}"):
        r = requests.get(f"{remote_url}/{fname}", timeout=30)
        with open(f"{download_path}/{fname}", "wb") as f:
            f.write(r.content)
print("Downloaded demo data files into lakehouse.")
Next steps
You'll use the data you just ingested in:
In this tutorial, you'll learn how to conduct exploratory data analysis (EDA) to examine
and investigate the data while summarizing its key characteristics through the use of
data visualization techniques.
You'll use seaborn , a Python data visualization library that provides a high-level interface
for building visuals on dataframes and arrays. For more information about seaborn , see
Seaborn: Statistical Data Visualization .
You'll also use Data Wrangler, a notebook-based tool that provides you with an
immersive experience to conduct exploratory data analysis and cleaning.
Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview)
trial.
Switch to the Data Science experience by using the experience switcher icon on the
left side of your home page.
This is part 2 of 5 in the tutorial series. To complete this tutorial, first complete:
Part 1: Ingest data into a Microsoft Fabric lakehouse using Apache Spark.
If you want to open the accompanying notebook for this tutorial, follow the instructions
in Prepare your system for data science to import the tutorial notebooks to your
workspace.
Or, if you'd rather copy and paste the code from this page, you can create a new
notebook.
Be sure to attach a lakehouse to the notebook before you start running code.
Python
df = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv("Files/churn/raw/churn.csv")
.cache()
)
Python
df = df.toPandas()
Python
Python
display(df, summary=True)
Data Wrangler cannot be opened while the notebook kernel is busy. The cell execution must complete before you launch Data Wrangler.
1. Under the notebook ribbon Data tab, select Launch Data Wrangler. You'll see a list
of activated pandas DataFrames available for editing.
2. Select the DataFrame you wish to open in Data Wrangler. Since this notebook only
contains one DataFrame, pd , select pd .
Data Wrangler launches and generates a descriptive overview of your data. The table in
the middle shows each data column. The Summary panel next to the table shows
information about the DataFrame. When you select a column in the table, the summary
updates with information about the selected column. In some instances, the data
displayed and summarized will be a truncated view of your DataFrame. When this
happens, you'll see a warning image in the summary pane. Hover over this warning to
view text explaining the situation.
Each operation you do can be applied in a matter of clicks, updating the data display in
real time and generating code that you can save back to your notebook as a reusable
function.
The rest of this section walks you through the steps to perform data cleaning with Data
Wrangler.
2. A panel appears for you to select the list of columns you want to compare to
define a duplicate row. Select RowNumber and CustomerId.
In the middle panel is a preview of the results of this operation. Under the preview
is the code to perform the operation. In this instance, the data appears to be
unchanged. But since you're looking at a truncated view, it's a good idea to still
apply the operation.
3. Select Apply (either at the side or at the bottom) to go to the next step.
Select Add code to notebook at the top left to close Data Wrangler and add the code automatically. The added code is wrapped in a function, which is then called.
Tip
The code generated by Data Wrangler won't be applied until you manually run the
new cell.
If you didn't use Data Wrangler, you can instead use this next code cell.
This code is similar to the code produced by Data Wrangler, but adds the argument inplace=True to each of the generated steps. By setting inplace=True, pandas overwrites the original DataFrame instead of producing a new DataFrame as output.
Python
# Define a new function that includes all of the above Data Wrangler operations
def clean_data(df):
# Drop rows with missing data across all columns
df.dropna(inplace=True)
# Drop duplicate rows in columns: 'RowNumber', 'CustomerId'
df.drop_duplicates(subset=['RowNumber', 'CustomerId'], inplace=True)
# Drop columns: 'RowNumber', 'CustomerId', 'Surname'
df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
return df
df_clean = clean_data(df.copy())
df_clean.head()
Python
Python
import seaborn as sns
import matplotlib.pyplot as plt

# numeric_variables is defined in an earlier cell (not shown in this excerpt)
df_num_cols = df_clean[numeric_variables]
sns.set(font_scale=0.7)
fig, axes = plt.subplots(nrows=2, ncols=3, gridspec_kw=dict(hspace=0.3), figsize=(17, 8))
fig.tight_layout()
for ax, col in zip(axes.flatten(), df_num_cols.columns):
    sns.boxplot(x=df_num_cols[col], color='green', ax=ax)
fig.delaxes(axes[1, 2])
Distribution of exited and nonexited customers
Show the distribution of exited versus nonexited customers across the categorical
attributes.
Python
Python
import itertools
import matplotlib.pyplot as plt

columns = df_num_cols.columns[: len(df_num_cols.columns)]
fig = plt.figure()
fig.set_size_inches(18, 8)
length = len(columns)
for i, j in itertools.zip_longest(columns, range(length)):
    plt.subplot((length // 2), 3, j + 1)
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    df_num_cols[i].hist(bins=20, edgecolor='black')
    plt.title(i)
plt.show()
Python
df_clean["NewTenure"] = df_clean["Tenure"]/df_clean["Age"]
df_clean["NewCreditsScore"] = pd.qcut(df_clean['CreditScore'], 6, labels =
[1, 2, 3, 4, 5, 6])
df_clean["NewAgeScore"] = pd.qcut(df_clean['Age'], 8, labels = [1, 2, 3, 4,
5, 6, 7, 8])
df_clean["NewBalanceScore"] =
pd.qcut(df_clean['Balance'].rank(method="first"), 5, labels = [1, 2, 3, 4,
5])
df_clean["NewEstSalaryScore"] = pd.qcut(df_clean['EstimatedSalary'], 10,
labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
You could copy the generated code, close Data Wrangler to return to the notebook,
then paste into a new cell. Or, select Add code to notebook at the top left to close Data
Wrangler and add the code automatically.
If you didn't use Data Wrangler, you can instead use this next code cell:
Python
import pandas as pd
def clean_data(df_clean):
# One-hot encode columns: 'Geography', 'Gender'
df_clean = pd.get_dummies(df_clean, columns=['Geography', 'Gender'])
return df_clean
df_clean_1 = clean_data(df_clean.copy())
df_clean_1.head()
table_name = "df_clean"
# Create Spark DataFrame from pandas
sparkDF=spark.createDataFrame(df_clean_1)
sparkDF.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")
Next step
Train and register machine learning models with this data:
In this tutorial, you'll learn to train multiple machine learning models to select the best
one in order to predict which bank customers are likely to leave.
MLflow is an open source platform for managing the machine learning lifecycle with
features like Tracking, Models, and Model Registry. MLflow is natively integrated with
the Fabric Data Science experience.
Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview)
trial.
Switch to the Data Science experience by using the experience switcher icon on the
left side of your home page.
This is part 3 of 5 in the tutorial series. To complete this tutorial, first complete:
Part 1: Ingest data into a Microsoft Fabric lakehouse using Apache Spark.
Part 2: Explore and visualize data using Microsoft Fabric notebooks to learn more
about the data.
If you want to open the accompanying notebook for this tutorial, follow the instructions
in Prepare your system for data science to import the tutorial notebooks to your
workspace.
Or, if you'd rather copy and paste the code from this page, you can create a new
notebook.
Be sure to attach a lakehouse to the notebook before you start running code.
Important
In this step you install imblearn, which provides the Synthetic Minority Oversampling Technique (SMOTE) used when dealing with imbalanced datasets. The PySpark kernel will be restarted after %pip install, so you'll need to install the library before you run any other cells.
You'll access SMOTE using the imblearn library. Install it now using the in-line installation capabilities (for example, %pip or %conda).
Python
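# In-line installation of imblearn for SMOTE (this restarts the PySpark kernel)
%pip install imblearn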
Tip
When you install a library in a notebook, it is only available for the duration of the
notebook session and not in the workspace. If you restart the notebook, you'll need
to install the library again. If you have a library you often use, you could instead
install it in your workspace to make it available to all notebooks in your workspace
without further installs.
Python
import pandas as pd
SEED = 12345
df_clean = spark.read.format("delta").load("Tables/df_clean").toPandas()
Python
import mlflow
# Setup experiment name
EXPERIMENT_NAME = "bank-churn-experiment" # MLflow experiment name
Microsoft Fabric extends the MLflow autologging capabilities. Autologging automatically captures the values of input parameters and output metrics of a machine learning model as it's being trained. This information is then logged to your workspace, where it can be accessed and visualized using the MLflow APIs or the corresponding experiment in your workspace.
All the experiments with their respective names are logged and you'll be able to track
their parameters and performance metrics. To learn more about autologging, see
Autologging in Microsoft Fabric .
mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(exclusive=False)
Python
Python
y = df_clean["Exited"]
X = df_clean.drop("Exited",axis=1)
# Split the dataset to 60%, 20%, 20% for training, validation, and test
datasets
# Train-Test Separation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
random_state=SEED)
# Train-Validation Separation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
test_size=0.25, random_state=SEED)
Python
table_name = "df_test"
# Create PySpark DataFrame from Pandas
df_test=spark.createDataFrame(X_test)
df_test.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark test DataFrame saved to delta table: {table_name}")
Tip
Note that SMOTE should only be applied to the training dataset. You must leave the test dataset in its original imbalanced distribution in order to get a valid approximation of how the machine learning model will perform on the original data, which represents the situation in production.
Python
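# A minimal sketch using the imblearn API installed earlier; apply SMOTE only to the training split
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=SEED)
X_res, y_res = sm.fit_resample(X_train, y_train)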
Model training
Train the model using Random Forest with maximum depth of 4 and 4 features
Python
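# A sketch of this training cell, assuming MLflow autologging and the SMOTE-resampled data;
# the registered model name "rfc1_sm" is illustrative
from sklearn.ensemble import RandomForestClassifier

mlflow.sklearn.autolog(registered_model_name='rfc1_sm')
rfc1_sm = RandomForestClassifier(max_depth=4, max_features=4, min_samples_split=3, random_state=1)
with mlflow.start_run(run_name="rfc1_sm"):
    rfc1_sm.fit(X_res, y_res)
    rfc1_sm.score(X_val, y_val)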
Train the model using Random Forest with maximum depth of 8 and 6 features
Python
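# A sketch of this cell, analogous to the previous one; "rfc2_sm" is an illustrative model name
mlflow.sklearn.autolog(registered_model_name='rfc2_sm')
rfc2_sm = RandomForestClassifier(max_depth=8, max_features=6, min_samples_split=3, random_state=1)
with mlflow.start_run(run_name="rfc2_sm"):
    rfc2_sm.fit(X_res, y_res)
    rfc2_sm.score(X_val, y_val)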
from lightgbm import LGBMClassifier

# lgbm_model
mlflow.lightgbm.autolog(registered_model_name='lgbm_sm')  # Register the trained model with autologging
lgbm_sm_model = LGBMClassifier(learning_rate=0.07,
                               max_delta_step=2,
                               n_estimators=100,
                               max_depth=10,
                               eval_metric="logloss",
                               objective='binary',
                               random_state=42)
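The fit call for this model isn't shown above; a minimal sketch, assuming the resampled training data produced by SMOTE, might be:

Python

with mlflow.start_run(run_name="lgbm_sm"):
    lgbm_sm_model.fit(X_res, y_res)
    lgbm_sm_model.score(X_val, y_val)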
Open the saved experiment from the workspace, load the machine learning
models, and then assess the performance of the loaded models on the validation
dataset.
Python
Python
Depending on your preference, either approach is fine and should offer identical performance. In this notebook, you'll choose the first approach in order to better demonstrate the MLflow autologging capabilities in Microsoft Fabric.
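As a sketch of the approach that loads a registered model back from the workspace with MLflow (the model name and version are assumptions):

Python

# Load version 1 of the model registered earlier and score the validation set
loaded_rfc1_sm = mlflow.sklearn.load_model("models:/rfc1_sm/1")
ypred_rfc1_sm = loaded_rfc1_sm.predict(X_val)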
Python
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
Confusion Matrix for Random Forest Classifier with maximum depth of 4 and 4
features
Python
Confusion Matrix for Random Forest Classifier with maximum depth of 8 and 6
features
Python
Python
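As a sketch, one of these confusion matrices could be computed and plotted with scikit-learn and seaborn; the prediction variable name is an assumption:

Python

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Confusion matrix for predictions on the validation set
cm = confusion_matrix(y_val, ypred_rfc1_sm)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()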
In this tutorial, you'll learn to import the registered LightGBMClassifier model that was
trained in part 3 using the Microsoft Fabric MLflow model registry, and perform batch
predictions on a test dataset loaded from a lakehouse.
Microsoft Fabric allows you to operationalize machine learning models with a scalable
function called PREDICT, which supports batch scoring in any compute engine. You can
generate batch predictions directly from a Microsoft Fabric notebook or from a given
model's item page. Learn about PREDICT .
To generate batch predictions on the test dataset, you'll use version 1 of the trained LightGBM model that demonstrated the best performance among all trained machine learning models. You'll load the test dataset into a Spark DataFrame and create an MLFlowTransformer object to generate batch predictions. You can then invoke the PREDICT function in one of three ways: with the Transformer API, with the Spark SQL API, or with a PySpark user-defined function (UDF).
Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview)
trial.
Switch to the Data Science experience by using the experience switcher icon on the
left side of your home page.
This is part 4 of 5 in the tutorial series. To complete this tutorial, first complete:
Part 1: Ingest data into a Microsoft Fabric lakehouse using Apache Spark.
Part 2: Explore and visualize data using Microsoft Fabric notebooks to learn more
about the data.
Part 3: Train and register machine learning models.
If you want to open the accompanying notebook for this tutorial, follow the instructions
in Prepare your system for data science to import the tutorial notebooks to your
workspace.
Or, if you'd rather copy and paste the code from this page, you can create a new
notebook.
Be sure to attach a lakehouse to the notebook before you start running code.
Important
Attach the same lakehouse you used in the other parts of this series.
Python
df_test = spark.read.format("delta").load("Tables/df_test")
display(df_test)
The columns from the test DataFrame that you need as input to the model (in this
case, you would need all of them).
A name for the new output column (in this case, predictions).
The correct model name and model version to generate the predictions (in this
case, lgbm_sm and version 1).
Python
from synapse.ml.predict import MLFlowTransformer

model = MLFlowTransformer(
    inputCols=list(df_test.columns),
    outputCol='predictions',
    modelName='lgbm_sm',
    modelVersion=1
)
Now that you have the MLFlowTransformer object, you can use it to generate batch
predictions.
Python
import pandas
predictions = model.transform(df_test)
display(predictions)
PREDICT with the Spark SQL API
The following code invokes the PREDICT function with the Spark SQL API.
Python
from pyspark.ml.feature import SQLTransformer

# Substitute model_name, model_version, and features with your own values (for example, 'lgbm_sm', 1, and df_test.columns)
sqlt = SQLTransformer().setStatement(
    f"SELECT PREDICT('{model_name}/{model_version}', {','.join(features)}) as predictions FROM __THIS__")
Python
# Substitute "model" and "features" below with values for your own model
name and feature columns
my_udf = model.to_udf()
features = df_test.columns
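A sketch of applying this UDF to score the test DataFrame (the column handling shown is an assumption, not code from this page):

Python

from pyspark.sql.functions import col

display(df_test.withColumn("predictions", my_udf(*[col(f) for f in features])))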
Note that you can also generate PREDICT code from a model's item page. Learn about
PREDICT .
Python
# Save predictions to lakehouse to be used for generating a Power BI report
table_name = "customer_churn_test_predictions"
predictions.write.format('delta').mode("overwrite").save(f"Tables/{table_name}")
print(f"Spark DataFrame saved to delta table: {table_name}")
Next step
Continue on to:
In this tutorial, you'll create a Power BI report from the predictions data that was
generated in Part 4: Perform batch scoring and save predictions to a lakehouse.
Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview)
trial.
Switch to the Data Science experience by using the experience switcher icon on the
left side of your home page.
This is part 5 of 5 in the tutorial series. To complete this tutorial, first complete:
Part 1: Ingest data into a Microsoft Fabric lakehouse using Apache Spark.
Part 2: Explore and visualize data using Microsoft Fabric notebooks to learn more
about the data.
Part 3: Train and register machine learning models.
Part 4: Perform batch scoring and save predictions to a lakehouse.
3. Select the lakehouse that you used in the previous parts of the tutorial series.
5. Give the dataset a name, such as "bank churn predictions." Then select the
customer_churn_test_predictions dataset.
6. Select Confirm.
a. Select New measure in the top ribbon. This action adds a new item named
Measure to the customer_churn_test_predictions dataset, and opens a formula
bar above the table.
b. To determine the average predicted churn rate, replace Measure = in the
formula bar with:
DAX
c. To apply the formula, select the check mark in the formula bar. The new
measure appears in the data table. The calculator icon shows it was created as a
measure.
2. Add a new measure that counts the total number of bank customers. You'll need it
for the rest of the new measures.
a. Select New measure in the top ribbon to add a new item named Measure to
the customer_churn_test_predictions dataset. This action also opens a formula
bar above the table.
DAX
Customers = COUNT(customer_churn_test_predictions[predictions])
c. Select the check mark in the formula bar to apply the formula.
a. Select New measure in the top ribbon to add a new item named Measure to
the customer_churn_test_predictions dataset. This action also opens a formula
bar above the table.
b. To determine the churn rate for Germany, replace Measure = in the formula bar
with:
DAX
This filters the rows down to the ones with Germany as their geography
(Geography_Germany equals one).
c. To apply the formula, select the check mark in the formula bar.
4. Repeat the above step to add the churn rates for France and Spain.
Python
Python
2. In the Visualizations panel, select the Card icon. From the Data pane, select Churn
Rate. Change the font size and background color in the Format panel. Drag this
visualization to the top right of the report.
3. In the Visualizations panel, select the Line and stacked column chart icon. Select
age for the x-axis, Churn Rate for column y-axis, and Customers for the line y-axis.
4. In the Visualizations panel, select the Line and stacked column chart icon. Select
NumOfProducts for x-axis, Churn Rate for column y-axis, and Customers for the
line y-axis.
5. In the Visualizations panel, select the Stacked column chart icon. Select
NewCreditsScore for x-axis and Churn Rate for y-axis.
6. In the Visualizations panel, select the Clustered column chart card. Select
Germany Churn, Spain Churn, France Churn in that order for the y-axis.
Note
This report represents an illustrated example of how you might analyze the saved prediction results in Power BI. However, for a real customer churn use-case, you may have to do more thorough ideation of what visualizations to create, based on your subject matter expertise, and what your firm and business analytics team have standardized as metrics.
Customers who use more than two of the bank products have a higher churn rate
although few customers had more than two products. The bank should collect
more data, but also investigate other features correlated with more products (see
the plot in the bottom left panel).
Bank customers in Germany have a higher churn rate than in France and Spain (see
the plot in the bottom right panel), which suggests that an investigation into what
has encouraged customers to leave could be beneficial.
Most customers are middle-aged (between 25 and 45), and customers between 45 and 60 tend to exit more.
Finally, customers with lower credit scores would most likely leave the bank for other financial institutions. The bank should look into ways to encourage customers with lower credit scores and account balances to stay with the bank.
Next step
This completes the five part tutorial series. See other end-to-end sample tutorials:
The Microsoft Fabric notebook is a primary code item for developing Apache Spark jobs and machine learning experiments. It's a web-based interactive surface that data scientists and data engineers use to write code, benefiting from rich visualizations and Markdown text. Data engineers write code for data ingestion, data preparation, and data transformation. Data scientists also use notebooks to build machine learning solutions, including creating experiments and models, model tracking, and deployment.
This article describes how to use notebooks in data science and data engineering
experiences.
Create notebooks
You can either create a new notebook or import an existing notebook.
Export a notebook
You can export your notebook to other standard formats. A notebook can be exported to:
Save a notebook
In Microsoft Fabric, a notebook will by default save automatically after you open and
edit it; you don't need to worry about losing code changes. You can also use Save a
copy to clone another copy in the current workspace or to another workspace.
If you prefer to save a notebook manually, switch to Manual save mode (Edit > Save options > Manual) to keep a "local branch" of your notebook item, and then use Save or Ctrl+S to save your changes.
You can navigate to different lakehouses in the lakehouse explorer and set one
lakehouse as the default by pinning it. It will then be mounted to the runtime working
directory and you can read or write to the default lakehouse using a local path.
Note
You need to restart the session after pinning a new lakehouse or renaming the
default lakehouse.
Select Add lakehouse to add more lakehouses to the notebook, either by adding an
existing one or creating a new lakehouse.
You can copy the path of the selected file or folder in different formats and use the corresponding path in your code.
Notebook resources
The notebook resource explorer provides a Unix-like file system to help you manage
your folders and files. It offers a writeable file system space where you can store small-
sized files, such as code modules, datasets, and images. You can easily access them with
code in the notebook as if you were working with your local file system.
The built-in folder is a system-predefined folder for each notebook instance. It reserves up to 500 MB of storage for the dependencies of the current notebook. These are the key capabilities of notebook resources:
You can easily move your validated data to a Lakehouse via the Write to
Lakehouse option. We have embedded rich code snippets for common file types
to help you quickly get started.
These resources are also available for use in the Reference Notebook run case via
mssparkutils.notebook.run() .
Note
The relative path “builtin/” will always point to the root notebook’s built-in
folder.
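For instance, a small file stored in the built-in folder can be read with a relative path; the file name here is hypothetical:

Python

import pandas as pd

# Read a CSV stored in this notebook's built-in resources folder
# ("sample.csv" is a hypothetical file uploaded via the resource explorer)
df = pd.read_csv("builtin/sample.csv")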
Collaborate in a notebook
The Microsoft Fabric notebook is a collaborative item that supports multiple users
editing the same notebook.
When you open a notebook, you enter the co-editing mode by default, and every
notebook edit will be auto-saved. If your colleagues open the same notebook at the
same time, you see their profile, run output, cursor indicator, selection indicator and
editing trace. By leveraging the collaborating features, you can easily accomplish pair
programming, remote debugging, and tutoring scenarios.
Share a notebook
Sharing a notebook is a convenient way for you to collaborate with team members.
Authorized workspace roles can view or edit/run notebooks by default. You can share a
notebook with specified permissions granted.
2. Select the corresponding category of people who can view this notebook. You can
check Share, Edit, or Run to grant the permissions to the recipients.
3. After you "Apply" the selection, you can either send the notebook directly or copy
the link to others, and then the recipients can open the notebook with the
corresponding view granted by the permission.
4. To further manage your notebook permissions, you can find the "Manage
permissions" entry in the Workspace item list > More options to update the
existing notebook access and permission.
Comment a code cell
Commenting is another useful feature during collaborative scenarios. Currently, we
support adding cell-level comments.
1. Select the Comments button on the notebook toolbar or cell comment indicator
to open the Comments pane.
2. Select code in the code cell, select New in the Comments pane, add comments,
and then select the post comment button to save.
3. You can perform Edit comment, Resolve thread, or Delete thread by selecting the More button beside your comment.
Editing mode: You can edit and run the cells and collaborate with others on the
notebook.
Viewing mode: You can only view the cell content, output, and comments of the notebook; all operations that would change the notebook are disabled.
Next steps
Author and execute notebooks
Microsoft Fabric notebook is a primary code item for developing Apache Spark jobs and
machine learning experiments. It's a web-based interactive surface used by data scientists
and data engineers to write code benefiting from rich visualizations and Markdown text. This
article explains how to develop notebooks with code cell operations and run them.
Develop notebooks
Notebooks consist of cells, which are individual blocks of code or text that can be run
independently or as a group.
Add a cell
Set a primary language
Use multiple languages
IDE-style IntelliSense
Code snippets
Drag and drop to insert snippets
Drag and drop to insert images
Format text cell with toolbar buttons
Undo or redo cell operation
Move a cell
Delete a cell
Collapse a cell input
Collapse a cell output
Lock or freeze a cell
Notebook contents
Markdown folding
Find and replace
Add a cell
There are multiple ways to add a new cell to your notebook.
1. Hover over the space between two cells and select Code or Markdown.
2. Use Shortcut keys under command mode. Press A to insert a cell above the current cell.
Press B to insert a cell below the current cell.
PySpark (Python)
Spark (Scala)
Spark SQL
SparkR
You can set the primary language for new added cells from the dropdown list in the top
command bar.
IDE-style IntelliSense
Microsoft Fabric notebooks are integrated with the Monaco editor to bring IDE-style
IntelliSense to the cell editor. Syntax highlighting, error markers, and automatic code completion help you write code and identify issues more quickly.
The IntelliSense features are at different levels of maturity for different languages. The
following table shows what's supported:
Note
An active Spark session is required to make use of the IntelliSense code completion.
Code snippets
Microsoft Fabric notebooks provide code snippets that help you write commonly used code
patterns easily, like:
Snippets appear among the IDE-style IntelliSense suggestions, mixed with other suggestions. The code snippet contents align with the code cell language. You can see available snippets by typing Snippet or any keyword that appears in the snippet title in the code cell editor. For example, by typing read you can see the list of snippets to read data from various data sources.
Drag and drop to insert snippets
You can use drag and drop to read data from the Lakehouse explorer conveniently. Multiple file types are supported; you can operate on text files, tables, images, and so on. You can drop onto either an existing cell or a new cell. The notebook generates the code snippet accordingly to preview the data.
Insert or delete cell: You can revoke delete operations by selecting Undo; the text content is kept along with the cell.
Reorder cell.
Toggle parameter.
Convert between code cell and Markdown cell.
Note
In-cell text operations and code cell commenting operations can't be undone. You can
undo or redo up to the latest 10 historical cell operations.
Move a cell
You can drag from the empty part of a cell and drop it to the desired position.
You can also move the selected cell using Move up and Move down on the ribbon.
Delete a cell
To delete a cell, select the delete button at the right hand of the cell.
You can also use shortcut keys under command mode. Press Shift+D to delete the current
cell.
Notebook contents
The Outlines or Table of Contents presents the first markdown header of any markdown cell
in a sidebar window for quick navigation. The Outlines sidebar is resizable and collapsible to
fit the screen in the best ways possible. You can select the Contents button on the notebook
command bar to open or hide the sidebar.
Markdown folding
The markdown folding allows you to hide cells under a markdown cell that contains a
heading. The markdown cell and its hidden cells are treated the same as a set of contiguous
multi-selected cells when performing cell operations.
Run notebooks
You can run the code cells in your notebook individually or all at once. The status and
progress of each cell is represented in the notebook.
Run a cell
There are several ways to run the code in a cell.
1. Hover on the cell you want to run and select the Run Cell button or press Ctrl+Enter.
2. Use Shortcut keys under command mode. Press Shift+Enter to run the current cell and
select the next cell. Press Alt+Enter to run the current cell and insert a new cell.
Stop session
Stop session cancels the running and waiting cells and stops the current session. You can
restart a brand new session by selecting the run button again.
Note
The %run command currently only supports reference notebooks in the same workspace as the current notebook.
The %run command doesn't support nested references with a depth greater than five.
Variable explorer
The Microsoft Fabric notebook provides a built-in variable explorer that shows the list of variable names, types, lengths, and values in the current Spark session for PySpark (Python) cells. More variables show up automatically as they're defined in the code cells. Clicking on each column header sorts the variables in the table.
You can select the Variables button on the notebook ribbon “View” tab to open or hide the
variable explorer.
You can also find the Cell level real-time log next to the progress indicator, and Diagnostics
can provide you with useful suggestions to help refine and debug the code.
In More actions, you can easily navigate to the Spark application details page and Spark
web UI page.
Secret redaction
To prevent credentials being accidentally leaked when running notebooks, Fabric notebooks support Secret redaction, which replaces secret values displayed in the cell output with [REDACTED]. Secret redaction is applicable for Python, Scala, and R.
Built-in magics
You can use familiar IPython magic commands in Fabric notebooks. Review the following list of currently available magic commands.
Note
Available line magics: %lsmagic , %time , %timeit , %history , %run, %load , %alias,
%alias_magic, %autoawait, %autocall, %automagic, %bookmark, %cd, %colors, %dhist, %dirs,
%doctest_mode, %killbgscripts, %load_ext, %logoff, %logon, %logstart, %logstate, %logstop,
%magic, %matplotlib, %page, %pastebin, %pdef, %pfile, %pinfo, %pinfo2, %popd, %pprint,
%precision, %prun, %psearch, %psource, %pushd, %pwd, %pycat, %quickref, %rehashx,
%reload_ext, %reset, %reset_selective, %sx, %system, %tb, %unalias, %unload_ext, %who,
%who_ls, %whos, %xdel, %xmode.
Fabric notebooks also support the improved library management commands %pip and %conda; check Manage Apache Spark libraries in Microsoft Fabric for usage details.
IPython Widgets
IPython widgets are eventful Python objects that have a representation in the browser. You can use IPython widgets as low-code controls (for example, a slider or text box) in your notebook, just like in Jupyter notebooks. Currently, they only work in a Python context.
2. You can use top-level display function to render a widget, or leave an expression of
widget type at the last line of code cell.
Python
slider = widgets.IntSlider()
display(slider)
Python
slider = widgets.IntSlider()
display(slider)
4. You can use multiple display() calls to render the same widget instance multiple times,
but they remain in sync with each other.
Python
slider = widgets.IntSlider()
display(slider)
display(slider)
5. To render two widgets independent of each other, create two widget instances:
Python
slider1 = widgets.IntSlider()
slider2 = widgets.IntSlider()
display(slider1)
display(slider2)
Supported widgets
String widgets: Text, Text area, Combobox, Password, Label, HTML, HTML Math, Image, Button
Known limitations
1. The following widgets aren't supported yet; you can use the corresponding workarounds:
Output widget: Use the print() function instead to write text to stdout.
widgets.jslink(): Use the widgets.link() function to link two similar widgets.
2. The global display function provided by Fabric doesn't support displaying multiple widgets in one call (for example, display(a, b)), which is different from the IPython display function.
3. If you close a notebook that contains an IPython widget, you can't see or interact with it until you execute the corresponding cell again.
Python
import logging
Integrate a notebook
The parameter cell is useful for integrating a notebook in a pipeline. Pipeline activity looks
for the parameters cell and treats this cell as the default for the parameters passed in at
execution time. The execution engine adds a new cell beneath the parameters cell with input
parameters in order to overwrite the default values.
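For example, a parameters cell might look like the following sketch; the parameter names and defaults here are hypothetical, and a pipeline can override them at execution time:

Python

# Parameters cell (marked via "Toggle parameter" on the cell)
# These hypothetical defaults are overridden by values passed in from the pipeline
input_date = "2023-01-01"
max_rows = 1000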
Shortcut keys
Similar to Jupyter Notebooks, Fabric notebooks have a modal user interface. The keyboard
does different things depending on which mode the notebook cell is in. Fabric notebooks
support the following two modes for a given code cell: Command mode and Edit mode.
1. A cell is in Command mode when there's no text cursor prompting you to type. When a
cell is in Command mode, you can edit the notebook as a whole but not type into
individual cells. Enter Command mode by pressing ESC or using the mouse to select
outside of a cell's editor area.
2. Edit mode is indicated by a text cursor prompting you to type in the editor area. When a cell is in Edit mode, you can type into the cell. Enter Edit mode by pressing Enter or using the mouse to select a cell's editor area.
Move cursor up: Up
Undo: Ctrl + Z
Redo: Ctrl + Y
Indent: Ctrl + ]
Dedent: Ctrl + [
You can easily find all shortcut keys from the notebook ribbon View -> Keybindings.
Next steps
Notebook visualization
Introduction of Fabric MSSparkUtils
Microsoft Fabric is an integrated analytics service that accelerates time to insight, across
data warehouses and big data analytics systems. Data visualization in notebook is a key
component in being able to gain insight into your data. It helps make big and small data
easier for humans to understand. It also makes it easier to detect patterns, trends, and
outliers in groups of data.
When you use Apache Spark in Microsoft Fabric, there are various built-in options to
help you visualize your data, including Microsoft Fabric notebook chart options, and
access to popular open-source libraries.
display(df) function
The display function allows you to turn SQL query results and Apache Spark dataframes into rich data visualizations. The display function can be used on dataframes created in PySpark and Scala.
1. The output of %%sql magic commands appears in the rendered table view by default. You can also call display(df) on Spark DataFrames or Resilient Distributed Datasets (RDD) to produce the rendered table view.
2. Once you have a rendered table view, switch to the Chart view.
3. You can now customize your visualization by specifying the following values:
Chart Type: The display function supports a wide range of chart types, including bar charts, scatter plots, line graphs, and more.
Note
By default the display(df) function will only take the first 1000 rows of the data
to render the charts. Select Aggregation over all results and then select
Apply to apply the chart generation from the whole dataset. A Spark job will
be triggered when the chart setting changes. Please note that it may take
several minutes to complete the calculation and render the chart.
4. Once done, you can view and interact with your final visualization!
displayHTML() option
Microsoft Fabric notebooks support HTML graphics using the displayHTML function.
Python
displayHTML("""<!DOCTYPE html>
<meta charset="utf-8">

<!-- Create a div where the graph will take place -->
<div id="my_dataviz"></div>
<script>
// Create Data
var data = [12,19,11,13,12,22,13,4,15,16,18,19,20,12,11,9]
// (The rest of the D3.js drawing code is omitted in this excerpt)
</script>
"""
)
Popular libraries
When it comes to data visualization, Python offers multiple graphing libraries that come
packed with many different features. By default, every Apache Spark pool in Microsoft
Fabric contains a set of curated and popular open-source libraries.
Matplotlib
You can render standard plotting libraries, like Matplotlib, using the built-in rendering
functions for each library.
Python
# Bar chart
import matplotlib.pyplot as plt

x1 = [1, 3, 4, 5, 6, 7, 9]
y1 = [4, 7, 2, 4, 7, 8, 3]
x2 = [2, 4, 6, 8, 10]
y2 = [5, 6, 2, 6, 2]

# Plot the two bar series (the labels and colors are illustrative)
plt.bar(x1, y1, label="Blue bars", color='b')
plt.bar(x2, y2, label="Green bars", color='g')
plt.xlabel("bar number")
plt.ylabel("bar height")
plt.title("Bar Chart Example")
plt.legend()
plt.show()
Bokeh
You can render HTML or interactive libraries, like bokeh, using displayHTML().
The following image is an example of plotting glyphs over a map using bokeh.
Python
from bokeh.tile_providers import get_provider, Vendors

tile_provider = get_provider(Vendors.CARTODBPOSITRON)
Plotly
You can render HTML or interactive libraries like Plotly, using the displayHTML().
Python
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv",
                 dtype={"fips": str})
import plotly
import plotly.express as px
Pandas
You can view the HTML output of a pandas DataFrame as the default output; the notebook automatically shows the styled HTML content.
Python
import pandas as pd
import numpy as np

# Build a small DataFrame with MultiIndex column headers (the sample values are illustrative)
df = pd.DataFrame(np.random.randn(2, 6),
                  columns=pd.MultiIndex.from_product([['Decision Tree', 'Regression', 'Random'],
                                                      ['Tumour', 'Non-Tumour']],
                                                     names=['Model:', 'Predicted:']))
df
Next steps
Use a notebook with lakehouse to explore your data
Explore the data in your lakehouse with
a notebook
Article • 06/08/2023
In this tutorial, learn how to explore the data in your lakehouse with a notebook.
Prerequisites
To get started, you need the following prerequisites:
In the lakehouse list, the pin icon next to the name of a lakehouse indicates that it's the default lakehouse in your current notebook. In the notebook code, if only a relative path is provided to access the data from Microsoft Fabric OneLake, the default lakehouse serves as the root folder at run time.
Note
The default lakehouse decides which Hive metastore to use when running the notebook with Spark SQL. If multiple lakehouses are added to the notebook, make sure that when Spark SQL is used, the target lakehouse and the current default lakehouse are from the same workspace.
To remove all the lakehouses from the notebook, click "Remove all Lakehouses" in the
lakehouse list.
Select Add lakehouse to add more lakehouses to the notebook. You can either add an
existing one or create a new one.
Next steps
How to use a notebook to load data into your lakehouse
Use a notebook to load data into your
Lakehouse
Article • 05/23/2023
In this tutorial, learn how to read and write data in your lakehouse with a notebook. The Spark API and Pandas API are both supported to achieve this goal.
To specify the location to read from, you can use the relative path if the data is from the default lakehouse of the current notebook, or you can use the absolute ABFS path if the data is from another lakehouse. You can copy this path from the context menu of the data:
Copy ABFS path: returns the absolute path of the file.
Copy relative path for Spark: returns the relative path of the file in the default lakehouse.
Python
# Keep it if you want to save the dataframe as Parquet files to the Files section of the default lakehouse
df.write.mode("overwrite").format("parquet").save("Files/" + parquet_table_name)

# Keep it if you want to save the dataframe as a delta table
df.write.mode("overwrite").format("delta").saveAsTable(delta_table_name)

# Keep it if you want to save the dataframe as a delta lake, appending the data to an existing table
df.write.mode("append").format("delta").saveAsTable(delta_table_name)
Python
# Keep it if you want to read parquet file with Pandas from the default lakehouse mount point
import pandas as pd
df = pd.read_parquet("/lakehouse/default/Files/sample.parquet")

# Keep it if you want to read parquet file with Pandas from the absolute abfss path
import pandas as pd
df = pd.read_parquet("abfss://[email protected]/Marketing_LH.Lakehouse/Files/sample.parquet")
Tip
For Spark API, please use the option of Copy ABFS path or Copy relative path for
Spark to get the path of the file. For Pandas API, please use the option of Copy
ABFS path or Copy File API path to get the path of the file.
The quickest way to get working code for the Spark API or the Pandas API is to use the Load data option and select the API you want to use. The code is automatically generated in a new code cell of the notebook.
Next steps
Explore the data in your lakehouse with a notebook
How-to use end-to-end AI samples in
Microsoft Fabric
Article • 09/15/2023
The Synapse Data Science experience in the Microsoft Fabric SaaS platform enables ML professionals to easily and frictionlessly build, deploy, and operationalize their machine learning models in a single analytics platform, while collaborating with other key roles. Begin here to understand the various capabilities the Synapse Data Science experience offers, and explore examples of how ML models can address common business problems.
Use the in-line installation capabilities (such as %pip or %conda ) in your notebook
Install libraries directly in your current workspace
You can use the in-line installation capabilities (for example, %pip or %conda ) within your
notebook to install new libraries. This installation option would install the libraries only
in the current notebook and not in the workspace.
To install a library, use the following code, replacing <library name> with the name of
your desired library, such as imblearn , wordcloud , etc.
Python
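# Replace <library name> with your desired library, for example imblearn or wordcloud
%pip install <library name>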
Important
Only your Workspace admin has access to update the Workspace-level settings.
For more information on installing libraries in your workspace, see Install workspace
libraries.
Recommender
An online bookstore is looking to increase sales by providing customized recommendations. Using customer book rating data, this sample shows how to clean and explore the data, and then develop and deploy a recommendation model to provide predictions.
Fraud detection
As unauthorized transactions increase, detecting credit card fraud in real time helps financial institutions give their customers a faster turnaround time on resolution. This end-to-end sample includes preprocessing, training, model storage, and inferencing. The training section reviews implementing multiple models and methods that address challenges like imbalanced examples and trade-offs between false positives and false negatives.
Forecasting
Using historical New York City Property Sales data and Facebook Prophet in this sample,
we'll build a time series model with the trend, seasonality and holiday information to
forecast what sales will look like in future cycles.
Text classification
In this sample, we'll predict whether a book in the British Library is fiction or non-fiction
based on book metadata. This will be accomplished by applying text classification with
word2vec and linear-regression model on Spark.
Uplift model
In this sample, we'll estimate the causal impact of certain treatments on an individual's behavior by using an uplift model. We'll walk through, step by step, how to create, train, and evaluate the model, touching on four core learnings.
Predictive maintenance
In this tutorial, you proactively predict mechanical failures. This is accomplished by
training multiple models on historical data such as temperature and rotational speed,
then determining which model is the best fit for predicting future failures.
Follow along in the Predictive maintenance tutorial.
Next steps
How to use Microsoft Fabric notebooks
Machine learning model in Microsoft Fabric
In this tutorial, you'll see a Microsoft Fabric data science workflow with an end-to-end example. The scenario is to build a model to predict whether bank customers will churn. The churn rate, also known as the rate of attrition, refers to the rate at which bank customers stop doing business with the bank.
) Important
Prerequisites
A Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview) trial.
Switch to the Data Science experience using the experience switcher icon at the left corner of your homepage.
If you don't have a Microsoft Fabric lakehouse, create one by following the steps in Create a lakehouse in Microsoft Fabric.
If you'd like to open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science to import
the tutorial notebooks to your workspace.
Or, create a new notebook if you'd rather copy/paste the code from this page.
Be sure to attach a lakehouse to the notebook before you start running code.
7 Note
The PySpark kernel will restart after %pip install . Install libraries before you run any other cells.
Python
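As a sketch of this installation step, the inline install of the imblearn library (used for SMOTE later in this tutorial) looks like the following:

# Install imblearn; the PySpark kernel restarts after this command
%pip install imblearn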
The dataset also includes columns such as row number, customer ID, and customer surname that should have no impact on a customer's decision to leave the bank. The event that defines a customer's churn is the closing of the customer's bank account; therefore, the exit column in the dataset refers to the customer's abandonment. Since you don't have much context about these attributes, you'll proceed without background information about the dataset. Your aim is to understand how these attributes contribute to the exit status.
Out of the 10,000 customers, only 2037 customers (around 20%) have left the bank. Therefore, given the class imbalance ratio, we recommend generating synthetic data. Moreover, confusion matrix accuracy might not be meaningful for imbalanced classification, and it might be better to also measure the accuracy using the Area Under the Precision-Recall Curve (AUPRC).
churn.csv contains columns such as CustomerID, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, and others.
Introduction to SMOTE
The problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the
decision boundary. Synthetic Minority Oversampling Technique (SMOTE) is the most widely used approach to synthesize new samples for
the minority class. Learn more about SMOTE here and here .
You will be able to access SMOTE using the imblearn library that you installed in Step 1.
Tip
By defining the following parameters, you can use this notebook with different datasets easily.
Python
IS_CUSTOM_DATA = False # if True, the dataset has to be uploaded manually (assumed parameter, mirroring the rest of this setup cell)
IS_SAMPLE = False # if True, use only SAMPLE_ROWS of data for training, otherwise use all data
SAMPLE_ROWS = 5000 # if IS_SAMPLE is True, use only this number of rows for training
DATA_ROOT = "/lakehouse/default"
DATA_FOLDER = "Files/churn" # folder with data files
DATA_FILE = "churn.csv" # data file name
This code downloads a publicly available version of the dataset and then stores it in a Fabric lakehouse.
) Important
Make sure you add a lakehouse to the notebook before running it. Failure to do so will result in an error.
Python
import os, requests
if not IS_CUSTOM_DATA:
# Using synapse blob, this can be done in one line
if not os.path.exists("/lakehouse/default"):
raise FileNotFoundError(
"Default lakehouse not found, please add a lakehouse and restart the session."
)
os.makedirs(download_path, exist_ok=True)
for fname in file_list:
if not os.path.exists(f"{download_path}/{fname}"):
r = requests.get(f"{remote_url}/{fname}", timeout=30)
with open(f"{download_path}/{fname}", "wb") as f:
f.write(r.content)
print("Downloaded demo data files into lakehouse.")
Python
import time

ts = time.time()  # record the current time
Python
df = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv("Files/churn/raw/churn.csv")
.cache()
)
Python
df = df.toPandas()
Python
Python
display(df, summary=True)
7 Note
Data Wrangler cannot be opened while the notebook kernel is busy. The cell execution must complete before you launch Data Wrangler. Learn more about Data Wrangler .
Once the Data Wrangler is launched, a descriptive overview of the displayed data panel is generated as shown in the following images. It
includes information about the DataFrame's dimension, missing values, etc. You can then use Data Wrangler to generate the script for
dropping the rows with missing values, the duplicate rows and the columns with specific names, then copy the script into a cell. The next
cell shows that copied script.
Python
def clean_data(df):
    # Drop rows with missing data across all columns
    df.dropna(inplace=True)
    # Drop duplicate rows in columns: 'RowNumber', 'CustomerId'
    df.drop_duplicates(subset=['RowNumber', 'CustomerId'], inplace=True)
    # Drop columns: 'RowNumber', 'CustomerId', 'Surname'
    df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
    return df

df_clean = clean_data(df.copy())
Determine attributes
Use this code to determine categorical, numerical, and target attributes.
Python
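The contents of this cell aren't reproduced here. A minimal sketch that builds the attribute lists used by the following cells (the exact selection rules are an assumption) could look like this:

# Assumed attribute selection rules; adjust them to your dataset
dependent_variable_name = "Exited"  # target attribute
categorical_variables = [
    col for col in df_clean.columns
    if (df_clean[col].dtype == "object" or df_clean[col].nunique() <= 5)
    and col != dependent_variable_name
]
numeric_variables = [
    col for col in df_clean.columns
    if df_clean[col].dtype != "object" and df_clean[col].nunique() > 5
]
print(categorical_variables)
print(numeric_variables)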
Python
import matplotlib.pyplot as plt
import seaborn as sns

df_num_cols = df_clean[numeric_variables]
sns.set(font_scale = 0.7)
fig, axes = plt.subplots(nrows = 2, ncols = 3, gridspec_kw = dict(hspace=0.3), figsize = (17,8))
fig.tight_layout()
for ax, col in zip(axes.flatten(), df_num_cols.columns):
    sns.boxplot(x = df_num_cols[col], color='green', ax = ax)
# fig.suptitle('visualize and compare the distribution and central tendency of numerical attributes', color = 'k', fontsize = 12)
fig.delaxes(axes[1,2])
Distribution of exited and non-exited customers
Show the distribution of exited versus non-exited customers across the categorical attributes.
Python
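The plotting cell isn't shown here. A possible sketch, assuming the categorical_variables list from the earlier sketch, is:

# Count plots of exited versus non-exited customers per categorical attribute
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 8))
for ax, col in zip(axes.flatten(), categorical_variables):
    sns.countplot(x=col, hue="Exited", data=df_clean, ax=ax)
fig.tight_layout()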
Python
df_clean["NewTenure"] = df_clean["Tenure"]/df_clean["Age"]
df_clean["NewCreditsScore"] = pd.qcut(df_clean['CreditScore'], 6, labels = [1, 2, 3, 4, 5, 6])
df_clean["NewAgeScore"] = pd.qcut(df_clean['Age'], 8, labels = [1, 2, 3, 4, 5, 6, 7, 8])
df_clean["NewBalanceScore"] = pd.qcut(df_clean['Balance'].rank(method="first"), 5, labels = [1, 2, 3, 4, 5])
df_clean["NewEstSalaryScore"] = pd.qcut(df_clean['EstimatedSalary'], 10, labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
table_name = "df_clean"
# Create PySpark DataFrame from Pandas
sparkDF=spark.createDataFrame(df_clean)
sparkDF.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")
Use scikit-learn and lightgbm to implement the models within a few lines of code. Also use MLflow and Fabric autologging to track the experiments.
Here you'll load the delta table from the lakehouse. You can also use other delta tables from the lakehouse as the source.
Python
SEED = 12345
df_clean = spark.read.format("delta").load("Tables/df_clean").toPandas()
Generate experiment for tracking and logging the models using MLflow
This section demonstrates how to generate an experiment, specify model and training parameters as well as scoring metrics, train the
models, log them, and save the trained models for later use.
Python
import mlflow
Microsoft Fabric extends the MLflow autologging capabilities: autologging automatically captures the values of input parameters and output metrics of a machine learning model as it is being trained. This information is then logged to your workspace, where it can be accessed and visualized using the MLflow APIs or the corresponding experiment in your workspace.
When complete, your experiment will look like this image. All the experiments with their respective names are logged and you'll be able to
track their parameters and performance metrics. To learn more about autologging, see Autologging in Microsoft Fabric .
Set experiment and autologging specifications
Python
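The cell contents aren't shown here. A minimal sketch, with an assumed experiment name, is:

# Assumed experiment name; set the MLflow experiment and widen Fabric autologging
EXPERIMENT_NAME = "sample-bank-churn-experiment"  # hypothetical name
mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(exclusive=False)  # also log custom metrics alongside the autologged ones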
from sklearn.model_selection import train_test_split

y = df_clean["Exited"]
X = df_clean.drop("Exited", axis=1)
# Train-test separation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=SEED)
Apply SMOTE to the training data to synthesize new samples for the minority class
SMOTE should be applied only to the training dataset. You must leave the test dataset in its original imbalanced distribution to get a valid approximation of how the model will perform on the original data, which represents the situation in production.
Python
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=SEED)
X_res, y_res = sm.fit_resample(X_train, y_train)
new_train = pd.concat([X_res, y_res], axis=1)
Model training
Train the model using Random Forest with maximum depth of four, with four features.
Python
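The training cell isn't reproduced here. A minimal sketch under the stated hyperparameters (the run name and any other settings are assumptions) follows; the second model, with maximum depth of eight and six features, uses the same pattern and produces ypred_rfc2_sm.

# Random forest with maximum depth 4 and 4 features, trained on the SMOTE-balanced data
from sklearn.ensemble import RandomForestClassifier

mlflow.sklearn.autolog(registered_model_name="rfc1_sm")  # register the model through autologging
rfc1_sm = RandomForestClassifier(max_depth=4, max_features=4, random_state=SEED)
with mlflow.start_run(run_name="rfc1_sm"):
    rfc1_sm.fit(X_res, y_res)
    ypred_rfc1_sm = rfc1_sm.predict(X_test)  # predictions used later in df_pred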
Train the model using Random Forest with maximum depth of eight, with six features.
Python
Python
# lgbm_model
from lightgbm import LGBMClassifier

mlflow.lightgbm.autolog(registered_model_name='lgbm_sm') # Register the trained model with autologging
lgbm_sm_model = LGBMClassifier(learning_rate = 0.07,
                               max_delta_step = 2,
                               n_estimators = 100,
                               max_depth = 10,
                               eval_metric = "logloss",
                               objective = 'binary',
                               random_state = 42)
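The cell that fits this model isn't shown. A minimal sketch (the run name is an assumption) is:

# Fit the LightGBM model on the SMOTE-balanced data and score the test set
with mlflow.start_run(run_name="lgbm_sm"):
    lgbm_sm_model.fit(X_res, y_res)  # balanced training data
    ypred_lgbm1_sm = lgbm_sm_model.predict(X_test)  # predictions used later in df_pred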
Python
Python
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
Create a confusion matrix for Random Forest Classifier with maximum depth of four, with four features.
Python
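The plotting cell isn't reproduced here. A minimal sketch using scikit-learn's confusion matrix utilities (an assumption; the original may use the seaborn helper shown above) follows; the depth-8 random forest and the LightGBM model use the same pattern with ypred_rfc2_sm and ypred_lgbm1_sm.

# Compute and plot the confusion matrix for the depth-4 random forest
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm_rfc1_sm = confusion_matrix(y_test, ypred_rfc1_sm)
ConfusionMatrixDisplay(cm_rfc1_sm).plot()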
Create a confusion matrix for Random Forest Classifier with maximum depth of eight, with six features.
Python
Python
Python
df_pred = X_test.copy()
df_pred['y_test'] = y_test
df_pred['ypred_rfc1_sm'] = ypred_rfc1_sm
df_pred['ypred_rfc2_sm'] = ypred_rfc2_sm
df_pred['ypred_lgbm1_sm'] = ypred_lgbm1_sm
table_name = "df_pred_results"
sparkDF=spark.createDataFrame(df_pred)
sparkDF.write.mode("overwrite").format("delta").option("overwriteSchema", "true").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")
Some example visualizations are shown here. The data panel shows the delta tables and columns from the table to select. Upon selecting
appropriate x and y axes, you can pick the filters and functions, for example, sum or average of the table column.
7 Note
We show an illustrated example of how you would analyze the saved prediction results in Power BI. However, for a real customer churn
use-case, the platform user may have to do more thorough ideation of what visualizations to create, based on subject matter
expertise, and what their firm and business analytics team has standardized as metrics.
The Power BI report shows that customers who use more than two of the bank products have a higher churn rate, although few customers had more than two products. The bank should collect more data, but also investigate other features correlated with more products (see the plot in the bottom left panel). Bank customers in Germany have a higher churn rate than in France and Spain (see the plot in the bottom right panel), which suggests that an investigation into what has encouraged customers to leave could be beneficial. There are more middle-aged customers (between 25 and 45), and customers between 45 and 60 tend to exit more. Finally, customers with lower credit scores would most likely leave the bank for other financial institutions. The bank should look into ways to encourage customers with lower credit scores and account balances to stay with the bank.
Next steps
Machine learning model in Microsoft Fabric
Train machine learning models
Machine learning experiments in Microsoft Fabric
In this tutorial, you walk through the data engineering and data science workflow with an end-to-end example. The scenario is to build a recommender for online book recommendations. The steps you take are described in the sections that follow.
) Important
There are different types of recommendation algorithms. This tutorial uses a model-based collaborative filtering algorithm named Alternating Least Squares (ALS) matrix factorization.
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, U and V, such that U * Vt ≈ R. Typically, these approximations are called 'factor' matrices.
The general approach is iterative. During each iteration, one of the factor matrices is held constant while the other is solved for using least squares. The newly solved factor matrix is then held constant while solving for the other factor matrix.
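To make the alternating procedure concrete, the following illustrative sketch (not part of the tutorial) runs a few ALS iterations on a tiny, fully observed ratings matrix with NumPy; real ALS implementations additionally mask missing ratings and work on sparse data.

# Illustrative only: a few rounds of alternating least squares on a tiny ratings matrix
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5.0, 3.0, 1.0], [4.0, 2.0, 1.0]])  # 2 users x 3 items
k = 2                                              # rank of the factor matrices
U = rng.normal(size=(R.shape[0], k))
V = rng.normal(size=(R.shape[1], k))
reg = 0.1 * np.eye(k)                              # regularization term

for _ in range(10):
    # Hold V constant and solve for U by least squares, then hold U constant and solve for V
    U = np.linalg.solve(V.T @ V + reg, V.T @ R.T).T
    V = np.linalg.solve(U.T @ U + reg, U.T @ R).T

print(np.round(U @ V.T, 2))  # approximation of R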
Prerequisites
A Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview) trial.
If you don't have a Microsoft Fabric lakehouse, create one by following the steps in Create a lakehouse in Microsoft Fabric.
If you'd like to open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science to import
the tutorial notebooks to your workspace.
Or, create a new notebook if you'd rather copy/paste the code from this page.
Be sure to attach a lakehouse to the notebook before you start running code.
Books.csv - Each book is identified with an International Standard Book Number (ISBN), with invalid dates already removed. Additional
information, such as the title, author, and publisher, has also been added. If a book has multiple authors, only the first is listed. URLs
point to Amazon for cover images in three different sizes.
Ratings.csv - Ratings for each book are either explicit (provided by users on a scale of 1-10) or implicit (observed without user input,
and indicated by 0).
User-ID   ISBN         Book-Rating
276725    034545104X   0
276726    0155061224   5
Users.csv - User IDs, which have been anonymized and mapped to integers. Demographic data such as location and age, are provided
if available. If unavailable, the value is null.
By defining the following parameters, you can more easily apply the code in this tutorial to different datasets.
Python
IS_CUSTOM_DATA = False # if True, dataset has to be uploaded manually

USER_ID_COL = "User-ID" # must not be '_user_id' for this notebook to run successfully
ITEM_ID_COL = "ISBN" # must not be '_item_id' for this notebook to run successfully
ITEM_INFO_COL = (
    "Book-Title" # must not be '_item_info' for this notebook to run successfully
)
RATING_COL = (
    "Book-Rating" # must not be '_rating' for this notebook to run successfully
)
IS_SAMPLE = True # if True, use only <SAMPLE_ROWS> rows of data for training; otherwise use all data
SAMPLE_ROWS = 5000 # if IS_SAMPLE is True, use only this number of rows for training

# Assumed folder, file, and experiment names referenced by later cells; adjust if your setup differs
DATA_FOLDER = "Files/book-recommendation" # folder that holds the datasets in the lakehouse
ITEMS_FILE = "Books.csv"
USERS_FILE = "Users.csv"
RATINGS_FILE = "Ratings.csv"
EXPERIMENT_NAME = "aisample-recommendation" # MLflow experiment name
) Important
Python
if not IS_CUSTOM_DATA:
    # Download data files into lakehouse if it does not exist
    import os, requests

    remote_url = "https://synapseaisolutionsa.blob.core.windows.net/public/Book-Recommendation-Dataset"
    file_list = ["Books.csv", "Ratings.csv", "Users.csv"]
    download_path = f"/lakehouse/default/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        raise FileNotFoundError(
            "Default lakehouse not found, please add a lakehouse and restart the session."
        )
    os.makedirs(download_path, exist_ok=True)
    for fname in file_list:
        if not os.path.exists(f"{download_path}/{fname}"):
            r = requests.get(f"{remote_url}/{fname}", timeout=30)
            with open(f"{download_path}/{fname}", "wb") as f:
                f.write(r.content)
    print("Downloaded demo data files into lakehouse.")
Python
import mlflow

mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(disable=True) # disable MLflow autologging
Python
df_items = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv(f"{DATA_FOLDER}/raw/{ITEMS_FILE}")
.cache()
)
df_ratings = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv(f"{DATA_FOLDER}/raw/{RATINGS_FILE}")
.cache()
)
df_users = (
spark.read.option("header", True)
.option("inferSchema", True)
.csv(f"{DATA_FOLDER}/raw/{USERS_FILE}")
.cache()
)
Python
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette() # adjusting plotting style
import pandas as pd # dataframes
Use the following code to look at the DataFrame that contains the book data:
Python
display(df_items, summary=True)
Add an _item_id column for later usage. The _item_id must be an integer for recommendation models. The following code uses
StringIndexer to transform ITEM_ID_COL to indices.
Python
df_items = (
StringIndexer(inputCol=ITEM_ID_COL, outputCol="_item_id")
.setHandleInvalid("skip")
.fit(df_items)
.transform(df_items)
.withColumn("_item_id", F.col("_item_id").cast("int"))
)
Display and check if the _item_id increases monotonically and successively as expected.
Python
display(df_items.sort(F.col("_item_id").desc()))
Use the following code to plot the top 10 authors with the maximum number of books. Agatha Christie is the leading author with over 600
books, followed by William Shakespeare.
Python
df_books = df_items.toPandas() # Create a pandas dataframe from the spark dataframe for visualization
plt.figure(figsize=(8,5))
sns.countplot(y="Book-Author",palette = 'Paired', data=df_books,order=df_books['Book-Author'].value_counts().index[0:10])
plt.title("Top 10 authors with maximum number of books")
Python
display(df_users, summary=True)
If there's a missing value in User-ID , drop the row with the missing value. It doesn't matter if your customized dataset has no missing values.
Python
df_users = df_users.dropna(subset=(USER_ID_COL))
display(df_users, summary=True)
Add a _user_id column for later use. The _user_id must be an integer for recommendation models. In the following code, you use StringIndexer to transform USER_ID_COL to indices.
The book dataset already contains a User-ID column, which is an integer. However, adding a _user_id column for compatibility with different datasets makes this example more robust. To add the _user_id column, use the following code:
Python
df_users = (
StringIndexer(inputCol=USER_ID_COL, outputCol="_user_id")
.setHandleInvalid("skip")
.fit(df_users)
.transform(df_users)
.withColumn("_user_id", F.col("_user_id").cast("int"))
)
Python
display(df_users.sort(F.col("_user_id").desc()))
Python
display(df_ratings, summary=True)
Get the distinct ratings and save them to a list named ratings for later use.
Python
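The cell contents aren't shown here. A minimal sketch that collects the distinct rating values is:

# Collect the distinct rating values into a list for the stratified split that follows
ratings = [i[0] for i in df_ratings.select(RATING_COL).distinct().collect()]
print(ratings)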
To display the top 10 books with the highest ratings, use the following code:
Python
plt.figure(figsize=(8,5))
sns.countplot(y="Book-Title",palette = 'Paired',data= df_books, order=df_books['Book-Title'].value_counts().index[0:10])
plt.title("Top 10 books per number of ratings")
"Selected Poems" is most favorable among users according to ratings. The books "Adventures of Huckleberry Finn", "The Secret Garden",
and "Dracula", have the same rating.
Merge data
Merge the three DataFrames into one DataFrame for a more comprehensive analysis.
Python
# Join the ratings with the user and book data (assumed inner joins on the ID columns)
df_all = df_ratings.join(df_users, USER_ID_COL, "inner").join(df_items, ITEM_ID_COL, "inner")
df_all_columns = [c for c in df_all.columns if c not in ["_user_id", "_item_id", RATING_COL]]

# Reorder the columns to ensure that _user_id, _item_id, and Book-Rating are the first three columns
df_all = (
    df_all.select(["_user_id", "_item_id", RATING_COL] + df_all_columns)
    .withColumn("id", F.monotonically_increasing_id())
    .cache()
)
display(df_all)
To display a count of the total distinct users, books, and interactions, use the following code:
Python
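The counting cell isn't reproduced here. A minimal sketch is:

# Count distinct users, distinct books, and total interactions
print(f"Total Users: {df_all.select('_user_id').distinct().count()}")
print(f"Total Items: {df_all.select('_item_id').distinct().count()}")
print(f"Total User-Item Interactions: {df_all.count()}")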
Python
Tip
The top <topn> books can be used for "Popular" or "Top purchased" recommendation sections.
Python
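The cell that computes the most popular books isn't shown. A possible sketch, with a hypothetical value for <topn>, is:

# Hypothetical value for <topn>; compute the most-rated books
topn = 10
top_books = (
    df_all.groupBy("_item_id", ITEM_INFO_COL)
    .count()
    .sort(F.col("count").desc())
    .limit(topn)
)
display(top_books)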
Python
if IS_SAMPLE:
    # Must sort by '_user_id' before performing limit to ensure ALS works normally
    # Note that if train and test datasets have no common _user_id, ALS will fail
    df_all = df_all.sort("_user_id").limit(SAMPLE_ROWS)

# Using a fraction between 0 and 1 returns the approximate size of the dataset, i.e., 0.8 means 80% of the dataset
# Rating = 0 means the user didn't rate the item, so it can't be used for training
# We use the 80% of the dataset with rating > 0 as the training dataset
fractions_train = {0: 0}
fractions_test = {0: 0}
for i in ratings:
    if i == 0:
        continue
    fractions_train[i] = 0.8
    fractions_test[i] = 1

# training dataset
train = df_all.sampleBy(RATING_COL, fractions=fractions_train)

# Join with leftanti will select all rows from df_all with rating > 0 and not in the train dataset, i.e., the remaining 20% of the dataset
# test dataset
test = df_all.join(train, on="id", how="leftanti").sampleBy(
    RATING_COL, fractions=fractions_test
)
To gain a better understanding of the data and the problem at hand, use the following code to compute the sparsity of the dataset.
Sparsity refers to the situation in which feedback data is sparse and not sufficient to identify similarities in users' interests.
Python
def get_mat_sparsity(ratings):
    # Hedged reconstruction of the sparsity computation described above
    count_nonzero = ratings.select(RATING_COL).count()  # number of observed ratings - used as numerator
    # Count the total number of distinct user_id and distinct product_id - used as denominator
    total_elements = (
        ratings.select("_user_id").distinct().count()
        * ratings.select("_item_id").distinct().count()
    )
    # Sparsity = share of user-item pairs with no rating
    print("The ratings DataFrame is", "%.2f" % ((1.0 - count_nonzero / total_elements) * 100) + "% sparse.")

get_mat_sparsity(df_all)
Python
Spark ML provides a convenient API for building the ALS model. However, the model isn't good enough at handling problems like data
sparsity and cold start (making recommendations when the users or items are new). To improve the performance of the model, you can
combine cross validation and auto hyperparameter tuning.
To import libraries required for training and evaluating the model, use the following code:
Python
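The import cell isn't reproduced here. A sketch of the imports, training parameters, and ALS estimator that the following cells rely on is shown next; the hyperparameter values are assumptions.

# Imports used by the tuning, evaluation, and logging code in this section
import mlflow
from mlflow.models.signature import infer_signature
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, TrainValidationSplit

# Assumed training parameters referenced later in the MLflow logging code
num_epochs = 1                                 # number of ALS iterations
rank_size_list = [64, 128]                     # candidate ranks (number of latent factors)
reg_param_list = [0.01, 0.1]                   # candidate regularization strengths
model_tuning_method = "TrainValidationSplit"   # or "CrossValidator"

# The ALS estimator that the parameter grid below refers to
als = ALS(
    maxIter=num_epochs,
    userCol="_user_id",
    itemCol="_item_id",
    ratingCol=RATING_COL,
    coldStartStrategy="drop",  # drop NaN predictions during evaluation
    nonnegative=True,
)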
Python
# Construct a grid search to select the best values for the training parameters
param_grid = (
ParamGridBuilder()
.addGrid(als.rank, rank_size_list)
.addGrid(als.regParam, reg_param_list)
.build()
)
# Define evaluator and set the loss function to root mean squared error (RMSE)
evaluator = RegressionEvaluator(
metricName="rmse", labelCol=RATING_COL, predictionCol="prediction"
)
Use the following code to initiate different model tuning methods based on the preconfigured parameters. For more information on model
tuning, see https://spark.apache.org/docs/latest/ml-tuning.html .
Python
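The cell isn't reproduced here. A minimal sketch, assuming the als estimator, parameter lists, and evaluator defined above, is:

# Build the tuner from the preconfigured tuning method, estimator, grid, and evaluator
if model_tuning_method == "CrossValidator":
    tuner = CrossValidator(
        estimator=als,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        numFolds=5,
        collectSubModels=True,
    )
elif model_tuning_method == "TrainValidationSplit":
    tuner = TrainValidationSplit(
        estimator=als,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        trainRatio=0.8,  # 80% of the data used for training
        collectSubModels=True,
    )
else:
    raise ValueError(f"Unknown model_tuning_method: {model_tuning_method}")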
Evaluate model
Models should be evaluated against the test data. A well-trained model should have high metrics on the dataset.
If the model is overfitted, you may need to increase the size of the training data or reduce some of the redundant features. You may also
need to change the model's architecture or fine-tune its hyperparameters.
7 Note
If the R-squared metric value is negative, it indicates that the trained model performs worse than a horizontal straight line, suggesting that the data isn't explained by the trained model.
Python
def evaluate(model, data, verbose=0):
    # Hedged reconstruction: the function header and metric computation aren't shown on the page;
    # this wraps the original fragment into a complete function matching how it's called later
    predictions = model.transform(data).withColumn(
        "prediction", F.col("prediction").cast("double")
    )
    if verbose > 1:
        # Show 10 predictions
        predictions.select("_user_id", "_item_id", RATING_COL, "prediction").limit(
            10
        ).show()
    # Compute the regression metrics with Spark's RegressionEvaluator
    _evaluator = RegressionEvaluator(predictionCol="prediction", labelCol=RATING_COL)
    rmse = _evaluator.setMetricName("rmse").evaluate(predictions)
    mae = _evaluator.setMetricName("mae").evaluate(predictions)
    r2 = _evaluator.setMetricName("r2").evaluate(predictions)
    var = _evaluator.setMetricName("var").evaluate(predictions)
    if verbose > 0:
        print(f"RMSE score = {rmse}")
        print(f"MAE score = {mae}")
        print(f"R2 score = {r2}")
        print(f"Explained variance = {var}")
    return predictions, (rmse, mae, r2, var)
Python
with mlflow.start_run(run_name="als"):
    # Train models
    models = tuner.fit(train)
    best_metrics = {"RMSE": 10e6, "MAE": 10e6, "R2": 0, "Explained variance": 0}
    best_index = 0
    # Evaluate models
    # Log model, metrics and parameters
    for idx, model in enumerate(models.subModels):
        with mlflow.start_run(nested=True, run_name=f"als_{idx}") as run:
            print("\nEvaluating on testing data:")
            print(f"subModel No. {idx + 1}")
            predictions, (rmse, mae, r2, var) = evaluate(model, test, verbose=1)
            signature = infer_signature(
                train.select(["_user_id", "_item_id"]),
                predictions.select(["_user_id", "_item_id", "prediction"]),
            )
            print("log model:")
            mlflow.spark.log_model(
                model,
                f"{EXPERIMENT_NAME}-alsmodel",
                signature=signature,
                registered_model_name=f"{EXPERIMENT_NAME}-alsmodel",
                dfs_tmpdir="Files/spark",
            )
            print("log metrics:")
            current_metric = {
                "RMSE": rmse,
                "MAE": mae,
                "R2": r2,
                "Explained variance": var,
            }
            mlflow.log_metrics(current_metric)
            if rmse < best_metrics["RMSE"]:
                best_metrics = current_metric
                best_index = idx
            print("log parameters:")
            mlflow.log_params(
                {
                    "subModel_idx": idx,
                    "num_epochs": num_epochs,
                    "rank_size_list": rank_size_list,
                    "reg_param_list": reg_param_list,
                    "model_tuning_method": model_tuning_method,
                    "DATA_FOLDER": DATA_FOLDER,
                }
            )
    # Log best model and related metrics and parameters to the parent run
    mlflow.spark.log_model(
        models.subModels[best_index],
        f"{EXPERIMENT_NAME}-alsmodel",
        signature=signature,
        registered_model_name=f"{EXPERIMENT_NAME}-alsmodel",
        dfs_tmpdir="Files/spark",
    )
    mlflow.log_metrics(best_metrics)
    mlflow.log_params(
        {
            "subModel_idx": idx,
            "num_epochs": num_epochs,
            "rank_size_list": rank_size_list,
            "reg_param_list": reg_param_list,
            "model_tuning_method": model_tuning_method,
            "DATA_FOLDER": DATA_FOLDER,
        }
    )
To view the logged information for the training run, select the experiment named aisample-recommendation from your workspace. If you
changed the experiment name, select the experiment with the name you specified. The logged information appears similar to the following
image:
Step 4: Load the final model for scoring and make predictions
Once the training has completed and the best model is selected, load the model for scoring (sometimes called inferencing). The following
code loads the model and uses predictions to recommend the top 10 books for each user:
Python
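The scoring cells aren't reproduced here. A hedged sketch, assuming the model was registered as shown above and that the loaded object is a one-stage pipeline wrapping the ALSModel, is:

# Assumed model version; load the registered ALS model and build top-10 recommendations per user
model_uri = f"models:/{EXPERIMENT_NAME}-alsmodel/1"
loaded_model = mlflow.spark.load_model(model_uri, dfs_tmpdir="Files/spark")
userRecs = loaded_model.stages[-1].recommendForAllUsers(10)
display(userRecs)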
Python
Next steps
Training and evaluating a text classification model
Machine learning model in Microsoft Fabric
Train machine learning models
Machine learning experiments in Microsoft Fabric
Tutorial: Create, evaluate, and score a fraud detection model
Article • 09/21/2023
In this tutorial, you walk through the Synapse Data Science in Microsoft Fabric workflow with an end-to-end example. The scenario is to build a fraud detection model by using machine learning algorithms trained on historical data, and then use the model to detect future fraudulent transactions. The steps that you take are described in the sections that follow.
) Important
Prerequisites
Get a Microsoft Fabric subscription. Or sign up for a free Microsoft Fabric (Preview) trial.
Switch to the Data Science experience by using the experience switcher icon on the left side of your home page.
If you don't have a Microsoft Fabric lakehouse, create one by following the steps in Create a lakehouse in Microsoft Fabric.
If you want to open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science to import
the tutorial notebooks to your workspace.
Or, if you'd rather copy and paste the code from this page, you can create a new notebook.
Be sure to attach a lakehouse to the notebook before you start running code.
Use the inline installation capabilities (such as %pip or %conda ) of your notebook to install libraries in your current notebook only.
Install libraries directly in your workspace, so that the libraries are available for use by all notebooks in your workspace.
For this tutorial, you install the imblearn library in your notebook by using %pip install . When you run %pip install , the PySpark kernel
restarts. So you should install the library before you run any other cells in the notebook:
Python
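A sketch of this installation step:

# Install imblearn; the PySpark kernel restarts after this command
%pip install imblearn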
The features V1 , V2 , V3 , …, V28 are the principal components obtained with PCA.
The feature Time contains the elapsed seconds between a transaction and the first transaction in the dataset.
The feature Amount is the transaction amount. You can use this feature for example-dependent, cost-sensitive learning.
The column Class is the response (target) variable and takes the value 1 for fraud and 0 otherwise.
Out of the 284,807 transactions, only 492 are fraudulent. The minority class (fraud) accounts for only about 0.172% of the data, so the
dataset is highly imbalanced.
Introduction to SMOTE
The imblearn (imbalanced learn) library uses the Synthetic Minority Oversampling Technique (SMOTE) approach to address the problem of
imbalanced classification. Imbalanced classification happens when there are too few examples of the minority class for a model to
effectively learn the decision boundary.
SMOTE is the most widely used approach to synthesize new samples for the minority class. To learn more about SMOTE, see the scikit-learn
reference page for the SMOTE method and the scikit-learn user guide on oversampling .
Tip
You can apply the SMOTE approach by using the imblearn library that you installed in Step 1.
Python
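The parameter cell isn't reproduced here. A sketch of the parameters that later cells reference follows; aside from the file name and the target column, which appear elsewhere in this tutorial, the values are assumptions.

# Assumed parameter values used by the cells that follow
IS_CUSTOM_DATA = False                 # if True, upload your own dataset
TARGET_COL = "Class"                   # target column (1 = fraud, 0 = otherwise)
IS_SAMPLE = False                      # if True, use only SAMPLE_ROWS rows for training
SAMPLE_ROWS = 5000
DATA_FOLDER = "Files/fraud-detection"  # assumed lakehouse folder for the data files
DATA_FILE = "creditcard.csv"           # data file name downloaded below
EXPERIMENT_NAME = "aisample-fraud"     # assumed MLflow experiment name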
The following code downloads a publicly available version of the dataset and then stores it in a Fabric lakehouse.
) Important
Be sure to add a lakehouse to the notebook before running it. If you don't, you'll get an error.
Python
if not IS_CUSTOM_DATA:
    # Download data files into the lakehouse if they're not already there
    import os, requests

    remote_url = "https://synapseaisolutionsa.blob.core.windows.net/public/Credit_Card_Fraud_Detection"
    fname = "creditcard.csv"
    download_path = f"/lakehouse/default/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        raise FileNotFoundError("Default lakehouse not found, please add a lakehouse and restart the session.")
    os.makedirs(download_path, exist_ok=True)
    if not os.path.exists(f"{download_path}/{fname}"):
        r = requests.get(f"{remote_url}/{fname}", timeout=30)
        with open(f"{download_path}/{fname}", "wb") as f:
            f.write(r.content)
    print("Downloaded demo data files into lakehouse.")
The Data Science experience in Microsoft Fabric includes an autologging feature. This feature reduces the amount of code required to
automatically log the parameters, metrics, and items of a machine learning model during training. The feature extends MLflow's
autologging capabilities and is deeply integrated into the Data Science experience.
By using autologging, you can easily track and compare the performance of different models and experiments without the need for manual
tracking. For more information, see Autologging in Microsoft Fabric .
You can disable Microsoft Fabric autologging in a notebook session by calling mlflow.autolog() and setting disable=True :
Python
import mlflow

mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(disable=True) # Disable MLflow autologging
Python
df = (
spark.read.format("csv")
.option("header", "true")
.option("inferSchema", True)
.load(f"{DATA_FOLDER}/raw/{DATA_FILE}")
.cache()
)
Python
display(df)
Python
Python
import pyspark.sql.functions as F

df_columns = df.columns
df_columns.remove(TARGET_COL)

if IS_SAMPLE:
    df = df.limit(SAMPLE_ROWS)
2. Convert Spark DataFrame to pandas DataFrame for easier visualization and processing:
Python
df_pd = df.toPandas()
Python
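The cell isn't shown here. A minimal sketch that prints the class distribution described below is:

# Print the percentage of nonfraudulent and fraudulent transactions
print("No Frauds", round(df_pd["Class"].value_counts()[0] / len(df_pd) * 100, 2), "% of the dataset")
print("Frauds", round(df_pd["Class"].value_counts()[1] / len(df_pd) * 100, 2), "% of the dataset")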
The code returns the following class distribution of the dataset: 99.83% No Frauds and 0.17% Frauds . This class distribution shows
that most of the transactions are nonfraudulent. Therefore, data preprocessing is required before model training, to avoid overfitting.
2. Use a plot to show the class imbalance in the dataset, by viewing the distribution of fraudulent versus nonfraudulent transactions:
Python
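The plotting cell isn't shown here. A minimal sketch is:

# Count plot of the Class column to visualize the imbalance
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x="Class", data=df_pd)
plt.title("Distribution of nonfraudulent (0) versus fraudulent (1) transactions")
plt.show()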
3. Show the five-number summary (minimum score, first quartile, median, third quartile, and maximum score) for the transaction
amount, by using box plots:
Python
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12,5))
s = sns.boxplot(ax = ax1, x="Class", y="Amount", hue="Class", data=df_pd, palette="PRGn", showfliers=True) # Keep outliers in the plot
s = sns.boxplot(ax = ax2, x="Class", y="Amount", hue="Class", data=df_pd, palette="PRGn", showfliers=False) # Remove outliers from the plot
plt.show()
When the data is highly imbalanced, these box plots might not demonstrate accurate insights. Alternatively, you can address the class
imbalance problem first and then create the same plots for more accurate insights.
Python
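The cell that splits the data isn't shown here. A minimal sketch (the split ratio is an assumption) that produces the train, test, and feature_cols objects used below is:

# Assumed split ratio; split the pandas DataFrame into training and test sets
from sklearn.model_selection import train_test_split

feature_cols = [c for c in df_pd.columns if c != TARGET_COL]
train, test = train_test_split(df_pd, test_size=0.15, random_state=42, stratify=df_pd[TARGET_COL])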
Apply SMOTE to the training data to synthesize new samples for the minority class
Apply SMOTE only to the training dataset, and not to the test dataset. When you score the model with the test data, you want an
approximation of the model's performance on unseen data in production. For this approximation to be valid, your test data needs to
represent production data as closely as possible by having the original imbalanced distribution.
Python
from collections import Counter
from imblearn.over_sampling import SMOTE

X = train[feature_cols]
y = train[TARGET_COL]
print("Original dataset shape %s" % Counter(y))

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print("Resampled dataset shape %s" % Counter(y_res))
There are several options for training machine learning models by using Apache Spark in Microsoft Fabric: Apache Spark MLlib, SynapseML,
and other open-source libraries. For more information, see Train machine learning models in Microsoft Fabric .
A machine learning experiment is the primary unit of organization and control for all related machine learning runs. A run corresponds to a
single execution of model code. Machine learning experiment tracking refers to the process of managing all the experiments and their
components, such as parameters, metrics, models, and other artifacts.
When you use experiment tracking, you can organize all the required components of a specific machine learning experiment. You can also
easily reproduce past results, by using saved experiments. For more information on machine learning experiments, see Machine learning
experiments in Microsoft Fabric .
1. Update the MLflow autologging configuration to track more metrics, parameters, and files, by setting exclusive=False :
Python
mlflow.autolog(exclusive=False)
2. Train two models by using LightGBM: one model on the imbalanced dataset and the other on the balanced dataset (via SMOTE). Then
compare the performance of the two models.
Python
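The model-definition cell isn't shown here. A minimal sketch (hyperparameters are assumptions) is:

# Define two LightGBM classifiers: one for the imbalanced data and one for the balanced data
import lightgbm as lgb

model = lgb.LGBMClassifier(objective="binary")        # trained on the imbalanced data
smote_model = lgb.LGBMClassifier(objective="binary")  # trained on the SMOTE-balanced data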
Python
# Train LightGBM for both imbalanced and balanced datasets and define the evaluation metrics
print("Start training with imbalanced data:\n")
with mlflow.start_run(run_name="raw_data") as raw_run:
    model = model.fit(
        train[feature_cols],
        train[TARGET_COL],
        eval_set=[(test[feature_cols], test[TARGET_COL])],
        eval_metric="auc",
        callbacks=[
            lgb.log_evaluation(10),
        ],
    )
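The corresponding cell for the balanced dataset isn't shown. A minimal sketch (the run name is an assumption) that produces the smote_run and trained smote_model referenced later is:

print("Start training with balanced (SMOTE) data:\n")
with mlflow.start_run(run_name="smote_data") as smote_run:
    smote_model = smote_model.fit(
        X_res,
        y_res,
        eval_set=[(test[feature_cols], test[TARGET_COL])],
        eval_metric="auc",
        callbacks=[lgb.log_evaluation(10)],
    )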
Python
with mlflow.start_run(run_id=raw_run.info.run_id):
    importance = lgb.plot_importance(
        model, title="Feature importance for imbalanced data"
    )
    importance.figure.savefig("feauture_importance.png")
    mlflow.log_figure(importance.figure, "feature_importance.png")
2. Determine feature importance for the model that you trained on balanced data (generated via SMOTE):
Python
with mlflow.start_run(run_id=smote_run.info.run_id):
    smote_importance = lgb.plot_importance(
        smote_model, title="Feature importance for balanced (via SMOTE) data"
    )
    smote_importance.figure.savefig("feauture_importance_smote.png")
    mlflow.log_figure(smote_importance.figure, "feauture_importance_smote.png")
The important features are drastically different when you train a model with the imbalanced dataset versus the balanced dataset.
Python
# Hedged reconstruction: the body of this helper isn't shown on the page; it converts model
# predictions on the pandas test set into a Spark DataFrame for evaluation with SynapseML
def prediction_to_spark(model, test):
    predictions = model.predict(test[feature_cols])
    predictions = tuple(zip(test[TARGET_COL].tolist(), predictions.tolist()))
    predictions = spark.createDataFrame(
        data=predictions, schema=[TARGET_COL, "prediction"]
    ).withColumn("prediction", F.col("prediction").cast("double"))
    return predictions
2. Use the prediction_to_spark function to perform predictions with the two models, model and smote_model :
Python
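The cell isn't reproduced here. A minimal sketch, assuming the prediction_to_spark reconstruction above, is:

# Score the test set with both trained models
predictions = prediction_to_spark(model, test)
smote_predictions = prediction_to_spark(smote_model, test)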
Python
from synapse.ml.train import ComputeModelStatistics

metrics = ComputeModelStatistics(
    evaluationMetric="classification", labelCol=TARGET_COL, scoredLabelsCol="prediction"
).transform(predictions)

smote_metrics = ComputeModelStatistics(
    evaluationMetric="classification", labelCol=TARGET_COL, scoredLabelsCol="prediction"
).transform(smote_predictions)

display(metrics)
A confusion matrix displays the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) that a model
produces when it's scored with test data. For binary classification, you get a 2x2 confusion matrix. For multiclass classification, you get an
nxn confusion matrix, where n is the number of classes.
1. Use a confusion matrix to summarize the performances of the trained machine learning models on the test data:
Python
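The cell isn't shown here. A minimal sketch that extracts the confusion matrices from the ComputeModelStatistics output (the confusion_matrix column name is an assumption) is:

# Collect the confusion matrix values for both models
cm = metrics.select("confusion_matrix").collect()[0][0].toArray()
smote_cm = smote_metrics.select("confusion_matrix").collect()[0][0].toArray()
print(cm)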
2. Plot the confusion matrix for the predictions of smote_model (trained on balanced data):
Python
# Plot the confusion matrix
import seaborn as sns

def plot(cm):
    """
    Plot the confusion matrix.
    """
    sns.set(rc={"figure.figsize": (5, 3.5)})
    ax = sns.heatmap(cm, annot=True, fmt=".20g")
    ax.set_title("Confusion Matrix")
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    return ax

with mlflow.start_run(run_id=smote_run.info.run_id):
    ax = plot(smote_cm)
    mlflow.log_figure(ax.figure, "ConfusionMatrix.png")
3. Plot the confusion matrix for the predictions of model (trained on raw, imbalanced data):
Python
with mlflow.start_run(run_id=raw_run.info.run_id):
    ax = plot(cm)
    mlflow.log_figure(ax.figure, "ConfusionMatrix.png")
The Area Under the Curve Receiver Operating Characteristic (AUC-ROC) measure is widely used to assess the performance of binary
classifiers. AUC-ROC is a chart that visualizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR).
In some cases, it's more appropriate to evaluate your classifier based on the Area Under the Precision-Recall Curve (AUPRC) measure. AUPRC is a curve that combines precision (the positive predictive value) and recall (the true positive rate).
Python
def evaluate(predictions):
    """Evaluate the model by computing AUROC and AUPRC with the predictions (hedged reconstruction of the cell body)."""
    from pyspark.ml.evaluation import BinaryClassificationEvaluator  # Spark's binary classification metrics
    evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol=TARGET_COL)
    auroc = evaluator.setMetricName("areaUnderROC").evaluate(predictions)
    auprc = evaluator.setMetricName("areaUnderPR").evaluate(predictions)
    return auroc, auprc
2. Log the AUC-ROC and AUPRC metrics for the model trained on imbalanced data:
Python
with mlflow.start_run(run_id=raw_run.info.run_id):
    auroc, auprc = evaluate(predictions)
    mlflow.log_metrics({"AUPRC": auprc, "AUROC": auroc})
    mlflow.log_params({"Data_Enhancement": "None", "DATA_FILE": DATA_FILE})
3. Log the AUC-ROC and AUPRC metrics for the model trained on balanced data:
Python
with mlflow.start_run(run_id=smote_run.info.run_id):
    auroc, auprc = evaluate(smote_predictions)
    mlflow.log_metrics({"AUPRC": auprc, "AUROC": auroc})
    mlflow.log_params({"Data_Enhancement": "SMOTE", "DATA_FILE": DATA_FILE})
The model trained on balanced data returns higher AUC-ROC and AUPRC values compared to the model trained on imbalanced data.
Based on these measures, SMOTE appears to be an effective technique for enhancing model performance when you're working with highly
imbalanced data.
As shown in the following image, any experiment is logged along with its respective name. You can track the experiment's parameters and
performance metrics in your workspace.
The following image also shows performance metrics for the model trained on the balanced dataset (in Version 2). You can select Version 1
to see the metrics for the model trained on the imbalanced dataset. When you compare the metrics, you notice that AUROC is higher for
the model trained with the balanced dataset. These results indicate that this model is better at correctly predicting 0 classes as 0 and
predicting 1 classes as 1 .