Using Databricks Notebook in Talend Studio - Talend Solution Templates
Using Databricks Notebook in Talend Studio
Introduction
Data Scientists build machine learning models using notebooks that support multiple languages and built-in data visualizations. Then, with the help of
Data Engineers, these models are integrated into a data pipeline. This solution template shows you how to integrate a Databricks notebook into a data
pipeline using Talend Data Integration.
Use this document together with the DataBricks_Notebook_Execution_DemoJob.zip Talend project.
Prerequisites
Recommended training
• Talend Data Integration Basics
• Talend Data Integration Advanced
• Knowledge Asset - Context Management Common Joblet
• Knowledge of Databricks on Microsoft Azure
Technical prerequisites
• Minimum system requirements:
  OS: Windows/Linux/Mac
  CPU: Intel i7 processor, 4 cores or equivalent
  RAM: 16 GB
  SSD disk size: 500 GB
• Talend Data Integration Platform Studio version 8.0.1 (on-premises or cloud) with R2023-02 build patch applied
• Access to the Context Management Common Framework.zip and DataBricks_Notebook_Execution_DemoJob.zip projects
• Databricks cluster setup on Microsoft Azure
Copyright Talend 2023
This document is Confidential Information of Talend Inc. and may only be reproduced and/or shared with Talend’s written permission.
Contents
Getting started
Understanding the essential context variables
Job overview
  PreJob
  Getting the Databricks Job details
  Executing the Databricks notebook
  Using a loop to check the notebook execution status
  Using an API call to check the notebook execution status
  Post Job
  Reviewing the Job and logs
References
  Talend
  Databricks
Getting started
In Talend Studio, ensure that the Context Management Common Framework Jobs are available as Reference Projects, by performing the following
steps:
1. Navigate to File > Edit Project Properties > Reference Projects.
2. Select CommonFramework from the Project pull-down menu, then select the preferred Branch for the project, in this case, master.
The project is stored in the Referenced project section of the Repository.
3. Import the DataBricks_Notebook_Execution_DemoJob.zip file and extract the Job and the Contexts.
Understanding the essential context variables
The connection parameters for Databricks on Microsoft Azure are handled through context variables. These context variables use the Default
context and are grouped under the DataBricks_Notebook_Execution_Context_Parameters context group. They are loaded into the Job by the
context load common Joblet, which is added in the PreJob section of the Job.
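As a rough illustration of this idea outside Talend, the sketch below models the context group as a small Python structure. The field names (instance_url, access_token, job_id) and all sample values are assumptions made for this example; the actual variable names are the ones defined in the DataBricks_Notebook_Execution_Context_Parameters group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabricksContext:
    """Hypothetical stand-in for the context group; field names are assumed."""
    instance_url: str   # Azure Databricks workspace URL
    access_token: str   # token used for authentication
    job_id: int         # identifier of the Databricks Job that runs the notebook


def load_context(values: dict) -> DatabricksContext:
    """Build the context from key/value pairs, failing early on missing keys."""
    required = ("instance_url", "access_token", "job_id")
    missing = [key for key in required if key not in values]
    if missing:
        raise KeyError(f"missing context variables: {', '.join(missing)}")
    return DatabricksContext(
        instance_url=str(values["instance_url"]).rstrip("/"),
        access_token=str(values["access_token"]),
        job_id=int(values["job_id"]),
    )


# Placeholder values only; replace them with the values of your own workspace.
ctx = load_context({
    "instance_url": "https://example.azuredatabricks.net/",
    "access_token": "dapi-placeholder",
    "job_id": "123",
})
print(ctx.instance_url, ctx.job_id)
```

Keeping the three values in one place mirrors the purpose of the context group: the components that call Databricks read them from a single source instead of hard-coding them.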
Job overview
This section describes the overall Job flow and goes through the configuration and process of the entire Job. The Job gets the JOB_ID from the
Databricks cluster, fetches the token for authentication, and accesses the instance URL from Microsoft Azure.
The Job has four steps:
• REST API call gets the Databricks Job details
• REST API call executes the Databricks Notebook
• Loop checks the Job status
• REST API call checks the status of Job execution
PreJob
The first stage in the process is to add a tPreJob component to the Job, then load the context variables through the common framework
CommonFramework:LoadContext Joblet.
Getting the Databricks Job details
The tRESTClient_1 component calls Databricks to get the Job details. The URL, access token, and JOB_ID are passed to the component as context
parameters.
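For readers who want to see the equivalent call outside Talend, here is a minimal sketch that prepares a "get job" request against the Jobs API 2.1 listed in the References. It assumes the token is sent as a Bearer token and uses placeholder values for the workspace URL, token, and job ID; the exact settings of tRESTClient_1 in the demo Job may differ.

```python
import urllib.parse
import urllib.request


def build_get_job_request(instance_url: str, token: str, job_id: int) -> urllib.request.Request:
    """Prepare (but do not send) a Jobs API 2.1 'jobs/get' request."""
    query = urllib.parse.urlencode({"job_id": job_id})
    url = f"{instance_url.rstrip('/')}/api/2.1/jobs/get?{query}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values; none of these come from the demo project.
request = build_get_job_request("https://example.azuredatabricks.net", "dapi-placeholder", 123)
print(request.get_method(), request.full_url)
# To send it against a real workspace: urllib.request.urlopen(request)
```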
tExtractJSONFields_1 parses the response and extracts the Job’s details and settings. The tLogRow_1 component displays the details in the console.
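The parsing step can be sketched in a few lines of Python. The sample document below is an assumption that follows the general shape of a Jobs API 2.1 "get job" response; the fields that tExtractJSONFields_1 actually maps in the demo Job are not listed in this document, so the selection here is illustrative only.

```python
import json

# Illustrative response body; the values are invented for this example.
sample_response = """
{
  "job_id": 123,
  "creator_user_name": "user@example.com",
  "settings": {
    "name": "demo-notebook-job",
    "tasks": [
      {"task_key": "run_notebook",
       "notebook_task": {"notebook_path": "/Shared/demo"}}
    ]
  }
}
"""


def extract_job_details(body: str) -> dict:
    """Pull a few fields of interest out of a 'get job' style JSON document."""
    doc = json.loads(body)
    settings = doc.get("settings", {})
    notebook_paths = [
        task["notebook_task"]["notebook_path"]
        for task in settings.get("tasks", [])
        if "notebook_task" in task
    ]
    return {
        "job_id": doc.get("job_id"),
        "creator_user_name": doc.get("creator_user_name"),
        "name": settings.get("name"),
        "notebook_paths": notebook_paths,
    }


details = extract_job_details(sample_response)
for key, value in details.items():   # plays the role of tLogRow_1
    print(f"{key}: {value}")
```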
Executing the Databricks notebook
The tFixedFlowInput_1 component passes the job_id to the tWriteJSONField_1 component.
tWriteJSONField_1 converts the job_id to JSON format.
tRESTClient_2 calls Databricks to execute the notebook. The URL and access token are passed to the component as context parameters.
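The two steps in this section (wrapping the job_id in JSON, then posting it) can be sketched as follows. The sketch assumes the Jobs API 2.1 "run now" endpoint and Bearer token authentication; the URL, token, and job ID are placeholders, and the request is only built, not sent.

```python
import json
import urllib.request


def build_run_now_request(instance_url: str, token: str, job_id: int) -> urllib.request.Request:
    """Prepare (but do not send) a Jobs API 2.1 'jobs/run-now' request."""
    # Equivalent of tFixedFlowInput_1 + tWriteJSONField_1: job_id as a JSON body.
    payload = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        f"{instance_url.rstrip('/')}/api/2.1/jobs/run-now",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; none of these come from the demo project.
run_request = build_run_now_request("https://example.azuredatabricks.net", "dapi-placeholder", 123)
print(run_request.get_method(), run_request.full_url)
print(run_request.data.decode("utf-8"))
```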
tExtractJSONFields_2 parses the response and extracts the Job’s details and settings. The tLogRow_2 component displays the details in the console.
The tSetGlobalVar_1 component stores “run_id” as a global variable so that later components can use it to check the status of the Job.
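A minimal sketch of this hand-off, assuming the "run now" response carries a run_id field as in the Jobs API 2.1. The dictionary named global_map is only a stand-in for the Talend global variables, and the sample response value is invented.

```python
import json

# Stand-in for the Talend global variable store (assumption for this sketch).
global_map: dict = {}

# Illustrative 'run now' response body; the number is invented.
run_now_response = '{"run_id": 4567, "number_in_job": 4567}'


def store_run_id(body: str, store: dict) -> int:
    """Extract run_id from the response and keep it for later status checks."""
    run_id = json.loads(body)["run_id"]
    store["run_id"] = run_id
    return run_id


store_run_id(run_now_response, global_map)
print(global_map)
```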
Using a loop to check the notebook execution status
tLoop_1 repeats until the notebook execution has finished. The tSleep_1 component pauses for 30 seconds between status checks.
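The polling pattern can be sketched as below. The 30-second interval comes from the text above; the set of terminal states, the max_checks safety limit, and the function names are assumptions added for this example. The status lookup and the sleep function are passed in so that the loop can be tried without a Databricks workspace.

```python
import time
from typing import Callable

# Life cycle states treated as "finished" in this sketch (assumption).
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_until_finished(
    fetch_state: Callable[[], str],
    interval_seconds: float = 30.0,
    max_checks: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_state() until it reports a terminal state, sleeping in between."""
    for _ in range(max_checks):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(interval_seconds)
    raise TimeoutError(f"run still not finished after {max_checks} checks")


# Dry run with canned states and a recording sleep instead of real waiting.
states = iter(["PENDING", "RUNNING", "TERMINATED"])
naps: list = []
final = wait_until_finished(lambda: next(states), sleep=naps.append)
print(final, naps)
```

The upper bound on the number of checks is a design choice of this sketch: without it, a run that never reaches a terminal state would keep the loop alive indefinitely.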
Using an API call to check the notebook execution status
The tRESTClient_3 component calls Databricks to check the status of the run. The URL and access token are passed to the component as context parameters.
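A status request of this kind might look like the following sketch, which assumes the Jobs API 2.1 "get run" endpoint, Bearer token authentication, and a run_id obtained from the earlier execution step. All values are placeholders and the request is only built, not sent.

```python
import urllib.parse
import urllib.request


def build_get_run_request(instance_url: str, token: str, run_id: int) -> urllib.request.Request:
    """Prepare (but do not send) a Jobs API 2.1 'jobs/runs/get' request."""
    query = urllib.parse.urlencode({"run_id": run_id})
    return urllib.request.Request(
        f"{instance_url.rstrip('/')}/api/2.1/jobs/runs/get?{query}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values; none of these come from the demo project.
status_request = build_get_run_request("https://example.azuredatabricks.net", "dapi-placeholder", 4567)
print(status_request.get_method(), status_request.full_url)
```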
tExtractJSONFields_3 parses the response and extracts the status details for the Job. tLogRow_3 displays the details in the console.
tSetGlobalVar_2 stores “life_cycle_state” as a global variable, which the tLoop component uses to check the Job’s status. It also stores
“result_state” as a global variable, which the tWarn component uses to display the final status message in the logs.
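The extraction of the two state fields can be sketched as follows. The sample bodies assume the Jobs API 2.1 response layout, in which both fields sit under a state object and result_state is absent while the run is still in progress; the dictionary is only a stand-in for the Talend global variables.

```python
import json

# Illustrative 'get run' response bodies (assumed shape, invented values).
running_body = '{"run_id": 4567, "state": {"life_cycle_state": "RUNNING"}}'
finished_body = (
    '{"run_id": 4567, "state": '
    '{"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}}'
)


def store_run_state(body: str, store: dict) -> dict:
    """Copy life_cycle_state and result_state from the response into the store."""
    state = json.loads(body).get("state", {})
    store["life_cycle_state"] = state.get("life_cycle_state")
    store["result_state"] = state.get("result_state")  # None until the run ends
    return store


print(store_run_state(running_body, {}))
print(store_run_state(finished_body, {}))
```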
Post Job
After the data processing is complete, control of the Job transfers to the PostJob_1 section of the Talend Job. The tWarn_1 component displays the final
Job status message, which is based on the “result_state” global variable.
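How a final message could be derived from the result state is sketched below. The message wording is invented for this example, and the sketch does not reproduce the return codes or the exact text configured in tWarn_1.

```python
from typing import Optional


def final_status_message(result_state: Optional[str]) -> str:
    """Turn the stored result_state into a human-readable log message."""
    if result_state is None:
        return "Databricks notebook run ended without a result state."
    if result_state == "SUCCESS":
        return "Databricks notebook run finished successfully."
    return f"Databricks notebook run did not succeed (result_state={result_state})."


for state in ("SUCCESS", "FAILED", None):
    print(final_status_message(state))
```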
Reviewing the Job and logs
After the Job has executed the Databricks notebook, the Job design should look like this:
Job logs
References
Talend
• Best Practice: Conventions for Return Codes (Use these codes in the tDie and tWarn components.)
Databricks
• Introduction to Databricks notebooks
• Create, run, and manage Databricks Jobs
• 02_1_Cloud_Trail_Ingest (sample notebook for ingestion using Python and SQL)
• 05_Impact_Analysis (sample notebook using SQL queries)
• Jobs API 2.1 (used in current Job)