Using Databricks Notebook in Talend Studio - Talend Solution Templates

Using Databricks Notebook in Talend Studio


Introduction
Data Scientists build machine learning models using notebooks that support multiple languages and built-in data visualizations. Then, with the help of
Data Engineers, these models are integrated into a data pipeline. This solution template shows you how to integrate a Databricks notebook into a data
pipeline using Talend Data Integration.

Use this template in conjunction with the DataBricks_Notebook_Execution_DemoJob.zip Talend project.

Prerequisites
Recommended training
• Talend Data Integration Basics
• Talend Data Integration Advanced
• Knowledge Asset - Context Management Common Joblet
• Knowledge of Databricks on Microsoft Azure

Technical prerequisites
• Minimum system requirements:

  OS                  CPU                                         RAM     SSD Disk Size
  Windows/Linux/Mac   Intel i7 processor, 4 cores, or equivalent  16 GB   500 GB

• Talend Data Integration Platform Studio version 8.0.1 (on-premises or cloud) with R2023-02 build patch applied
• Access to the Context Management Common Framework.zip and DataBricks_Notebook_Execution_DemoJob.zip projects
• Databricks cluster setup on Microsoft Azure

Copyright Talend 2023 1


This document is Confidential Information of Talend Inc. and may only be reproduced and/or shared with Talend’s written permission.

Contents
Getting started
Understanding the essential context variables
Job overview
  PreJob
  Getting the Databricks Job details
  Executing the Databricks notebook
  Using a loop to check the notebook execution status
  Using an API call to check the notebook execution status
  Post Job
  Reviewing the Job and logs
References
  Talend
  Databricks


Getting started
In Talend Studio, ensure that the Context Management Common Framework Jobs are available as Reference Projects by performing the following steps:
1. Navigate to File > Edit Project Properties > Reference Projects.
2. Select CommonFramework from the Project pull-down menu, then select the preferred Branch for the project, in this case, master.

The project is stored in the Referenced project section of the Repository.


3. Import the DataBricks_Notebook_Execution_DemoJob.zip file and extract the Job and the Contexts.

Understanding the essential context variables


The connection parameters for Databricks on Microsoft Azure are handled through context variables. These context variables take their values from the
Default context and are grouped under the DataBricks_Notebook_Execution_Context_Parameters context group. They are loaded into the Job by the
context load common Joblet, which is added in the PreJob section of the Job.
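
Outside Talend, the same idea can be pictured as a small configuration object that is validated once before any API call is made. The following is a minimal Python sketch for illustration only; the field names (instance_url, access_token, job_id) and the load_context helper are hypothetical and are not the variable names defined in the context group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabricksContext:
    """Connection parameters, analogous to the context variables (names are hypothetical)."""
    instance_url: str   # workspace URL of Databricks on Microsoft Azure
    access_token: str   # token used for authentication
    job_id: int         # ID of the Databricks Job that runs the notebook


def load_context(values: dict) -> DatabricksContext:
    """Build the context from key/value pairs and fail early if one is missing."""
    required = ("instance_url", "access_token", "job_id")
    missing = [key for key in required if not values.get(key)]
    if missing:
        raise ValueError(f"Missing context variables: {', '.join(missing)}")
    return DatabricksContext(
        instance_url=str(values["instance_url"]).rstrip("/"),
        access_token=str(values["access_token"]),
        job_id=int(values["job_id"]),
    )
```

Validating the values up front mirrors the purpose of loading the context in the PreJob section: later steps can assume that the URL, token, and Job ID are present.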


Job overview
This section describes the overall Job flow and goes through the configuration and process of the entire Job. The Job gets the JOB_ID from the
Databricks cluster, fetches the token for authentication, and accesses the instance URL from Microsoft Azure.

The Job has four steps:


• REST API call gets the Databricks Job details
• REST API call executes the Databricks Notebook
• Loop checks the Job status
• REST API call checks the status of Job execution


PreJob
The first stage in the process is to add a tPreJob component to the Job, then load the context variables through the common framework
CommonFramework:LoadContext Joblet.


Getting the Databricks Job details


The tRESTClient_1 component calls the Databricks REST API to get the Job details. The URL, access token, and JOB_ID are passed to the component as
context parameters.
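
For readers who want to see what such a call looks like outside Talend, here is a minimal Python sketch that prepares the request without sending it. The references name Jobs API 2.1, whose jobs/get endpoint and bearer-token header are used here; that the demo Job calls exactly this path is an assumption, and the function name and sample values are hypothetical.

```python
import urllib.parse
import urllib.request


def build_get_job_request(instance_url: str, access_token: str, job_id: int) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for the details of one Job."""
    query = urllib.parse.urlencode({"job_id": job_id})
    # Assumed endpoint: Jobs API 2.1 "get a single job".
    url = f"{instance_url.rstrip('/')}/api/2.1/jobs/get?{query}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )
```

The prepared request could then be sent with urllib.request.urlopen and the response body read as JSON text.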


tExtractJSONFields_1 parses the response and extracts the Job’s details and settings. The tLogRow_1 component displays the details in the console.
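
The parsing step can be sketched in Python as follows. The source does not say which fields the demo Job extracts, so the three fields below (job_id, creator_user_name, and the name under settings) are an assumption based on the shape of a Jobs API 2.1 response; the function name is hypothetical.

```python
import json


def extract_job_details(response_body: str) -> dict:
    """Pick a few fields out of a jobs/get-style JSON response."""
    payload = json.loads(response_body)
    settings = payload.get("settings", {})
    return {
        "job_id": payload.get("job_id"),
        "creator_user_name": payload.get("creator_user_name"),
        "name": settings.get("name"),  # Job name lives under "settings"
    }


if __name__ == "__main__":
    # Hypothetical sample response, for illustration only.
    sample = '{"job_id": 42, "creator_user_name": "user@example.com", "settings": {"name": "demo"}}'
    print(extract_job_details(sample))  # plays the role of the console output
```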


Executing the Databricks notebook


The tFixedFlowInput_1 component passes the job_id to the tWriteJSONField_1 component.

tWriteJSONField_1 converts the job_id to JSON format.
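
As a rough equivalent in Python, converting the job_id to JSON amounts to serializing a one-field object. This is a sketch under the assumption that the request body contains only the job_id; the function name is hypothetical.

```python
import json


def build_run_now_body(job_id: int) -> str:
    """Serialize the job_id as a JSON object suitable for a request body."""
    return json.dumps({"job_id": job_id})
```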


tRESTClient_2 calls the Databricks REST API to execute the notebook. The Databricks notebook URL and access token are passed to the component as context parameters.
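
A comparable request can be prepared in Python as shown below. This sketch assumes the Jobs API 2.1 run-now endpoint, which triggers a run of an existing Job; the source confirms only that Jobs API 2.1 is used, so the exact path, the function name, and the sample values are assumptions.

```python
import json
import urllib.request


def build_run_now_request(instance_url: str, access_token: str, job_id: int) -> urllib.request.Request:
    """Prepare (but do not send) a POST request that triggers a Job run."""
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        f"{instance_url.rstrip('/')}/api/2.1/jobs/run-now",  # assumed endpoint
        data=body,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```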


tExtractJSONFields_2 parses the response and extracts the Job’s details and settings. The tLogRow_2 component displays the details in the console.

The tSetGlobalVar_1 component stores “run_id” as a global variable so that later steps can use it to check the status of the Job.
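
The hand-off of run_id to later steps can be sketched in Python with a plain dictionary standing in for the Job's global variables. The dictionary, the function name, and the sample response are hypothetical; only the run_id key comes from the text above.

```python
import json


def store_run_id(response_body: str, global_map: dict) -> int:
    """Read run_id from the response and keep it for later status checks."""
    payload = json.loads(response_body)
    if "run_id" not in payload:
        raise KeyError("run_id not found in response")
    global_map["run_id"] = payload["run_id"]
    return payload["run_id"]


if __name__ == "__main__":
    shared = {}  # stand-in for the Job's global variables
    store_run_id('{"run_id": 1001}', shared)  # hypothetical sample response
    print(shared)
```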


Using a loop to check the notebook execution status


tLoop_1 repeats until the notebook execution has finished. The tSleep_1 component pauses for 30 seconds between status checks.
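
The polling pattern can be sketched in Python as follows. Several points are assumptions rather than details from the source: the set of life-cycle states treated as finished, the upper bound on the number of checks (added so the sketch cannot loop forever), and the order of checking before sleeping. The function and parameter names are hypothetical.

```python
import time
from typing import Callable

# Assumed life-cycle states that mean the run is over.
FINISHED_STATES = frozenset({"TERMINATED", "SKIPPED", "INTERNAL_ERROR"})


def wait_until_finished(
    get_state: Callable[[], str],
    interval_seconds: float = 30.0,
    max_checks: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it reports a finished state, sleeping in between."""
    for _ in range(max_checks):
        state = get_state()
        if state in FINISHED_STATES:
            return state
        sleep(interval_seconds)  # 30 seconds by default, as in the Job
    raise TimeoutError(f"Run not finished after {max_checks} checks")
```

Passing the status lookup and the sleep function as parameters keeps the loop independent of the REST call, much as the Job separates the loop from the API call that checks the status.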


Using an API call to check the notebook execution status


The tRESTClient_3 component calls the Databricks REST API to check the status of the notebook execution. The URL and access token are passed to the component as context parameters.
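
A status request of this kind might look like the following Python sketch, which prepares the call without sending it. It assumes the Jobs API 2.1 runs/get endpoint, queried with the run_id obtained when the run was triggered; the exact path used by the demo Job is not stated in the source, and the function name and sample values are hypothetical.

```python
import urllib.parse
import urllib.request


def build_run_status_request(instance_url: str, access_token: str, run_id: int) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for the state of one run."""
    query = urllib.parse.urlencode({"run_id": run_id})
    url = f"{instance_url.rstrip('/')}/api/2.1/jobs/runs/get?{query}"  # assumed endpoint
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )
```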


tExtractJSONFields_3 parses the response and extracts the status details for the Job. tLogRow_3 displays the details in the console.


tSetGlobalVar_2 sets “life_cycle_state” as a global variable, which tLoop uses to check the Job’s status. tSetGlobalVar_2 also sets
“result_state” as a global variable, which the tWarn component uses to display the final status message in the logs.
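
In Python terms, this step reads two fields from the status response and stores them for later use. The sketch below assumes that both fields sit under a "state" object, as in a Jobs API 2.1 run description, and uses a plain dictionary as a stand-in for the global variables; the function name is hypothetical.

```python
import json


def store_run_state(response_body: str, global_map: dict) -> dict:
    """Copy life_cycle_state and result_state from the response into global_map."""
    state = json.loads(response_body).get("state", {})
    global_map["life_cycle_state"] = state.get("life_cycle_state")
    # result_state may be absent while the run is still in progress.
    global_map["result_state"] = state.get("result_state")
    return global_map
```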


Post Job
After the data processing is complete, control of the Job transfers to the PostJob_1 section of the Talend Job. The tWarn_1 component displays the final
Job status message. This uses the “result_state” global variable.
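
How a final message could be derived from result_state is sketched below in Python. The message wording is hypothetical and is not the text configured in tWarn_1; treating SUCCESS as the only successful value is an assumption.

```python
from typing import Optional


def final_status_message(result_state: Optional[str]) -> str:
    """Turn the stored result_state into a human-readable log message."""
    if result_state == "SUCCESS":
        return "Databricks notebook run finished successfully."
    if result_state is None:
        return "Databricks notebook run ended without a result state."
    return f"Databricks notebook run did not succeed (result_state={result_state})."
```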


Reviewing the Job and logs


After the Job has executed the Databricks notebook, the Job should look like this:


Job logs


References
Talend
• Best Practice: Conventions for Return Codes (Use these codes in the tDie and tWarn components.)

Databricks
• Introduction to Databricks notebooks
• Create, run, and manage Databricks Jobs
• 02_1_Cloud_Trail_Ingest (sample notebook for ingestion using Python and SQL)
• 05_Impact_Analysis (sample notebook using SQL queries)
• Jobs API 2.1 (used in current Job)
