DE Mod 1: Get Started with Databricks Data Science and Engineering Workspace

This document provides an overview of the Databricks platform and workspace. It describes [1] the core components of the Databricks Lakehouse platform, [2] how to navigate the Databricks user interface, [3] how to create and manage clusters, and [4] how to develop code using Databricks notebooks.


Get Started with

Databricks Data
Science & Engineering
Workspace

Module 01

©2023 Databricks Inc. — All rights reserved


Module Objectives
Get Started with Databricks Data Science and Engineering Workspace

1. Describe the core components of the Databricks Lakehouse platform.
2. Navigate the Databricks Data Science & Engineering Workspace UI.
3. Create and manage clusters using the Databricks Clusters UI.
4. Develop and run code in multi-cell Databricks notebooks using basic operations.
5. Integrate git support using Databricks Repos.



Module Overview
Get Started with Databricks Data Science and Engineering Workspace

Databricks Workspace and Services
- Navigate the Workspace UI
Compute Resources
- DE 1.1 - Create and Manage Interactive Clusters
Develop Code with Notebooks & Databricks Repos
- DE 1.2 - Databricks Notebook Operations
- DE 1.3L - Get Started with the Databricks Platform Lab



Databricks Workspace
and Services



Databricks Workspace and Services
Control Plane (hosted by Databricks):
- Web App
- Cluster Manager
- Workflow Manager
- Unity Catalog (Metastore, Access Control, Lineage/Explorer)
- Notebooks, Repos, DBSQL

Data Plane (in the customer's cloud account):
- Workspace clusters
- SQL Warehouses
- Jobs
- Data in Cloud Storage


Demo: Navigate the
Workspace UI



Compute Resources



Clusters
Overview

- A cluster is a collection of VM instances: a driver and, for multi-node clusters, one or more workers.
- The driver distributes workloads across the workers.
- Workloads such as notebooks, jobs, and pipelines run on the cluster.
- Two main types:
  1. All-purpose clusters for interactive development
  2. Job clusters for automating workloads


Cluster Types

All-purpose Clusters
- Analyze data collaboratively using interactive notebooks
- Created from the Workspace or via the API
- Configuration information retained for up to 70 clusters for up to 30 days

Job Clusters
- Run automated jobs
- The Databricks job scheduler creates a job cluster when it runs a job
- Configuration information retained for the 30 most recently terminated clusters



Cluster Configuration



Cluster Mode

Standard (Multi Node)
- Default mode for workloads developed in any supported language (requires at least two VM instances)

Single Node
- Low-cost single-instance cluster for single-node machine learning workloads and lightweight exploratory analysis
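In the Clusters API, Single Node mode is expressed through Spark configuration rather than a dedicated flag. A minimal sketch of the two payload shapes, with placeholder node type and runtime version:

```python
# Sketch: Clusters API payloads for the two cluster modes.
# node_type_id and spark_version are illustrative placeholders.

def standard_cluster(name: str, workers: int = 2) -> dict:
    """Standard (multi-node): a driver plus one or more workers."""
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": workers,
    }

def single_node_cluster(name: str) -> dict:
    """Single Node: zero workers; Spark runs locally on the driver."""
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 0,
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }
```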



Databricks Runtime Version
Standard
- Apache Spark plus many other components and updates that provide an optimized big data analytics experience

Machine learning
- Adds popular machine learning libraries such as TensorFlow, Keras, PyTorch, and XGBoost

Photon
- An optional add-on that optimizes SQL workloads


Access Mode
- Single user: always visible in the access mode dropdown; Unity Catalog supported; languages: Python, SQL, Scala, R
- Shared: always visible (Premium plan required); Unity Catalog supported; languages: Python (DBR 11.1+), SQL
- No isolation shared: can be hidden by enforcing user isolation in the admin console or by configuring account-level settings; no Unity Catalog support; languages: Python, SQL, Scala, R
- Custom: shown only for existing clusters without access modes (i.e. legacy cluster modes, Standard or High Concurrency); not an option when creating new clusters; no Unity Catalog support; languages: Python, SQL, Scala, R



Cluster Policies

Cluster policies can help to achieve the following:


• Standardize cluster configurations
• Provide predefined configurations targeting specific use cases
• Simplify the user experience
• Prevent excessive use and control cost
• Enforce correct tagging
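A policy is a set of rules over cluster attributes, submitted to the workspace as a JSON definition. A minimal sketch that pins the runtime, caps cluster size and auto-termination, and forces a cost-attribution tag; the specific limits and tag values here are arbitrary examples, not recommendations:

```python
import json

# Sketch: a cluster policy definition. Attribute paths follow the Clusters
# API; the chosen limits and tag values are illustrative examples only.
policy_definition = {
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12"},
    "num_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "range", "maxValue": 120, "defaultValue": 60},
    "custom_tags.team": {"type": "fixed", "value": "data-eng"},
}

# Policies are submitted with the definition serialized as a JSON string.
payload = {"name": "small-interactive", "definition": json.dumps(policy_definition)}
```

A user creating a cluster under this policy sees the fixed fields locked, the ranged fields constrained, and the tag applied automatically.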



Cluster Access Control
Permission levels: No Permissions, Can Attach To, Can Restart, Can Manage (No Permissions grants none of the abilities below).
- Attach notebook: Can Attach To, Can Restart, Can Manage
- View Spark UI, cluster metrics, driver logs: Can Attach To, Can Restart, Can Manage
- Start, restart, terminate: Can Restart, Can Manage
- Edit: Can Manage
- Attach library: Can Manage
- Resize: Can Manage
- Change permissions: Can Manage



DE 1.1: Create and Manage
Interactive Clusters
Use the Clusters UI to configure and deploy a cluster
Edit, terminate, restart, and delete clusters



Develop Code with
Notebooks



Databricks Notebooks
Collaborative, reproducible, and enterprise ready

Multi-language: use Python, SQL, Scala, and R, all in one notebook

Collaborative: real-time co-presence, co-editing, and commenting

Ideal for exploration: explore, visualize, and summarize data with built-in charts and data profiles

Adaptable: install standard libraries and use local modules

Reproducible: automatically track version history, and use git version control with Repos

Get to production faster: quickly schedule notebooks as jobs or create dashboards from their results, all in the notebook

Enterprise-ready: enterprise-grade access controls, identity management, and auditability


Notebook magic commands
Use magic commands to override a notebook's default language, run utilities and auxiliary commands, and more.

%python, %r, %scala, %sql: switch the language of a command cell
%sh: run shell code (runs only on the driver node, not on worker nodes)
%fs: shortcut for dbutils filesystem commands
%md: render the cell as Markdown
%run: run another notebook from within the current notebook
%pip: install Python libraries into the current notebook environment
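As an illustration, a notebook whose default language is Python might mix cells like this (a sketch of cell contents, runnable only inside a Databricks notebook; the table name is a placeholder):

```
# Cell 1: runs as Python, the notebook's default language
df = spark.table("samples.trips")

# Cell 2: %sql overrides the default language for this cell only
%sql
SELECT count(*) FROM samples.trips

# Cell 3: %md renders the cell as formatted text
%md
## Trip counts

# Cell 4: %pip installs a library into the notebook environment
%pip install pandas
```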



dbutils (Databricks Utilities)
Perform various tasks with Databricks using notebooks
- fs: manipulates the Databricks filesystem (DBFS) from the console (e.g. dbutils.fs.ls())
- secrets: utilities for leveraging secrets within notebooks (e.g. dbutils.secrets.get())
- notebook: utilities for the control flow of a notebook (e.g. dbutils.notebook.run())
- widgets: methods to create input widgets inside notebooks and get their bound values (e.g. dbutils.widgets.text())
- jobs: utilities for leveraging jobs features (e.g. dbutils.jobs.taskValues.set())

Available within Python, R, or Scala notebooks



Git Versioning
with Databricks
Repos



Databricks Repos

Git Versioning
- Native integration with GitHub, GitLab, Bitbucket, and Azure DevOps
- UI-based workflows

CI/CD Integration
- API surface to integrate with automation
- Simplifies the dev/staging/prod multi-workspace story

Enterprise ready
- Allow lists to avoid exfiltration
- Secret detection to avoid leaking keys


Databricks Repos
CI/CD Integration

(Diagram) The Databricks control plane, which manages customer accounts, datasets, and clusters, hosts the web application, Repos and Jobs, cluster management, notebooks, and the Repos service; it integrates with external Git and CI/CD systems for version control, review, and testing.


CI/CD workflows with Git and Repos
Admin workflow (in Databricks)
- Set up top-level Repos folders (example: Production)
- Set up Git automation to update Repos on merge

User workflow (in Databricks)
- Clone the remote repository to a user folder
- Create a new branch based on the main branch
- Create and edit code
- Commit and push to a feature branch

Merge workflow (in your Git provider)
- Pull request and review process
- Merge into the main branch
- Git automation calls the Databricks Repos API

Production job workflow (in Databricks)
- An API call brings the Repo in the Production folder to the latest version
- Run the Databricks job based on the Repo in the Production folder
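The "Git automation calls the Databricks Repos API" step boils down to a single PATCH request that checks the Production repo out at the head of the merged branch. A sketch of building that request; the workspace host, token, and repo ID are placeholders, and actually sending it requires a real workspace:

```python
# Sketch: the Repos API call a CI system makes after a merge to main.
# The workspace host, bearer token, and repo_id below are placeholders.

def build_repo_update_request(host: str, repo_id: int, branch: str) -> dict:
    """PATCH /api/2.0/repos/{repo_id} checks the repo out at the branch head."""
    return {
        "method": "PATCH",
        "url": f"{host}/api/2.0/repos/{repo_id}",
        "headers": {"Authorization": "Bearer <token>"},
        "json": {"branch": branch},
    }

req = build_repo_update_request("https://example.cloud.databricks.com", 42, "main")
# With a real token filled in, this could be sent via requests.request(**req).
```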



DE 1.2: Databricks
Notebook Operations
Attach a notebook to a cluster and execute cells in the notebook
Set the default language for a notebook
Describe and use magic commands
Create and run SQL, Python, and Markdown cells
Export a single notebook or a collection of notebooks



DE 1.3L: Get Started with
the Databricks Platform
Rename a notebook and change the default language
Attach a cluster
Use the %run magic command
Run Python and SQL cells
Create a Markdown cell


