Azure Databricks Monitoring
Contents

Change Log
Azure Databricks Monitoring
    Monitoring of Spark Metrics with Azure Databricks Monitoring Library
        Prerequisites
        Configuration
        Logging custom metrics
    Monitoring of Linux virtual machines provisioned for Databricks clusters
        Configuration
    Monitoring of user activities in Databricks Workspace UI
        Prerequisites
        Configuration
        Diagnostic log schema
        Browsing diagnostic logs in Azure Monitor
        Limitations
Change Log

Change Type: Document creation
Author: Łukasz Olejniczak ([email protected])
Date: 28.05.2019
Monitoring of Spark Metrics with Azure Databricks Monitoring Library

All log categories are sent to an Azure Log Analytics workspace. The Azure Databricks Monitoring Library comes with an ARM template that creates the Log Analytics workspace together with queries which help to get insights from the raw logs.
Prerequisites
The following components need to be installed in order to build the Azure Databricks Monitoring Library from sources:
    Java Development Kit (JDK)
    Apache Maven
    Git client
Configuration
1. Use a Git client to clone the Azure Databricks Monitoring Library sources to your local machine.
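For example, assuming the sources are hosted in the mspnp/spark-monitoring GitHub repository (adjust the URL to the actual repository location):

git clone https://github.com/mspnp/spark-monitoring.git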
The GitHub repo for the Azure Databricks Monitoring Library has the following directory structure:
/perftools
/src
    /spark-jobs
    /spark-listeners-loganalytics
    /spark-listeners
    pom.xml
The src/spark-jobs directory is a sample Spark application demonstrating how to implement a Spark
application metric counter.
The src/spark-listeners directory includes functionality that enables Azure Databricks to send Apache
Spark events at the service level to an Azure Log Analytics workspace. Azure Databricks is a service based
on Apache Spark, which includes a set of structured APIs for batch processing data using Datasets,
DataFrames, and SQL. With Apache Spark 2.0, support was added for Structured Streaming, a data
stream processing API built on Spark's batch processing APIs.
The src/spark-listeners-loganalytics directory includes a sink for the Spark listeners together with a client for an Azure Log Analytics workspace. This directory also includes a Log4j appender for your Apache Spark application logs.
The spark-listeners-loganalytics and spark-listeners directories contain the code for building the two JAR
files that are deployed to the Databricks cluster. The spark-listeners directory includes a scripts directory
that contains a cluster node initialization script to copy the JAR files from a staging directory in the Azure
Databricks file system to execution nodes.
The pom.xml file is the main Maven build file for the entire project.
2. Go to the src directory and execute the following commands to start the build process:
cd src
mvn clean install
After a successful build, every project has a corresponding JAR file in its target folder:
Project Jar
spark-jobs spark-jobs/target/spark-jobs-1.0-SNAPSHOT.jar
spark-listeners spark-listeners/target/spark-listeners-1.0-SNAPSHOT.jar
spark-listeners-loganalytics spark-listeners-loganalytics/target/spark-listeners-loganalytics-1.0-SNAPSHOT.jar
5. Log into the Azure Portal and open the created Azure Log Analytics Workspace resource to get the corresponding WORKSPACE ID and PRIMARY KEY:
8. Set up authentication details for the Databricks CLI (a personal access token is required). Credentials will be stored in ~/.databrickscfg.
databricks configure --token
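The CLI prompts for the workspace URL and the personal access token, for example (placeholder values shown):

Databricks Host (should begin with https://): https://<region>.azuredatabricks.net
Token: <personal-access-token>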
13. Use the Azure Databricks CLI to copy the Azure Databricks Monitoring Library JARs spark-listeners-1.0-SNAPSHOT.jar and spark-listeners-loganalytics-1.0-SNAPSHOT.jar to dbfs:/databricks/monitoring-staging.
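A sketch of the copy commands, assuming they are run from the src directory of the build:

databricks fs mkdirs dbfs:/databricks/monitoring-staging
databricks fs cp --overwrite spark-listeners/target/spark-listeners-1.0-SNAPSHOT.jar dbfs:/databricks/monitoring-staging/
databricks fs cp --overwrite spark-listeners-loganalytics/target/spark-listeners-loganalytics-1.0-SNAPSHOT.jar dbfs:/databricks/monitoring-staging/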
14. Create a cluster from the Databricks Workspace UI. Under advanced options select "Init scripts". Under destination select DBFS and enter: dbfs:/databricks/monitoring-staging/listeners.sh
15. After you complete these steps, your Databricks cluster streams metric data about the cluster itself to Azure Monitor. This log data is available in your Azure Log Analytics workspace under the "Active | Custom Logs | SparkMetric_CL" schema.
To get the list of available Spark metrics, the following query can be used:
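A minimal query sketch, assuming the metric name is stored in the name_s column of SparkMetric_CL:

SparkMetric_CL
| summarize count() by name_s
| order by name_s asc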
HiveExternalCatalog.parallelListingJobCount
HiveExternalCatalog.partitionsFetched
CodeGenerator.compilationTime
CodeGenerator.generatedClassSize
CodeGenerator.generatedMethodSize
CodeGenerator.sourceCodeSize
shuffleService.blockTransferRateBytes
shuffleService.openBlockRequestLatencyMillis
shuffleService.registerExecutorRequestLatencyMillis
shuffleService.registeredExecutorsSize
shuffleService.shuffle-server.usedDirectMemory
shuffleService.shuffle-server.usedHeapMemory
Databricks.directoryCommit.autoVacuumCount
Databricks.directoryCommit.deletedFilesFiltered
Databricks.directoryCommit.filterListingCount
Databricks.directoryCommit.jobCommitCompleted
Databricks.directoryCommit.markerReadErrors
Databricks.directoryCommit.markerRefreshCount
Databricks.directoryCommit.markerRefreshErrors
Databricks.directoryCommit.markersRead
Databricks.directoryCommit.repeatedListCount
Databricks.directoryCommit.uncommittedFilesFiltered
Databricks.directoryCommit.untrackedFilesFound
Databricks.directoryCommit.vacuumCount
Databricks.directoryCommit.vacuumErrors
Databricks.preemption.numChecks
Databricks.preemption.numPoolsAutoExpired
Databricks.preemption.numTasksPreempted
Databricks.preemption.poolStarvationMillis
Databricks.preemption.schedulerOverheadNanos
Databricks.preemption.taskTimeWastedMillis
HiveExternalCatalog.fileCacheHits
HiveExternalCatalog.filesDiscovered
HiveExternalCatalog.hiveClientCalls
The Azure Log Analytics workspace deployed from the ARM template available in the Azure Databricks Monitoring Library sources includes the following set of predefined queries:
% CPU time per executor
% deserialize time per executor
% jvm time per executor
% serialize time per executor
Disk Bytes Spilled
Error traces
File system bytes read per executor
File system bytes write per executor
Job errors per job
Job latency per job
Job Throughput
Running Executors
Shuffle Bytes Read
Shuffle Bytes read per executor
Shuffle bytes read to disk per executor
Shuffle client direct memory
Shuffle client direct memory per executor
Shuffle disk bytes spilled to disk per executor
Shuffle heap memory per executor
Shuffle memory spilled per executor
Stage latency per stage
Stage throughput per stage
Streaming errors per stream
Streaming latency per stream
Streaming throughput inputrowssec
Streaming throughput processedrowssec
Sum Task Execution Per Host
Task Deserialization Time
Task errors per Stage
Task Executor Compute time
Task Input Bytes read
Task Latency per Stage
Task result serialization Time
Task Scheduler Delay Latency
Task Shuffle Bytes Read
Task Shuffle Bytes Written
Task Shuffle Read Time
Task Shuffle Write time
Task throughput
Tasks per executor
Tasks per stage
Logging custom metrics
In order to log custom metrics, the following needs to be added to the application code:
2. Register a custom metric (e.g. as a counter; you can also declare a gauge, histogram, meter, or timer). For more details on how to register and use the distinct types of custom metrics see:
https://github.com/groupon/spark-metrics/blob/master/src/main/scala/org/apache/spark/groupon/metrics/example/MetricsBenchmarkApp.scala
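A minimal Scala sketch based on the spark-metrics example linked above; the namespace name CustomMetricNamespace and the counter name RowsProcessedCounter are placeholders, and the exact API of the listener library shipped with the Azure Databricks Monitoring Library may differ slightly:

import org.apache.spark.sql.SparkSession
import org.apache.spark.groupon.metrics.UserMetricsSystem

object CustomMetricsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CustomMetricsExample").getOrCreate()

    // Initialize the user metrics system once, after the SparkContext exists
    UserMetricsSystem.initialize(spark.sparkContext, "CustomMetricNamespace")

    // Register a counter; gauges, histograms, meters and timers are registered the same way
    lazy val rowCounter = UserMetricsSystem.counter("RowsProcessedCounter")

    // Increment the counter for every processed row
    spark.range(0, 1000).foreach(_ => rowCounter.inc())

    spark.stop()
  }
}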
3. You can browse the Azure Log Analytics workspace used for Spark metrics to find the custom metrics:
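For example, a query sketch for the counter above, assuming metric names land in the name_s column and metric values in the value_d column of SparkMetric_CL:

SparkMetric_CL
| where name_s contains "RowsProcessedCounter"
| project TimeGenerated, name_s, value_d
| order by TimeGenerated desc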
Monitoring of Linux virtual machines provisioned for Databricks clusters
To monitor Databricks cluster VM instances, the recommended approach is to configure the Log Analytics agent. The agent for Linux communicates outbound to the Azure Monitor service over TCP port 443.
Configuration:
1. Copy the following script to a new file on your local machine (replace the WORKSPACE_ID and WORKSPACE_PRIMARY_KEY placeholders with the values of the Azure Log Analytics Workspace created for Databricks Monitoring). The script downloads the agent, validates its checksum, installs it and finally starts it.
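A minimal sketch of such a script, based on the standard Log Analytics (OMS) agent onboarding command for Linux; the original script may additionally validate the checksum of the downloaded installer:

#!/bin/bash
set -e
# Download the Log Analytics agent onboarding script and attach this VM to the workspace.
# Replace WORKSPACE_ID and WORKSPACE_PRIMARY_KEY with the values of the Azure Log
# Analytics Workspace created for Databricks Monitoring.
wget -O onboard_agent.sh https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh
sh onboard_agent.sh -w WORKSPACE_ID -s WORKSPACE_PRIMARY_KEY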
4. Select the Data option and then select the Syslog tab. Type syslog in the search panel and specify which level of information should be captured:
5. Click Save and go to the Linux Performance Counters tab. Type * in the search panel to list the available metrics and select those that are needed:
Metric category Metric name
Logical Disk % Free Inodes
Logical Disk % Free Space
Logical Disk % Used Inodes
Logical Disk % Used Space
Logical Disk Disk Read Bytes/sec
Logical Disk Disk Reads/sec
Logical Disk Disk Transfers/sec
Logical Disk Disk Write Bytes/sec
Logical Disk Disk Writes/sec
Logical Disk Free Megabytes
Logical Disk Logical Disk Bytes/sec
Memory % Available Memory
Memory % Available Swap Space
Memory % Used Memory
Memory % Used Swap Space
Memory Available MBytes Memory
Memory Available MBytes Swap
Memory Page Reads/sec
Memory Page Writes/sec
Memory Pages/sec
Memory Used MBytes Swap Space
Memory Used Memory MBytes
Network Total Bytes Transmitted
Network Total Bytes Received
Network Total Bytes
Network Total Packets Transmitted
Network Total Packets Received
Network Total Rx Errors
Network Total Tx Errors
Network Total Collisions
Physical Disk Avg. Disk sec/Read
Physical Disk Avg. Disk sec/Transfer
Physical Disk Avg. Disk sec/Write
Physical Disk Physical Disk Bytes/sec
Process Pct Privileged Time
Process Pct User Time
Process Used Memory kBytes
Process Virtual Shared Memory
Processor % DPC Time
Processor % Idle Time
Processor % Interrupt Time
Processor % IO Wait Time
Processor % Nice Time
Processor % Privileged Time
6. Save changes.
9. Open Azure Monitor, select the Log Search option, and select the Log Analytics workspace used for Databricks Monitoring.
10. Query the Perf table:
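For example, a sketch that averages the collected counters per computer over the last hour:

Perf
| where TimeGenerated > ago(1h)
| summarize avg(CounterValue) by Computer, ObjectName, CounterName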
Monitoring of user activities in Databricks Workspace UI
Databricks provides diagnostic logs of activities performed by Azure Databricks users.
Prerequisites
Diagnostic logs require Azure Databricks Premium Plan.
Configuration
The following steps are necessary to enable diagnostic logs delivery:
6. Provide the following configuration:
Select where diagnostic logs should be delivered. There are three options available:
Archive to a storage account
Stream to an event hub
Send to Log Analytics
It is possible to select all three options.
Choose which components should be monitored. The following components are
available:
Dbfs
Clusters
Accounts
Jobs
Notebook
SSH
Workspace
Secrets
sqlPermissions
It is possible to select all components.
7. Click Save.
8. Once logging is enabled for your account, Azure Databricks automatically starts sending diagnostic logs to your delivery location on a periodic basis. Logs are available within 24 to 72 hours of activation. On any given day, Azure Databricks delivers at least 99% of diagnostic logs within the first 24 hours, and the remaining 1% in no more than 72 hours.
Diagnostic log schema
The diagnostic log records use the following schema (Field: Description):

operationversion: The schema version of the diagnostic log format
time: UTC timestamp of the action
properties.sourceIPAddress: The IP address from which the request was sent
properties.userAgent: The browser or API client used to make the request
properties.sessionId: Session ID of the action
identities: Information about the user that makes the requests
category: The service that logged the request
operationName: The action, such as login, logout, read, write, etc.
properties.requestId: Unique request ID. If an action takes a long time, the request and response are logged separately, but the request and response pair have the same properties.requestId
properties.requestParams: Parameter key-value pairs used in the event
properties.response: Response to the request
properties.response.errorMessage: The error message if there was an error
properties.response.result: The result of the request
properties.response.statusCode: HTTP status code that indicates whether the request succeeded
properties.logId: The unique identifier for the log messages
Browsing diagnostic logs in Azure Monitor

3. Expand the LogManagement group in the sidebar. You should see the following groups:
DatabricksAccounts
DatabricksClusters
DatabricksDBFS
DatabricksJobs
DatabricksNotebook
DatabricksSQLPermissions
DatabricksSSH
DatabricksSecrets
DatabricksTables
DatabricksWorkspace
For example, the following query lists all events related to the Clusters component that were triggered within a defined period of time:
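A sketch of such a query (the 24-hour window is an example):

DatabricksClusters
| where TimeGenerated > ago(24h)
| order by TimeGenerated desc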
Limitations
Because diagnostic logs are delivered not immediately when event is triggered but on periodic basis so
that they are available within 24 to 72 hours, they should not be used for alerting. Instead they are a
great source of information for reporting.