
Databricks

Databricks-Certified-Associate-Data-Engineer
Databricks Certified Data Engineer Associate
QUESTION & ANSWERS

QUESTION 1

You were asked to create a table that can store the data below. Note that orderDate is the truncated date of orderTime. Fill in the blank to complete the DDL.

CREATE TABLE orders (
    orderId int,
    orderTime timestamp,
    orderDate date _____________________________________________ ,
    units int)
A. AS DEFAULT (CAST(orderTime as DATE))
B. GENERATED ALWAYS AS (CAST(orderTime as DATE))
C. GENERATED DEFAULT AS (CAST(orderTime as DATE))
D. AS (CAST(orderTime as DATE))
E. Delta Lake does not support calculated columns; values should be inserted into the table as part of the ingestion process

Correct Answer: B

Explanation/Reference:

The answer is, GENERATED ALWAYS AS (CAST(orderTime as DATE))


https://docs.microsoft.com/en-us/azure/databricks/delta/delta-batch#--use-generated-columns
Delta Lake supports generated columns, which are a special type of column whose values are automatically generated based on a user-specified function over other columns in the Delta table. When you write to a table with generated columns and you do not explicitly provide values for them, Delta Lake automatically computes the values.
Note: Databricks also supports partitioning using generated columns.
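For reference, a minimal sketch of the completed DDL using a generated column, mirroring the schema given in the question:

CREATE TABLE orders (
    orderId int,
    orderTime timestamp,
    orderDate date GENERATED ALWAYS AS (CAST(orderTime AS DATE)),
    units int);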

QUESTION 2

Which of the following statements is incorrect about the lakehouse?

A. Supports end-to-end streaming and batch workloads
B. Supports ACID
C. Supports diverse data types and can store both structured and unstructured data
D. Supports BI and machine learning
E. Storage is coupled with compute

Correct Answer: E

Explanation/Reference:

The answer is, Storage is coupled with Compute.


The question asks for the incorrect option; in a lakehouse, storage is decoupled from compute so both can scale independently.
What Is a Lakehouse? - The Databricks Blog

QUESTION 3

Which two of the following options are supported by Auto Loader for identifying the arrival of new files and incremental data in cloud object storage?

A. Directory listing, File notification
B. Checkpointing, watermarking
C. Write-ahead logging, read-ahead logging
D. File hashing, Dynamic file lookup
E. Checkpointing and write-ahead logging

Correct Answer: A

Explanation/Reference:

The answer is A, Directory listing, File notification.


Directory listing: Auto Loader identifies new files by listing the input directory.
File notification: Auto Loader can automatically set up a notification service and queue service that
subscribe to file events from the input directory.
Choosing between file notification and directory listing modes | Databricks on AWS

QUESTION 4

You notice a job cluster is taking 6 to 8 minutes to start, which is delaying your job from finishing on time. What steps can you take to reduce the cluster startup time?

A. Set up a second job ahead of the first job to start the cluster, so the cluster is ready with resources when the job starts
B. Use an all-purpose cluster instead to reduce cluster startup time
C. Reduce the size of the cluster; the smaller the cluster, the less time it takes to start
D. Use cluster pools to reduce the startup time of the jobs
E. Use SQL endpoints to reduce the startup time

Correct Answer: D

Explanation/Reference:

The answer is, Use cluster pools to reduce the startup time of the jobs.
Cluster pools allow us to reserve VMs ahead of time; when a new job cluster is created, VMs are grabbed from the pool. Note: while the VMs sit idle in the pool, the only cost incurred is the cloud provider's VM cost (for example, Azure); the Databricks runtime cost is only billed once a VM is allocated to a cluster.
Here is a demo of how to set up a pool and follow some best practices:
https://www.youtube.com/watch?v=FVtITxOabxg&ab_channel=DatabricksAcademy

QUESTION 5

You are working on a marketing team request to identify customers with the same information between two tables, CUSTOMERS_2021 and CUSTOMERS_2020. Each table contains 25 columns with the same schema. You are looking to identify the rows that match between the two tables across all columns. Which of the following can be used to perform this in SQL?

A. SELECT * FROM CUSTOMERS_2021 UNION SELECT * FROM CUSTOMERS_2020
B. SELECT * FROM CUSTOMERS_2021 UNION ALL SELECT * FROM CUSTOMERS_2020
C. SELECT * FROM CUSTOMERS_2021 C1 INNER JOIN CUSTOMERS_2020 C2 ON C1.CUSTOMER_ID = C2.CUSTOMER_ID
D. SELECT * FROM CUSTOMERS_2021 INTERSECT SELECT * FROM CUSTOMERS_2020
E. SELECT * FROM CUSTOMERS_2021 EXCEPT SELECT * FROM CUSTOMERS_2020

Correct Answer: D

Explanation/Reference:

The answer is
SELECT * FROM CUSTOMERS_2021
INTERSECT
SELECT * FROM CUSTOMERS_2020

INTERSECT [ALL | DISTINCT]
Returns the set of rows which are in both subqueries.
If ALL is specified, a row that appears multiple times in subquery1 as well as in subquery2 will be returned multiple times.
If DISTINCT is specified, the result does not contain duplicate rows. This is the default.
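A quick illustration of the ALL vs. DISTINCT behavior, using literal rows for brevity:

SELECT * FROM VALUES (1), (1), (2) AS t1(id)
INTERSECT ALL
SELECT * FROM VALUES (1), (1), (3) AS t2(id);
-- returns 1 twice

SELECT * FROM VALUES (1), (1), (2) AS t1(id)
INTERSECT
SELECT * FROM VALUES (1), (1), (3) AS t2(id);
-- returns 1 once (DISTINCT is the default)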

QUESTION 6

Kevin is the owner of the schema sales. Steve wanted to create a new table in the sales schema called regional_sales, so Kevin grants the CREATE TABLE permission to Steve. Steve creates the new table called regional_sales in the sales schema. Who is the owner of the table regional_sales?
A. Kevin is the owner of the sales schema; all the tables in the schema will be owned by Kevin
B. Steve is the owner of the table
C. By default, ownership is assigned to DBO
D. By default, ownership is assigned to DEFAULT_OWNER
E. Kevin and Steve both are owners of the table

Correct Answer: B

Explanation/Reference:

A user who creates an object becomes its owner; it does not matter who owns the parent object.
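A minimal sketch of the flow described above (the principal name is hypothetical, and the exact privilege keyword can differ between Unity Catalog and legacy table ACLs):

-- Kevin, as schema owner, grants Steve permission to create tables in sales
GRANT CREATE ON SCHEMA sales TO `steve@example.com`;

-- Steve creates the table and therefore becomes its owner
CREATE TABLE sales.regional_sales (region STRING, total_sales DOUBLE);

-- Ownership can be verified in the table details
DESCRIBE TABLE EXTENDED sales.regional_sales;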

QUESTION 7

How do you drop a DELTA table?

A. DROP DELTA table_name
B. DROP TABLE table_name
C. DROP TABLE table_name FORMAT DELTA
D. DROP table_name

Correct Answer: B

QUESTION 8

Which of the following describes how Databricks Repos can help facilitate CI/CD workflows on the
Databricks Lakehouse Platform?

A. Databricks Repos can facilitate the pull request, review, and approval process before merging
branches
B. Databricks Repos can merge changes from a secondary Git branch into a main Git branch
C. Databricks Repos can be used to design, develop, and trigger Git automation pipelines
D. Databricks Repos can store the single-source-of-truth Git repository
E. Databricks Repos can commit or push code changes to trigger a CI/CD process

Correct Answer: E

Explanation/Reference:

The answer is, Databricks Repos can commit or push code changes to trigger a CI/CD process.
See the diagram below to understand the roles Databricks Repos and the Git provider play when building a CI/CD workflow.
All the steps highlighted in yellow can be done in Databricks Repos; all the steps highlighted in gray are done in a Git provider like GitHub or Azure DevOps.

QUESTION 9

Data science team members are using a single cluster to perform data analysis. Although the cluster size was chosen to handle multiple users and auto-scaling was enabled, the team realized queries are still running slow. What would be the suggested fix for this?

A. Set up multiple clusters so each team member has their own cluster
B. Disable the auto-scaling feature
C. Use High concurrency mode instead of the standard mode
D. Increase the size of the driver node

Correct Answer: C

Explanation/Reference:

The answer is Use High concurrency mode instead of the standard mode,
https://docs.databricks.com/clusters/cluster-config-best-practices.html#cluster-mode
High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc
jobs. Administrators usually create High Concurrency clusters. Databricks recommends enabling
autoscaling for High Concurrency clusters.

QUESTION 10

What is the main difference between AUTO LOADER and COPY INTO?

A. COPY INTO supports schema evolution.
B. AUTO LOADER supports schema evolution.
C. COPY INTO supports file notification when performing incremental loads.
D. AUTO LOADER supports directory listing when performing incremental loads.
E. AUTO LOADER supports file notification when performing incremental loads.

Correct Answer: E

Explanation/Reference:

Auto loader supports both directory listing and file notification but COPY INTO only supports directory
listing.
Auto loader file notification will automatically set up a notification service and queue service that
subscribe to file events from the input directory in cloud object storage like Azure blob storage or S3.
File notification mode is more performant and scalable for large input directories or a high volume of
files.

Auto Loader and Cloud Storage Integration


Auto Loader supports two ways to ingest data incrementally:
Directory listing - lists the input directory and maintains state in RocksDB; supports incremental file listing.
File notification - uses a trigger plus a queue to store the file notifications, which can later be used to retrieve the files; unlike directory listing, file notification can scale up to millions of files per day.
[OPTIONAL]
Auto Loader vs COPY INTO?
Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage
without any additional setup. Auto Loader provides a new Structured Streaming source called
cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically
processes new files as they arrive, with the option of also processing existing files in that directory.

When to use Auto Loader instead of COPY INTO?
You want to load data from a file location that contains files in the order of millions or higher. Auto
Loader can discover files more efficiently than the COPY INTO SQL command and can split file
processing into multiple batches.
You do not plan to load subsets of previously uploaded files. With Auto Loader, it can be more difficult
to reprocess subsets of files. However, you can use the COPY INTO SQL command to reload subsets of
files while an Auto Loader stream is simultaneously running.
Here are some additional notes on when to use COPY INTO vs Auto Loader
When to use COPY INTO
https://docs.databricks.com/delta/delta-ingest.html#copy-into-sql-command
When to use Auto Loader
https://docs.databricks.com/delta/delta-ingest.html#auto-loader
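For context, a minimal COPY INTO sketch (the table name, path, and options are hypothetical); COPY INTO is idempotent and skips files it has already loaded:

COPY INTO sales.orders_bronze
FROM 's3://my-bucket/raw/orders/'
FILEFORMAT = JSON
COPY_OPTIONS ('mergeSchema' = 'true');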

QUESTION 11

A data engineer needs to apply custom logic to the string column city in the table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).
Which of the following code blocks creates this SQL UDF?

A.
B.
C.
D.
E.
(The answer choices are code-block screenshots that are not reproduced in this text.)

Correct Answer: E
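Since the answer-choice screenshots are not reproduced above, here is a minimal sketch of the SQL UDF syntax the correct option follows (the function name and logic are hypothetical):

CREATE OR REPLACE FUNCTION clean_city(city STRING)
RETURNS STRING
RETURN INITCAP(TRIM(city));

-- apply the UDF at scale
SELECT clean_city(city) AS city FROM stores;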

QUESTION 12

You are tasked to set up a notebook as a job for six departments, and each department can run the task in parallel. The notebook takes an input parameter, dept number, to process the data by department. How do you go about setting this up as a job?
A. Use a single notebook as a task in the job and use dbutils.notebook.run to run each notebook with a parameter in a different cell
B. A task in the job cannot take an input parameter; create six notebooks with a hardcoded dept number and set up six tasks with a linear dependency in the job
C. A task accepts key-value pair parameters; create six tasks passing the department number as a parameter for each task, with no dependency in the job, as they can all run in parallel
D. A parameter can only be passed at the job level; create six jobs passing the department number to each job, with a linear job dependency
E. A parameter can only be passed at the job level; create six jobs passing the department number to each job, with no job dependency

Correct Answer: C

Explanation/Reference:

Here is how you set it up:

Create a single job with six tasks using the same notebook, and assign a different parameter to each task.
All tasks are added in a single job and can run in parallel, either using a single shared cluster or individual clusters.

QUESTION 13

When you drop a managed table using the SQL syntax DROP TABLE table_name, how does it impact the metadata, history, and data stored in the table?

A. Drops the table from the metastore, and drops metadata, history, and data in storage
B. Drops the table from the metastore and data from storage, but keeps metadata and history in storage
C. Drops the table from the metastore, metadata, and history, but keeps the data in storage
D. Drops the table but keeps metadata, history, and data in storage
E. Drops the table and history but keeps metadata and data in storage

Correct Answer: A

Explanation/Reference:

For a managed table, a drop command will drop everything from metastore and storage.
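A quick way to confirm whether a table is managed before dropping it (a sketch; the table name is hypothetical):

-- the "Type" row in the output shows MANAGED or EXTERNAL
DESCRIBE TABLE EXTENDED my_table;

-- for a managed table this removes the metastore entry, history, and data files
DROP TABLE my_table;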

QUESTION 14

Which of the following techniques does Structured Streaming use to ensure recovery from failures during stream processing?

A. Checkpointing and watermarking
B. Write-ahead logging and watermarking
C. Checkpointing and write-ahead logging
D. Delta time travel
E. The stream will fail over to available nodes in the cluster
F. Checkpointing and idempotent sinks

Correct Answer: C

Explanation/Reference:

The answer is Checkpointing and write-ahead logging.


Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.

QUESTION 15

Which of the following tools provides Data Access Control, Access Audit, Data Lineage, and Data Discovery?

A. DELTA LIVE Pipelines
B. Unity Catalog
C. Data Governance
D. DELTA lake
E. Lakehouse

Correct Answer: B

Explanation/Reference:

The answer is Unity Catalog

QUESTION 16

The Delta Live Tables pipeline is configured to run in Development mode using the Triggered pipeline mode. What is the expected outcome after clicking Start to update the pipeline?

A. All datasets will be updated once and the pipeline will shut down. The compute resources will be
terminated
B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources
will be deployed for the update and terminated when the pipeline is stopped
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources
will persist after the pipeline is stopped to allow for additional development and testing
D. All datasets will be updated once and the pipeline will shut down. The compute resources will
persist to allow for additional development and testing
E. All datasets will be updated continuously and the pipeline will not shut down. The compute
resources will persist with the pipeline

Correct Answer: D

Explanation/Reference:

The answer is, All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional development and testing.
A DLT pipeline supports two modes, Development and Production; you can switch between the two based on the stage of your development and deployment lifecycle.
Development and production modes
When you run your pipeline in development mode, the Delta Live Tables system:
Reuses a cluster to avoid the overhead of restarts.
Disables pipeline retries so you can immediately detect and fix errors.
In production mode, the Delta Live Tables system:
Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
Retries execution in the event of specific errors, for example, a failure to start a cluster.
Use the buttons in the Pipelines UI to switch between development and production modes. By default, pipelines run in development mode.
Switching between development and production modes only controls cluster and pipeline execution
behavior. Storage locations must be configured as part of pipeline settings and are not affected when
switching between modes.
Please review additional DLT concepts using the link below:
https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-concepts.html#delta-live-tables-concepts

QUESTION 17

Kevin is the owner of both the sales table and the regional_sales_vw view, which uses the sales table as the underlying source for its data. Kevin is looking to grant the SELECT privilege on the view regional_sales_vw to one of the newly joined team members, Steven. Which of the following is a true statement?

A. Kevin cannot grant access to Steven since he does not have the security admin privilege
B. Although Kevin is the owner, he does not have the ALL PRIVILEGES permission
C. Kevin can grant access to the view, because he is the owner of the view and the underlying table
D. Kevin cannot grant access to Steven since he does not have the workspace admin privilege
E. Steven will also require SELECT access on the underlying table

Correct Answer: C

Explanation/Reference:

The answer is, Kevin can grant access to the view, because he is the owner of the view and the underlying table.
Ownership determines whether or not you can grant privileges on derived objects to other users. A user who creates a schema, table, view, or function becomes its owner. The owner is granted all privileges and can grant privileges to other users.
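A minimal sketch of the grant Kevin could run (the principal name is hypothetical, and the exact securable keyword, VIEW or TABLE, can vary by metastore):

GRANT SELECT ON VIEW regional_sales_vw TO `steven@example.com`;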

QUESTION 18

Which of the following SQL statements can be used to update a transactions table, to set a flag on the table from Y to N?
A. MODIFY transactions SET active_flag = 'N' WHERE active_flag = 'Y'
B. MERGE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
C. UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
D. REPLACE transactions SET active_flag = 'N' WHERE active_flag = 'Y'

Correct Answer: C

Explanation/Reference:

The answer is
UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
Delta Lake supports UPDATE statements on Delta tables, and all of the changes made as part of the update are ACID compliant.

QUESTION 19

You are currently working on a production job failure for a job set up on a job cluster, caused by a data issue. What cluster do you need to start to investigate and analyze the data?

A. A job cluster can be used to analyze the problem
B. An all-purpose cluster/interactive cluster is the recommended way to run commands and view the data
C. The existing job cluster can be used to investigate the issue
D. A Databricks SQL endpoint can be used to investigate the issue

Correct Answer: B

Explanation/Reference:

The answer is, An all-purpose cluster/interactive cluster is the recommended way to run commands and view the data.
A job cluster cannot provide a way for a user to interact with a notebook once the job is submitted, but an interactive cluster allows you to display data, view visualizations, and write or edit queries, which makes it a perfect fit to investigate and analyze the data.

QUESTION 20

Which of the following benefits of using the Databricks Lakehouse Platform is provided by Delta
Lake?
A. The ability to manipulate the same data using a variety of languages
B. The ability to collaborate in real time on a single notebook
C. The ability to set up alerts for query failures

D. The ability to support batch and streaming workloads
E. The ability to distribute complex data operations

Correct Answer: D

QUESTION 21

You are looking to process the data based on two variables: one to check whether the department is supply chain, and another to check whether the process flag is set to True.

A. if department = "supply chain" | process:
B. if department == "supply chain" or process = TRUE:
C. if department == "supply chain" | process == TRUE:
D. if department == "supply chain" | if process == TRUE:
E. if department == "supply chain" or process:

Correct Answer: E

QUESTION 22

The research team has put together a funnel-analysis query to monitor customer traffic on the e-commerce platform. The query takes about 30 minutes to run on a small SQL endpoint cluster with max scaling set to 1 cluster. What steps can they take to improve the performance of the query?
A. They can turn on the Serverless feature for the SQL endpoint.
B. They can increase the maximum bound of the SQL endpoint's scaling range anywhere from 1 to 100 to review the performance and select the size that meets the required SLA.
C. They can increase the cluster size anywhere from X small to 3XL to review the performance and
select the size that meets the required SLA.
D. They can turn off the Auto Stop feature for the SQL endpoint to more than 30 mins.
E. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy
from “Cost optimized” to “Reliability Optimized.”

Correct Answer: C

Explanation/Reference:

The answer is, They can increase the cluster size anywhere from X-Small to 3XL to review the performance and select the size that meets the required SLA.

A SQL endpoint scales horizontally (scale-out) and vertically (scale-up); you have to understand when to use which.
Scale-out -> add more clusters for a SQL endpoint by changing the maximum number of clusters. If you are trying to improve throughput, that is, being able to run as many queries as possible, then having additional cluster(s) will improve the performance.
Scale-up -> increase the size of the SQL endpoint by changing the cluster size from X-Small to Small, to Medium, X-Large, and so on. If you are trying to improve the performance of a single query, having additional memory, nodes, and CPU in the cluster will improve the performance.

QUESTION 23

You were asked to create a unique list of items that were added to the cart by users. Fill in the blanks by choosing the appropriate functions.
Schema: cartId INT, items ARRAY<INT>

SELECT cartId, _(_(items)) FROM carts

A. ARRAY_UNION, ARRAY_DISTINCT
B. ARRAY_DISTINCT, ARRAY_UNION
C. ARRAY_DISTINCT, FLATTEN
D. FLATTEN, ARRAY_DISTINCT
E. ARRAY_DISTINCT, ARRAY_FLATTEN

Correct Answer: C

Explanation/Reference:

FLATTEN -> Transforms an array of arrays into a single array.


ARRAY_DISTINCT -> The function returns an array of the same type as the input argument where all
duplicate values have been removed.
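A quick worked example combining the two functions on literal values:

SELECT array_distinct(flatten(array(array(1, 2, 2), array(2, 3))));
-- returns [1, 2, 3]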

QUESTION 24

You are working to set up two notebooks to run on a schedule. The second notebook is dependent on the first notebook, and both notebooks need different types of compute to run in an optimal fashion. What is the best way to set up these notebooks as jobs?

A. Use DELTA LIVE PIPELINES instead of notebook tasks
B. A job can only use a single cluster; set up a job for each notebook and use job dependency to link both jobs together
C. Each task can use a different cluster; add these two notebooks as two tasks in a single job with a linear dependency and modify the cluster as needed for each of the tasks
D. Use a single job to set up both notebooks as individual tasks, but use the cluster API to set up the second cluster before the start of the second task
E. Use a very large cluster to run both the tasks in a single job

Correct Answer: C

Explanation/Reference:

Tasks in Jobs support different clusters for each task in the same job.

QUESTION 25

Which of the following commands can be used to write data into a Delta table while avoiding the
writing of duplicate records?
A. DROP
B. IGNORE
C. MERGE
D. APPEND
E. INSERT

Correct Answer: C
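For reference, a minimal MERGE sketch that inserts only records not already present in the target (the table and column names are hypothetical):

MERGE INTO target_table t
USING new_records u
ON t.id = u.id
WHEN NOT MATCHED THEN INSERT *;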
