Databricks Certified Data Engineer Associate
QUESTION & ANSWERS
QUESTION 1
You were asked to create a table that can store the below data; note that orderDate is the truncated
date of orderTime. Fill in the blank to complete the DDL.
Correct Answer: B
Explanation/Reference:
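The answer choices here reference a DDL not shown above. As an illustrative sketch only, one DDL that meets the requirement uses a Delta generated column; the table name and the other column names are assumptions:

CREATE TABLE orders (
  orderId INT,
  orderTime TIMESTAMP,
  -- orderDate is derived by truncating orderTime to a date
  orderDate DATE GENERATED ALWAYS AS (CAST(orderTime AS DATE)),
  units INT
);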
QUESTION 2
Correct Answer: E
Explanation/Reference:
QUESTION 3
Which two options does Auto Loader support for identifying the arrival of new files and
incremental data in cloud object storage?
C. Write-ahead logging, read-ahead logging
D. File hashing, Dynamic file lookup
E. Checkpointing and Write ahead logging
Correct Answer: A
Explanation/Reference:
QUESTION 4
You notice that a job cluster is taking 6 to 8 minutes to start, which is delaying your job from
finishing on time. What steps can you take to reduce the cluster startup time?
A. Set up a second job ahead of the first job to start the cluster, so the cluster is ready with resources
when the job starts
B. Use an all-purpose cluster instead to reduce cluster startup time
C. Reduce the size of the cluster; the smaller the cluster, the less time it takes to start
D. Use cluster pools to reduce the startup time of the jobs
E. Use SQL endpoints to reduce the startup time
Correct Answer: D
Explanation/Reference:
The answer is: use cluster pools to reduce the startup time of the jobs.
Cluster pools allow you to reserve VMs ahead of time; when a new job cluster is created, VMs are
grabbed from the pool. Note: while the VMs sit idle in the pool, the only cost incurred is the cloud
provider's infrastructure cost (e.g., Azure); the Databricks runtime cost is only billed once a VM is
allocated to a cluster.
Here is a demo of how to set up cluster pools and follow some best practices:
https://www.youtube.com/watch?v=FVtITxOabxg&ab_channel=DatabricksAcademy
QUESTION 5
You are working on a marketing team request to identify customers with the same information between
two tables, CUSTOMERS_2021 and CUSTOMERS_2020. Each table contains 25 columns with the same
schema. You want to identify rows that match between the two tables across all columns. Which of the
following can be used to perform this in SQL?
A. SELECT * FROM CUSTOMERS_2021 UNION SELECT * FROM CUSTOMERS_2020
B. SELECT * FROM CUSTOMERS_2021 UNION ALL SELECT * FROM CUSTOMERS_2020
C. SELECT * FROM CUSTOMERS_2021 C1 INNER JOIN CUSTOMERS_2020 C2 ON C1.CUSTOMER_ID = C2.CUSTOMER_ID
D. SELECT * FROM CUSTOMERS_2021 INTERSECT SELECT * FROM CUSTOMERS_2020
E. SELECT * FROM CUSTOMERS_2021 EXCEPT SELECT * FROM CUSTOMERS_2020
Correct Answer: D
Explanation/Reference:
The answer is:
SELECT * FROM CUSTOMERS_2021
INTERSECT
SELECT * FROM CUSTOMERS_2020

INTERSECT [ALL | DISTINCT]
Returns the set of rows which are in both subqueries.
If ALL is specified, a row that appears multiple times in subquery1 as well as in subquery2 will be
returned multiple times.
If DISTINCT is specified, the result does not contain duplicate rows. This is the default.
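A minimal sketch of the ALL vs. DISTINCT behavior, using inline VALUES rows rather than the actual customer tables:

-- INTERSECT ALL keeps duplicate matches: returns 1, 1
SELECT col FROM VALUES (1), (1), (2) AS t(col)
INTERSECT ALL
SELECT col FROM VALUES (1), (1), (3) AS s(col);

-- INTERSECT defaults to DISTINCT: returns a single 1
SELECT col FROM VALUES (1), (1), (2) AS t(col)
INTERSECT
SELECT col FROM VALUES (1), (1), (3) AS s(col);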
QUESTION 6
Kevin is the owner of the schema sales. Steve wants to create a new table in the sales schema called
regional_sales, so Kevin grants the CREATE TABLE permission to Steve. Steve creates the new table
called regional_sales in the sales schema. Who is the owner of the table regional_sales?
A. Kevin is the owner of the sales schema; all the tables in the schema will be owned by Kevin
B. Steve is the owner of the table
C. By default, ownership is assigned to DBO
D. By default, ownership is assigned to DEFAULT_OWNER
E. Kevin and Steve are both owners of the table
Correct Answer: B
Explanation/Reference:
The user who creates an object becomes its owner, regardless of who owns the parent object.
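A sketch of the grant-and-create flow described in the question; the principal name and table columns are assumptions:

-- Kevin, as schema owner, lets Steve create tables in sales
GRANT USAGE ON SCHEMA sales TO `steve@example.com`;
GRANT CREATE ON SCHEMA sales TO `steve@example.com`;

-- Steve creates the table and therefore becomes its owner
CREATE TABLE sales.regional_sales (region STRING, amount DOUBLE);

-- Ownership can be confirmed in the Owner field of the output
DESCRIBE TABLE EXTENDED sales.regional_sales;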
QUESTION 7
Which of the following SQL commands can be used to drop a Delta table?
A. DROP DELTA table_name
B. DROP TABLE table_name
C. DROP TABLE table_name FORMAT DELTA
D. DROP table_name
Correct Answer: B
QUESTION 8
Which of the following describes how Databricks Repos can help facilitate CI/CD workflows on the
Databricks Lakehouse Platform?
A. Databricks Repos can facilitate the pull request, review, and approval process before merging
branches
B. Databricks Repos can merge changes from a secondary Git branch into a main Git branch
C. Databricks Repos can be used to design, develop, and trigger Git automation pipelines
D. Databricks Repos can store the single-source-of-truth Git repository
E. Databricks Repos can commit or push code changes to trigger a CI/CD process
Correct Answer: E
Explanation/Reference:
Answer is Databricks Repos can commit or push code changes to trigger a CI/CD process
See the diagram below to understand the roles Databricks Repos and the Git provider play when building a
CI/CD workflow.
All the steps highlighted in yellow can be done in Databricks Repos; all the steps highlighted in gray are
done in a Git provider like GitHub or Azure DevOps.
QUESTION 9
Data science team members are using a single cluster to perform data analysis. Although the cluster size
was chosen to handle multiple users and auto-scaling was enabled, the team realized queries are still
running slow. What would be the suggested fix for this?
A. Setup multiple clusters so each team member has their own cluster
B. Disable the auto-scaling feature
C. Use High concurrency mode instead of the standard mode
D. Increase the size of the driver node
Correct Answer: C
Explanation/Reference:
The answer is: use High Concurrency mode instead of the standard mode.
https://docs.databricks.com/clusters/cluster-config-best-practices.html#cluster-mode
High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc
jobs. Administrators usually create High Concurrency clusters. Databricks recommends enabling
autoscaling for High Concurrency clusters.
QUESTION 10
What is the main difference between AUTO LOADER and COPY INTO?
A. COPY INTO supports schema evolution.
B. AUTO LOADER supports schema evolution.
C. COPY INTO supports file notification when performing incremental loads.
D. AUTO LOADER supports directory listing when performing incremental loads.
E. AUTO LOADER Supports file notification when performing incremental loads.
Correct Answer: E
Explanation/Reference:
Auto loader supports both directory listing and file notification but COPY INTO only supports directory
listing.
Auto loader file notification will automatically set up a notification service and queue service that
subscribe to file events from the input directory in cloud object storage like Azure blob storage or S3.
File notification mode is more performant and scalable for large input directories or a high volume of
files.
When to use Auto Loader instead of COPY INTO:
You want to load data from a file location that contains files in the order of millions or higher. Auto
Loader can discover files more efficiently than the COPY INTO SQL command and can split file
processing into multiple batches.
You do not plan to load subsets of previously uploaded files. With Auto Loader, it can be more difficult
to reprocess subsets of files. However, you can use the COPY INTO SQL command to reload subsets of
files while an Auto Loader stream is simultaneously running.
Here are some additional notes on when to use COPY INTO vs Auto Loader
When to use COPY INTO
https://docs.databricks.com/delta/delta-ingest.html#copy-into-sql-command
When to use Auto Loader
https://docs.databricks.com/delta/delta-ingest.html#auto-loader
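For reference, a minimal COPY INTO sketch; the table name and source path are assumptions. COPY INTO is idempotent, so rerunning it only loads files it has not already ingested:

-- Load new JSON files from the landing path into an existing Delta table
COPY INTO sales_bronze
FROM '/mnt/raw/sales'
FILEFORMAT = JSON
COPY_OPTIONS ('mergeSchema' = 'true');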
QUESTION 11
A data engineer needs to apply custom logic to string column city in table stores for a specific use
case. In
order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined
function (UDF).
Which of the following code blocks creates this SQL UDF?
A.
B.
C.
D.
E.
Correct Answer: E
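As a sketch of the general Databricks SQL UDF syntax being tested, assuming the custom logic is a simple trim-and-uppercase transform (the function name and logic are illustrative):

CREATE OR REPLACE FUNCTION clean_city(city STRING)
RETURNS STRING
RETURN UPPER(TRIM(city));

-- Once created, the UDF is applied like any built-in function:
SELECT clean_city(city) FROM stores;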
QUESTION 12
You are tasked to set up a notebook as a job for six departments, where each department can run the
task in parallel. The notebook takes an input parameter, dept number, to process the data by
department. How do you go about setting this up in a job?
A. Use a single notebook as a task in the job and use dbutils.notebook.run to run each notebook with a
parameter in a different cell
B. A task in the job cannot take an input parameter; create six notebooks with hardcoded dept
numbers and set up six tasks with linear dependency in the job
C. A task accepts key-value pair parameters; create six tasks and pass the department number as a
parameter for each task, with no dependency in the job, as they can all run in parallel
D. A parameter can only be passed at the job level; create six jobs and pass the department number to
each job, with linear job dependency
E. A parameter can only be passed at the job level; create six jobs and pass the department number to
each job, with no job dependency
Correct Answer: C
Explanation/Reference:
All tasks are added to a single job and can run in parallel, either using a single shared cluster or
individual clusters for each task.
QUESTION 13
When you drop a managed table using the SQL syntax DROP TABLE table_name, how does it impact the
metadata, history, and data stored in the table?
A. Drops the table from the metastore, and drops the metadata, history, and data in storage
B. Drops the table from the metastore and the data from storage, but keeps the metadata and history in storage
C. Drops the table from the metastore, along with the metadata and history, but keeps the data in storage
D. Drops the table but keeps the metadata, history, and data in storage
E. Drops the table and history but keeps the metadata and data in storage
Correct Answer: A
Explanation/Reference:
For a managed table, a drop command will drop everything from metastore and storage.
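A minimal sketch contrasting this with external table behavior; the table names and path are assumptions:

-- Managed table: DROP removes the metastore entry and the underlying data files
CREATE TABLE managed_sales (id INT);
DROP TABLE managed_sales;

-- External table: DROP removes only the metastore entry; files at LOCATION remain
CREATE TABLE external_sales (id INT) LOCATION '/mnt/data/external_sales';
DROP TABLE external_sales;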
QUESTION 14
Which of the following techniques does Structured Streaming use to ensure recovery from failures
during stream processing?
Correct Answer: C
Explanation/Reference:
QUESTION 15
Which of the following tools provides data access control, access audit, data lineage, and data
discovery?
Correct Answer: B
Explanation/Reference:
QUESTION 16
The Delta Live Tables pipeline is configured to run in Development mode using the Triggered pipeline
mode. What is the expected outcome after clicking Start to update the pipeline?
A. All datasets will be updated once and the pipeline will shut down. The compute resources will be
terminated
B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources
will be deployed for the update and terminated when the pipeline is stopped
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources
will persist after the pipeline is stopped to allow for additional development and testing
D. All datasets will be updated once and the pipeline will shut down. The compute resources will
persist to allow for additional development and testing
E. All datasets will be updated continuously and the pipeline will not shut down. The compute
resources will persist with the pipeline
Correct Answer: D
Explanation/Reference:
The answer is All datasets will be updated once and the pipeline will shut down. The compute
resources will persist to allow for additional testing.
DLT pipelines support two modes, Development and Production; you can switch between the two
based on the stage of your development and deployment lifecycle.
Development and production modes
When you run your pipeline in development mode, the Delta Live Tables system:
Reuses a cluster to avoid the overhead of restarts.
Disables pipeline retries so you can immediately detect and fix errors.
In production mode, the Delta Live Tables system:
Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
Retries execution in the event of specific errors, for example, a failure to start a cluster.
Use the Development/Production toggle in the Pipelines UI to switch between development and
production modes. By default, pipelines run in development mode.
Switching between development and production modes only controls cluster and pipeline execution
behavior. Storage locations must be configured as part of pipeline settings and are not affected when
switching between modes.
Please review additional DLT concepts using below link
https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-concepts.html#delta-live-tables-concepts
QUESTION 17
Kevin is the owner of both the sales table and the regional_sales_vw view, which uses the sales table
as the underlying source for its data. Kevin is looking to grant the SELECT privilege on the view
regional_sales_vw to a newly joined team member, Steven. Which of the following is a true
statement?
A. Kevin cannot grant access to Steven since he does not have the security admin privilege
B. Kevin, although the owner, does not have the ALL PRIVILEGES permission
C. Kevin can grant access to the view, because he is the owner of the view and the underlying table
D. Kevin cannot grant access to Steven since he does not have the workspace admin privilege
E. Steven will also require SELECT access on the underlying table
Correct Answer: C
Explanation/Reference:
The answer is: Kevin can grant access to the view because he is the owner of the view and the
underlying table.
Ownership determines whether or not you can grant privileges on derived objects to other users. A
user who creates a schema, table, view, or function becomes its owner. The owner is granted all
privileges and can grant privileges to other users.
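A sketch of the grant in question, with a hypothetical principal name for Steven:

-- Kevin, as owner of both the view and the underlying table, can grant this directly
GRANT SELECT ON VIEW regional_sales_vw TO `steven@example.com`;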
QUESTION 18
Which of the following SQL statements can be used to update a transactions table, setting a flag on
the table from 'Y' to 'N'?
A. MODIFY transactions SET active_flag = 'N' WHERE active_flag = 'Y'
B. MERGE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
C. UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
D. REPLACE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
Correct Answer: C
Explanation/Reference:
The answer is
UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y'
Delta Lake supports UPDATE statements on Delta tables, and all of the changes made as part of the
update are ACID compliant.
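Because each update is a transactional commit to the Delta log, it can be verified in the table history; a quick sketch, assuming the table is named transactions:

UPDATE transactions SET active_flag = 'N' WHERE active_flag = 'Y';

-- The UPDATE appears as a new version in the table history
DESCRIBE HISTORY transactions;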
QUESTION 19
You are currently working on a production job failure, caused by a data issue, for a job set up on a
job cluster. What type of cluster do you need to start to investigate and analyze the data?
Correct Answer: B
Explanation/Reference:
The answer is: an all-purpose (interactive) cluster is the recommended way to run commands and view
the data.
A job cluster does not provide a way for a user to interact with a notebook once the job is submitted,
but an interactive cluster allows you to display data, view visualizations, and write or edit queries,
which makes it a perfect fit to investigate and analyze the data.
QUESTION 20
Which of the following benefits of using the Databricks Lakehouse Platform is provided by Delta
Lake?
A. The ability to manipulate the same data using a variety of languages
B. The ability to collaborate in real time on a single notebook
C. The ability to set up alerts for query failures
D. The ability to support batch and streaming workloads
E. The ability to distribute complex data operations
Correct Answer: D
QUESTION 21
You are looking to process the data based on two variables: one to check if the department is supply
chain, and another to check if the process flag is set to True.
Correct Answer: E
QUESTION 22
The research team has put together a funnel-analysis query to monitor customer traffic on the
e-commerce platform. The query takes about 30 minutes to run on a small SQL endpoint cluster with
max scaling set to 1 cluster. What steps can be taken to improve the performance of the query?
A. They can turn on the Serverless feature for the SQL endpoint.
B. They can increase the maximum bound of the SQL endpoint’s scaling range anywhere from 1 to 100
clusters to review the performance and select the size that meets the required SLA.
C. They can increase the cluster size anywhere from X small to 3XL to review the performance and
select the size that meets the required SLA.
D. They can turn off the Auto Stop feature for the SQL endpoint or set it to more than 30 mins.
E. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy
from “Cost optimized” to “Reliability Optimized.”
Correct Answer: C
Explanation/Reference:
The answer is: they can increase the cluster size anywhere from X-Small to 3XL to review the
performance and select the size that meets the required SLA.
A SQL endpoint scales horizontally (scale-out) and vertically (scale-up); you have to understand when
to use which.
Scale-out: add more clusters to a SQL endpoint by changing the max number of clusters.
If you are trying to improve throughput, i.e., being able to run as many queries as possible, then
having additional cluster(s) will improve the performance.
Scale-up: increase the size of the SQL endpoint by changing the cluster size from X-Small to Small, to
Medium, to X-Large, and so on.
If you are trying to improve the performance of a single query, having additional memory, nodes, and
CPUs in the cluster will improve the performance.
QUESTION 23
You were asked to create a unique list of items that were added to the cart by each user; fill in the
blanks by choosing the appropriate functions.
Schema: cartId INT, items ARRAY<INT>
A. ARRAY_UNION, ARRAY_DISTINCT
B. ARRAY_DISTINCT, ARRAY_UNION
C. ARRAY_DISTINCT, FLATTEN
D. FLATTEN, ARRAY_DISTINCT
E. ARRAY_DISTINCT, ARRAY_FLATTEN
Correct Answer: C
Explanation/Reference:
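A minimal sketch of how the two functions compose, using an inline array literal rather than the actual cart data; the carts table name and grouping are assumptions:

-- FLATTEN merges the nested arrays, ARRAY_DISTINCT removes duplicates: returns [1, 2, 3]
SELECT ARRAY_DISTINCT(FLATTEN(ARRAY(ARRAY(1, 2, 2), ARRAY(2, 3))));

-- Against the table, per cart:
SELECT cartId, ARRAY_DISTINCT(FLATTEN(COLLECT_LIST(items))) AS unique_items
FROM carts
GROUP BY cartId;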
QUESTION 24
You are working to set up two notebooks to run on a schedule. The second notebook is dependent on
the first notebook, and both notebooks need different types of compute to run in an optimal fashion.
What is the best way to set up these notebooks as jobs?
Correct Answer: C
Explanation/Reference:
Tasks in Jobs support different clusters for each task in the same job.
QUESTION 25
Which of the following commands can be used to write data into a Delta table while avoiding the
writing of duplicate records?
A. DROP
B. IGNORE
C. MERGE
D. APPEND
E. INSERT
Correct Answer: C
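A sketch of the deduplicating MERGE pattern; the table names and key column are assumptions. Only source rows whose key is not already present in the target are inserted, which avoids writing duplicate records:

MERGE INTO target_table AS t
USING source_updates AS s
ON t.id = s.id
WHEN NOT MATCHED THEN INSERT *;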