Databricks

The document outlines various questions and answers related to data engineering, specifically focusing on the use of Databricks, Delta Lake, and SQL commands. It addresses issues such as data quality, table management, and the functionality of data lakehouses. The questions cover practical scenarios and technical commands relevant to data engineers and analysts in their workflows.


1. A data organization leader is upset that the data analysis team’s reports
differ from the data engineering team’s reports. The leader believes the
siloed nature of their organization’s data engineering and data analysis
architectures is to blame.
Which of the following describes how a data lakehouse could alleviate this
issue?

 A. Both teams would autoscale their work as data size evolves

 B. Both teams would use the same source of truth for their work Most Voted

 C. Both teams would reorganize to report to the same department

 D. Both teams would be able to collaborate on projects in real-time

 E. Both teams would respond more quickly to ad-hoc requests

2. Which of the following describes a scenario in which a data team will want
to utilize cluster pools?

 A. An automated report needs to be refreshed as quickly as possible. Most Voted

 B. An automated report needs to be made reproducible.

 C. An automated report needs to be tested to identify errors.

 D. An automated report needs to be version-controlled across multiple collaborators.

 E. An automated report needs to be runnable by all stakeholders.

3. Which of the following is hosted completely in the control plane of the
classic Databricks architecture?

 A. Worker node

 B. JDBC data source

 C. Databricks web application Most Voted

 D. Databricks Filesystem

 E. Driver node

4. Which of the following benefits of using the Databricks Lakehouse Platform
is provided by Delta Lake?

 A. The ability to manipulate the same data using a variety of languages
 B. The ability to collaborate in real time on a single notebook
 C. The ability to set up alerts for query failures
 D. The ability to support batch and streaming workloads
 E. The ability to distribute complex data operations

5. Which of the following describes the storage organization of a Delta table?

 A. Delta tables are stored in a single file that contains data, history,
metadata, and other attributes.

 B. Delta tables store their data in a single file and all metadata in a
collection of files in a separate location.

 C. Delta tables are stored in a collection of files that contain data,
history, metadata, and other attributes. Most Voted

 D. Delta tables are stored in a collection of files that contain only the
data stored within the table.

 E. Delta tables are stored in a single file that contains only the data
stored within the table.

6. Which of the following code blocks will remove the rows where the value
in column age is greater than 25 from the existing Delta table my_table and
save the updated table?

 A. SELECT * FROM my_table WHERE age > 25;

 B. UPDATE my_table WHERE age > 25;

 C. DELETE FROM my_table WHERE age > 25; Most Voted

 D. UPDATE my_table WHERE age <= 25;

 E. DELETE FROM my_table WHERE age <= 25;

7. A data engineer has realized that they made a mistake when making a
daily update to a table. They need to use Delta time travel to restore the
table to a version that is 3 days old. However, when the data engineer
attempts to time travel to the older version, they are unable to restore the
data because the data files have been deleted.
Which of the following explains why the data files are no longer present?

 A. The VACUUM command was run on the table Most Voted

 B. The TIME TRAVEL command was run on the table

 C. The DELETE HISTORY command was run on the table

 D. The OPTIMIZE command was run on the table

 E. The HISTORY command was run on the table
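
As the most-voted answer indicates, VACUUM permanently removes data files that are no longer referenced and are older than the retention window, which breaks time travel to the versions that depended on them. A minimal PySpark sketch, assuming a Delta table named my_table:

query = spark.sql("DESCRIBE HISTORY my_table")         # list versions still recorded in the transaction log
spark.sql("VACUUM my_table RETAIN 168 HOURS")          # removes unreferenced data files older than 7 days
spark.sql("SELECT * FROM my_table VERSION AS OF 3")    # fails if that version's files were vacuumed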

8. Which of the following Git operations must be performed outside of
Databricks Repos?

 A. Commit

 B. Pull

 C. Push

 D. Clone

 E. Merge Most Voted


9. Which of the following data lakehouse features results in improved data
quality over a traditional data lake?

 A. A data lakehouse provides storage solutions for structured and
unstructured data.

 B. A data lakehouse supports ACID-compliant transactions. Most Voted

 C. A data lakehouse allows the use of SQL queries to examine data.

 D. A data lakehouse stores data in open formats.

 E. A data lakehouse enables machine learning and artificial intelligence
workloads.

10. A data engineer has left the organization. The data team needs to
transfer ownership of the data engineer’s Delta tables to a new data
engineer. The new data engineer is the lead engineer on the data team.
Assuming the original data engineer no longer has access, which of the
following individuals must be the one to transfer ownership of the Delta
tables in Data Explorer?

 A. Databricks account representative

 B. This transfer is not possible

 C. Workspace administrator Most Voted

 D. New lead data engineer

 E. Original data engineer

11. A data engineer needs to determine whether to use the built-in Databricks
Notebooks versioning or to version their project using Databricks Repos.
Which of the following is an advantage of using Databricks Repos over the
Databricks Notebooks versioning?

 A. Databricks Repos automatically saves development progress

 B. Databricks Repos supports the use of multiple branches Most Voted


 C. Databricks Repos allows users to revert to previous versions of a
notebook

 D. Databricks Repos provides the ability to comment on specific changes

 E. Databricks Repos is wholly housed within the Databricks Lakehouse Platform

12. A data analyst has created a Delta table sales that is used by the entire
data analysis team. They want help from the data engineering team to
implement a series of tests to ensure the data is clean. However, the data
engineering team uses Python for its tests rather than SQL.
Which of the following commands could the data engineering team use to
access sales in PySpark?

 A. SELECT * FROM sales

 B. There is no way to share data between PySpark and SQL.

 C. spark.sql("sales")

 D. spark.delta.table("sales")

 E. spark.table("sales") Most Voted
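
A minimal sketch of the most-voted answer, assuming the Delta table sales is registered in the metastore; the filter is a hypothetical data-quality rule, just to show a Python-side test:

df = spark.table("sales")                    # load the SQL-defined table as a DataFrame
assert df.filter("price < 0").count() == 0   # hypothetical cleanliness check written in PySpark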

13. Which of the following commands will return the location of database
customer360?

 A. DESCRIBE LOCATION customer360;

 B. DROP DATABASE customer360;

 C. DESCRIBE DATABASE customer360; Most Voted

 D. ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user');

 E. USE DATABASE customer360;

14. A data engineer wants to create a new table containing the names of
customers that live in France.
They have written the following command:
A senior data engineer mentions that it is organization policy to include a
table property indicating that the new table includes personally identifiable
information (PII).
Which of the following lines of code fills in the above blank to successfully
complete the task?

 A. There is no way to indicate whether a table contains PII.

 B. "COMMENT PII"

 C. TBLPROPERTIES PII

 D. COMMENT "Contains PII" Most Voted

 E. PII
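
The CREATE TABLE statement itself was an image in the original and is not reproduced here. A hypothetical reconstruction showing where the most-voted answer fits; the table, column, and source names are assumptions:

spark.sql("""
    CREATE TABLE customers_fr
    COMMENT "Contains PII"
    AS SELECT name FROM customers WHERE country = 'France'
""")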

15. Which of the following benefits is provided by the array functions from
Spark SQL?

 A. An ability to work with data in a variety of types at once

 B. An ability to work with data within certain partitions and windows

 C. An ability to work with time-related data in specified intervals

 D. An ability to work with complex, nested data ingested from JSON files Most Voted

 E. An ability to work with an array of tables for procedural automation
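
A minimal sketch of an array function applied to nested, JSON-style data; the orders table and its items array column are hypothetical:

spark.sql("""
    SELECT order_id, explode(items) AS item    -- one output row per array element
    FROM orders
""")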

16. Which of the following commands can be used to write data into a Delta
table while avoiding the writing of duplicate records?
 A. DROP

 B. IGNORE

 C. MERGE Most Voted

 D. APPEND

 E. INSERT
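
A minimal sketch of the most-voted answer; MERGE only inserts rows whose key is not already present, so re-running the load does not create duplicates (table and column names are hypothetical):

spark.sql("""
    MERGE INTO transactions t
    USING new_records n
    ON t.transaction_id = n.transaction_id
    WHEN NOT MATCHED THEN INSERT *    -- rows already in the target are skipped
""")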

17. A data engineer needs to apply custom logic to string column city in
table stores for a specific use case. In order to apply this custom logic at
scale, the data engineer wants to create a SQL user-defined function (UDF).
Which of the following code blocks creates this SQL UDF?

 A. Most Voted

 B.

 C.
 D.

 E.
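
The answer options were images in the original. A hypothetical SQL UDF of the kind the question describes, created and applied from PySpark; the function body is an assumption:

spark.sql("""
    CREATE FUNCTION clean_city(city STRING)
    RETURNS STRING
    RETURN initcap(trim(city))        -- the custom string logic goes here
""")
spark.sql("SELECT clean_city(city) FROM stores")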

18. A data analyst has a series of queries in a SQL program. The data analyst
wants this program to run every day. They only want the final query in the
program to run on Sundays. They ask for help from the data engineering
team to complete this task.
Which of the following approaches could be used by the data engineering
team to complete this task?

 A. They could submit a feature request with Databricks to add this functionality.

 B. They could wrap the queries using PySpark and use Python’s control
flow system to determine when to run the final query. Most Voted

 C. They could only run the entire program on Sundays.

 D. They could automatically restrict access to the source table in the
final query so that it is only accessible on Sundays.

 E. They could redesign the data model to separate the data used in the
final query into a new table.
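
A minimal sketch of the most-voted approach: wrap the queries in Python so ordinary control flow decides whether the final one runs. The statements themselves are hypothetical stand-ins:

from datetime import date

spark.sql("INSERT INTO daily_metrics SELECT * FROM staging_metrics")    # daily queries run unconditionally
if date.today().weekday() == 6:                                         # weekday() == 6 is Sunday
    spark.sql("INSERT INTO weekly_rollup SELECT * FROM daily_metrics")  # final query, Sundays only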

19. A data engineer runs a statement every day to copy the previous day’s
sales into the table transactions. Each day’s sales are in their own file in the
location "/transactions/raw".
Today, the data engineer runs the following command to complete this task:
After running the command today, the data engineer notices that the
number of records in table transactions has not changed.
Which of the following describes why the statement might not have copied
any new records into the table?

 A. The format of the files to be copied was not included with the FORMAT_OPTIONS keyword.

 B. The names of the files to be copied were not included with the FILES
keyword.

 C. The previous day’s file has already been copied into the table. Most Voted

 D. The PARQUET file format does not support COPY INTO.

 E. The COPY INTO statement requires the table to be refreshed to view the copied rows.
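
The statement itself was an image in the original. A hypothetical COPY INTO of the kind described; COPY INTO is idempotent, so files that were already loaded are skipped on re-runs, which is why the most-voted answer explains the unchanged record count:

spark.sql("""
    COPY INTO transactions
    FROM '/transactions/raw'
    FILEFORMAT = PARQUET    -- format is an assumption; the original statement was not shown
""")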

20. A data engineer needs to create a table in Databricks using data from
their organization’s existing SQLite database.
They run the following command:

Which of the following lines of code fills in the above blank to successfully
complete the task?

 A. org.apache.spark.sql.jdbc Most Voted

 B. autoloader
 C. DELTA

 D. sqlite

 E. org.apache.spark.sql.sqlite
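
The command was an image in the original. A hypothetical reconstruction with the most-voted answer filled in; the JDBC URL, database path, and table name are assumptions:

spark.sql("""
    CREATE TABLE customers_from_sqlite
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url 'jdbc:sqlite:/path/company.db',    -- hypothetical SQLite JDBC URL
      dbtable 'customers'
    )
""")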

21. A data engineering team has two tables. The first table
march_transactions is a collection of all retail transactions in the month of
March. The second table april_transactions is a collection of all retail
transactions in the month of April. There are no duplicate records between
the tables.
Which of the following commands should be run to create a new table
all_transactions that contains all records from march_transactions and
april_transactions without duplicate records?

 A. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
INNER JOIN SELECT * FROM april_transactions;

 B. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
UNION SELECT * FROM april_transactions; Most Voted

 C. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
OUTER JOIN SELECT * FROM april_transactions;

 D. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
INTERSECT SELECT * FROM april_transactions;

 E. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
MERGE SELECT * FROM april_transactions;

22. A data engineer only wants to execute the final block of a Python program
if the Python variable day_of_week is equal to 1 and the Python variable
review_period is True.
Which of the following control flow statements should the data engineer use
to begin this conditionally executed code block?

 A. if day_of_week = 1 and review_period:

 B. if day_of_week = 1 and review_period = "True":

 C. if day_of_week == 1 and review_period == "True":

 D. if day_of_week == 1 and review_period: Most Voted

 E. if day_of_week = 1 & review_period: = "True":

23. A data engineer is attempting to drop a Spark SQL table my_table. The
data engineer wants to delete all table metadata and data.
They run the following command:

DROP TABLE IF EXISTS my_table


While the object no longer appears when they run SHOW TABLES, the data
files still exist.
Which of the following describes why the data files still exist and the
metadata files were deleted?

 A. The table’s data was larger than 10 GB

 B. The table’s data was smaller than 10 GB

 C. The table was external Most Voted

 D. The table did not have a location

 E. The table was managed
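
A minimal sketch of why the most-voted answer holds: a LOCATION clause makes the table external, so DROP TABLE removes only the metastore entry while the files remain (the path is hypothetical):

spark.sql("CREATE TABLE my_table (id INT) LOCATION '/mnt/external/my_table'")  # external table
spark.sql("DROP TABLE IF EXISTS my_table")   # metadata removed; files at the location remain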

24. A data engineer wants to create a data entity from a couple of tables. The
data entity must be used by other data engineers in other sessions. It also
must be saved to a physical location.
Which of the following data entities should the data engineer create?

 A. Database

 B. Function

 C. View

 D. Temporary view

 E. Table Most Voted

25. A data engineer is maintaining a data pipeline. Upon data ingestion, the
data engineer notices that the source data is starting to have a lower level of
quality. The data engineer would like to automate the process of monitoring
the quality level.
Which of the following tools can the data engineer use to solve this problem?

 A. Unity Catalog

 B. Data Explorer

 C. Delta Lake

 D. Delta Live Tables Most Voted

 E. Auto Loader

26. A Delta Live Table pipeline includes two datasets defined using
STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table
sources using LIVE TABLE.
The pipeline is configured to run in Production mode using the Continuous
Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid,
what is the expected outcome after clicking Start to update the pipeline?

 A. All datasets will be updated at set intervals until the pipeline is shut
down. The compute resources will persist to allow for additional
testing.

 B. All datasets will be updated once and the pipeline will persist
without any processing. The compute resources will persist but go
unused.

 C. All datasets will be updated at set intervals until the pipeline is shut
down. The compute resources will be deployed for the update and
terminated when the pipeline is stopped. Most Voted

 D. All datasets will be updated once and the pipeline will shut down.
The compute resources will be terminated.

 E. All datasets will be updated once and the pipeline will shut down.
The compute resources will persist to allow for additional testing.

27. In order for Structured Streaming to reliably track the exact progress of
the processing so that it can handle any kind of failure by restarting and/or
reprocessing, which of the following two approaches is used by Spark to
record the offset range of the data being processed in each trigger?
 A. Checkpointing and Write-ahead Logs Most Voted

 B. Structured Streaming cannot record the offset range of the data being processed in each trigger.

 C. Replayable Sources and Idempotent Sinks

 D. Write-ahead Logs and Idempotent Sinks

 E. Checkpointing and Idempotent Sinks
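
A minimal sketch showing where checkpointing fits: the checkpoint directory stores write-ahead logs of the offset ranges, so a restarted query resumes exactly where it stopped. Table names and the path are hypothetical:

query = (spark.readStream.table("bronze_events")
              .writeStream
              .option("checkpointLocation", "/tmp/checkpoints/silver_events")  # offsets recorded here
              .toTable("silver_events"))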

28. Which of the following describes the relationship between Gold tables and
Silver tables?

 A. Gold tables are more likely to contain aggregations than Silver tables. Most Voted

 B. Gold tables are more likely to contain valuable data than Silver
tables.

 C. Gold tables are more likely to contain a less refined view of data
than Silver tables.

 D. Gold tables are more likely to contain more data than Silver tables.

 E. Gold tables are more likely to contain truthful data than Silver
tables.

29. Which of the following describes the relationship between Bronze tables
and raw data?

 A. Bronze tables contain less data than raw data files.

 B. Bronze tables contain more truthful data than raw data.

 C. Bronze tables contain aggregates while raw data is unaggregated.

 D. Bronze tables contain a less refined view of data than raw data.

 E. Bronze tables contain raw data with a schema applied. Most Voted
30. Which of the following tools is used by Auto Loader to process data
incrementally?

 A. Checkpointing

 B. Spark Structured Streaming Most Voted

 C. Data Explorer

 D. Unity Catalog

 E. Databricks SQL

31. A data engineer has configured a Structured Streaming job to read from a
table, manipulate the data, and then perform a streaming write into a new
table.
The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to
process data every 5 seconds, which of the following lines of code should the
data engineer use to fill in the blank?

 A. trigger("5 seconds")

 B. trigger()

 C. trigger(once="5 seconds")

 D. trigger(processingTime="5 seconds") Most Voted

 E. trigger(continuous="5 seconds")
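
The code block was an image in the original. A hypothetical equivalent with the most-voted answer filled in; processingTime starts a micro-batch on a fixed interval (table names and path are assumptions):

query = (spark.readStream.table("source_table")
              .writeStream
              .trigger(processingTime="5 seconds")   # micro-batch every 5 seconds
              .option("checkpointLocation", "/tmp/checkpoints/target_table")
              .toTable("target_table"))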

32. A dataset has been defined using Delta Live Tables and includes an
expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON
VIOLATION DROP ROW
What is the expected behavior when a batch of data containing data that
violates these constraints is processed?

 A. Records that violate the expectation are dropped from the target
dataset and loaded into a quarantine table.

 B. Records that violate the expectation are added to the target dataset
and flagged as invalid in a field added to the target dataset.

 C. Records that violate the expectation are dropped from the target
dataset and recorded as invalid in the event log. Most Voted

 D. Records that violate the expectation are added to the target dataset
and recorded as invalid in the event log.

 E. Records that violate the expectation cause the job to fail.
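
For comparison, a minimal sketch of the same expectation in the DLT Python API; expect_or_drop likewise drops violating rows and records the counts in the pipeline’s event log. Dataset and source names are hypothetical:

import dlt

@dlt.table
@dlt.expect_or_drop("valid_timestamp", "timestamp > '2020-01-01'")  # drop rows that violate the rule
def validated_events():
    return spark.readStream.table("raw_events")   # hypothetical streaming source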

33. Which of the following describes when to use the CREATE STREAMING
LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the
CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables
using SQL?

 A. CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.

 B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally. Most Voted

 C. CREATE STREAMING LIVE TABLE is redundant for DLT and it does not need to be used.

 D. CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations.

 E. CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.

34. A data engineer is designing a data pipeline. The source system
generates files in a shared directory that is also used by other processes. As
a result, the files should be kept as is and will accumulate in the directory.
The data engineer needs to identify which files are new since the previous
run in the pipeline, and set up the pipeline to only ingest those new files with
each run.
Which of the following tools can the data engineer use to solve this problem?

 A. Unity Catalog

 B. Delta Lake

 C. Databricks SQL

 D. Data Explorer

 E. Auto Loader Most Voted
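
A minimal sketch of the most-voted answer: Auto Loader’s cloudFiles source tracks which files it has already ingested and reads only new arrivals on each run. Paths, format, and table names are hypothetical:

df = (spark.readStream
           .format("cloudFiles")
           .option("cloudFiles.format", "json")                         # source format is an assumption
           .option("cloudFiles.schemaLocation", "/tmp/schemas/ingest")  # lets Auto Loader infer and track the schema
           .load("/shared/source_dir"))
query = (df.writeStream
           .option("checkpointLocation", "/tmp/checkpoints/ingest")     # file-discovery state lives here
           .toTable("bronze_events"))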

35. Which of the following Structured Streaming queries is performing a hop
from a Silver table to a Gold table?

 A.

 B.
 C.

 D.

 E. Most Voted
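
The answer options were images in the original. A hypothetical example of what a Silver-to-Gold hop looks like: a streaming aggregation from a refined table into a business-level summary table (all names are assumptions):

from pyspark.sql.functions import sum as sum_

query = (spark.readStream.table("sales_silver")
              .groupBy("store_id")
              .agg(sum_("amount").alias("total_sales"))
              .writeStream
              .outputMode("complete")                    # aggregations need complete/update output mode
              .option("checkpointLocation", "/tmp/checkpoints/sales_gold")
              .toTable("sales_gold"))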

36. A data engineer has three tables in a Delta Live Tables (DLT) pipeline.
They have configured the pipeline to drop invalid records at each table. They
notice that some data is being dropped due to quality concerns at some
point in the DLT pipeline. They would like to determine at which table in their
pipeline the data is being dropped.
Which of the following approaches can the data engineer take to identify the
table that is dropping the records?

 A. They can set up separate expectations for each table when developing their DLT pipeline.
 B. They cannot determine which table is dropping the records.

 C. They can set up DLT to notify them via email when records are
dropped.

 D. They can navigate to the DLT pipeline page, click on each table, and
view the data quality statistics. Most Voted

 E. They can navigate to the DLT pipeline page, click on the “Error”
button, and review the present errors.

37. A data engineer has a single-task Job that runs each morning before they
begin working. After identifying an upstream data issue, they need to set up
another task to run a new notebook prior to the original task.
Which of the following approaches can the data engineer use to set up the
new task?

 A. They can clone the existing task in the existing Job and update it to
run the new notebook.

 B. They can create a new task in the existing Job and then add it as a
dependency of the original task. Most Voted

 C. They can create a new task in the existing Job and then add the
original task as a dependency of the new task.

 D. They can create a new job from scratch and add both tasks to run
concurrently.

 E. They can clone the existing task to a new Job and then edit it to run
the new notebook.

38. An engineering manager wants to monitor the performance of a recent
project using a Databricks SQL query. For the first week following the
project’s release, the manager wants the query results to be updated every
minute. However, the manager is concerned that the compute resources
used for the query will be left running and cost the organization a lot of
money beyond the first week of the project’s release.
Which of the following approaches can the engineering team use to ensure
the query does not cost the organization any money beyond the first week of
the project’s release?
 A. They can set a limit to the number of DBUs that are consumed by
the SQL Endpoint.

 B. They can set the query’s refresh schedule to end after a certain
number of refreshes.

 C. They cannot ensure the query does not cost the organization money
beyond the first week of the project’s release.

 D. They can set a limit to the number of individuals that are able to
manage the query’s refresh schedule.

 E. They can set the query’s refresh schedule to end on a certain date in
the query scheduler. Most Voted

39. A data analysis team has noticed that their Databricks SQL queries are
running too slowly when connected to their always-on SQL endpoint. They
claim that this issue is present when many members of the team are running
small queries simultaneously. They ask the data engineering team for help.
The data engineering team notices that each of the team’s queries uses the
same SQL endpoint.
Which of the following approaches can the data engineering team use to
improve the latency of the team’s queries?

 A. They can increase the cluster size of the SQL endpoint.

 B. They can increase the maximum bound of the SQL endpoint’s scaling range. Most Voted

 C. They can turn on the Auto Stop feature for the SQL endpoint.

 D. They can turn on the Serverless feature for the SQL endpoint.

 E. They can turn on the Serverless feature for the SQL endpoint and
change the Spot Instance Policy to “Reliability Optimized.”

40. A data engineer wants to schedule their Databricks SQL dashboard to
refresh once per day, but they only want the associated SQL endpoint to be
running when it is necessary.
Which of the following approaches can the data engineer use to minimize the
total running time of the SQL endpoint used in the refresh schedule of their
dashboard?
 A. They can ensure the dashboard’s SQL endpoint matches each of the
queries’ SQL endpoints.

 B. They can set up the dashboard’s SQL endpoint to be serverless.

 C. They can turn on the Auto Stop feature for the SQL endpoint. Most Voted

 D. They can reduce the cluster size of the SQL endpoint.

 E. They can ensure the dashboard’s SQL endpoint is not one of the
included query’s SQL endpoint.
