
Databricks Certified Data Engineer Professional Practice Questions


This PDF contains a set of carefully selected practice questions for the Databricks Certified Data Engineer Professional exam. These questions are designed to reflect the structure, difficulty, and topics covered in the actual exam, helping you reinforce your understanding and identify areas for improvement.

What's Inside:

1. Topic-focused questions based on the latest exam objectives


2. Accurate answer keys to support self-review
3. Designed to simulate the real test environment
4. Ideal for final review or daily practice

Important Note:

This material is for personal study purposes only. Please do not redistribute or use it for commercial purposes without permission.

For full access to the complete question bank and topic-wise explanations, visit:
CertQuestionsBank.com

Our YouTube: https://www.youtube.com/@CertQuestionsBank

FB page: https://www.facebook.com/certquestionsbank
Some sample Databricks Certified Data Engineer Professional exam questions are shared below.
1.Which statement describes Delta Lake Auto Compaction?
A. An asynchronous job runs after the write completes to detect if files could be further compacted;
if yes, an optimize job is executed toward a default of 1 GB.
B. Before a Jobs cluster terminates, optimize is executed on all tables modified during the most
recent job.
C. Optimized writes use logical partitions instead of directory partitions; because partition boundaries
are only represented in metadata, fewer small files are written.
D. Data is queued in a messaging bus instead of committing data directly to memory; all data is
committed from the messaging bus in one batch once the job is complete.
E. An asynchronous job runs after the write completes to detect if files could be further compacted; if
yes, an optimize job is executed toward a default of 128 MB.
Answer: E
Explanation:
This is the correct answer because it describes the behavior of Delta Lake Auto Compaction, which is
a feature that automatically optimizes the layout of Delta Lake tables by coalescing small files into
larger ones. Auto Compaction runs as an asynchronous job after a write to a table has succeeded
and checks if files within a partition can be further compacted. If yes, it runs an optimize job with a
default target file size of 128 MB. Auto Compaction only compacts files that have not been compacted
previously.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section;
Databricks Documentation, under “Auto Compaction for Delta Lake on Databricks” section.
"Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster
that has performed the write. Auto compaction only compacts files that haven’t been compacted
previously."
https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
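For reference, a minimal sketch of enabling Auto Compaction on an existing table via a Delta table property; the table name prod.events is a hypothetical placeholder.

# Enable Auto Compaction so small files written to this table are coalesced
# toward the default target size after each write.
spark.sql("""
  ALTER TABLE prod.events
  SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true)
""")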

2.The security team is exploring whether or not the Databricks secrets module can be leveraged for
connecting to an external database.
After testing the code with all Python variables defined as string literals, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code as follows (leaving all other variables unchanged).
Which statement describes what will happen when the above code is executed?
A. The connection to the external table will fail; the string "redacted" will be printed.
B. An interactive input box will appear in the notebook; if the right password is provided, the
connection will succeed and the encoded password will be saved to DBFS.
C. An interactive input box will appear in the notebook; if the right password is provided, the
connection will succeed and the password will be printed in plain text.
D. The connection to the external table will succeed; the string value of password will be printed in
plain text.
E. The connection to the external table will succeed; the string "redacted" will be printed.
Answer: E
Explanation:
This is the correct answer because the code is using the dbutils.secrets.get method to retrieve the
password from the secrets module and store it in a variable. The secrets module allows users to
securely store and access sensitive information such as passwords, tokens, or API keys. The
connection to the external table will succeed because the password variable will contain the actual
password value. However, when printing the password variable, the string “redacted” will be
displayed instead of the plain text password, as a security measure to prevent exposing sensitive
information in notebooks.
Verified Reference: [Databricks Certified Data Engineer Professional],
under “Security & Governance” section; Databricks Documentation, under “Secrets” section.
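A minimal sketch of the pattern this question describes; the scope name "jdbc", key name "db_password", and connection details are assumptions for illustration only.

# Retrieve the secret and use it in a JDBC connection; printing it in a
# notebook displays a redacted placeholder rather than the secret value.
password = dbutils.secrets.get(scope="jdbc", key="db_password")

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/sales")   # hypothetical host
      .option("dbtable", "transactions")
      .option("user", "svc_user")
      .option("password", password)
      .load())

print(password)   # shows a redacted string, not the plain-text password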

3.A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake, even though the field was present in the Kafka source. The field is also missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days, and the pipeline has been in production for three months.
Which describes how Delta Lake can help to avoid data loss of this nature in the future?
A. The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.
B. Delta Lake schema evolution can retroactively calculate the correct value for newly added fields,
as long as the data was in the original source.
C. Delta Lake automatically checks that all fields present in the source data are included in the
ingestion layer.
D. Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible
under any circumstance.
E. Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.
Answer: E
Explanation:
This is the correct answer because it describes how Delta Lake can help to avoid data loss of this
nature in the future. By ingesting all raw data and metadata from Kafka to a bronze Delta table, Delta
Lake creates a permanent, replayable history of the data state that can be used for recovery or
reprocessing in case of errors or omissions in downstream applications or pipelines. Delta Lake also
supports schema evolution, which allows adding new columns to existing tables without affecting
existing queries or pipelines. Therefore, if a critical field was omitted from an application that writes its
Kafka source to Delta Lake, it can be easily added later and the data can be reprocessed from the
bronze table without losing any information.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section;
Databricks Documentation, under “Delta Lake core features” section.
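A hedged sketch of the bronze-ingestion pattern described above; the broker address, topic name, checkpoint path, and table name are placeholders, not values from the question.

# Read the raw Kafka records (key, value, topic, partition, offset, timestamp)
# and persist them unchanged to a bronze Delta table for replayability.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka-broker:9092")
       .option("subscribe", "orders")
       .load())

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/bronze_orders")
    .toTable("bronze_orders"))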

4.A data engineer needs to capture the settings from an existing pipeline in the workspace, and use them to create and version a JSON file used to create a new pipeline.
Which command should the data engineer enter in a web terminal configured with the Databricks
CLI?
A. Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and
rename the pipeline; use this in a create command
B. Stop the existing pipeline; use the returned settings in a reset command
C. Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git
D. Use list pipelines to get the specs for all pipelines; get the pipeline spec from the returned results, parse it, and use it to create a pipeline
Answer: A
Explanation:
The Databricks CLI provides a way to automate interactions with Databricks services. When dealing
with pipelines, you can use the databricks pipelines get --pipeline-id command to capture the settings
of an existing pipeline in JSON format. This JSON can then be modified by removing the pipeline_id
to prevent conflicts and renaming the pipeline to create a new pipeline. The modified JSON file can
then be used with the databricks pipelines create command to create a new pipeline with those
settings.
Reference: Databricks Documentation on CLI for Pipelines: Databricks CLI - Pipelines

5.A view named updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.
The following logic is used to process these records.
MERGE INTO customers
USING (
  SELECT updates.customer_id AS merge_key, updates.*
  FROM updates
  UNION ALL
  SELECT NULL AS merge_key, updates.*
  FROM updates JOIN customers
    ON updates.customer_id = customers.customer_id
  WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customer_id = staged_updates.merge_key
WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address THEN
  UPDATE SET current = false, end_date = staged_updates.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, current, effective_date, end_date)
  VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)
Which statement describes this implementation?
A. The customers table is implemented as a Type 2 table; old values are overwritten and new
customers are appended.
B. The customers table is implemented as a Type 1 table; old values are overwritten by new values
and no history is maintained.
C. The customers table is implemented as a Type 2 table; old values are maintained but marked as
no longer current and new values are inserted.
D. The customers table is implemented as a Type 0 table; all writes are append only with no changes
to existing values.
Answer: C
Explanation:
The provided MERGE statement is a classic implementation of a Type 2 SCD in a data warehousing
context. In this approach, historical data is preserved by keeping old records (marking them as not
current) and adding new records for changes. Specifically, when a match is found and there's a
change in the address, the existing record in the customers table is updated to mark it as no longer
current (current = false), and an end date is assigned (end_date = staged_updates.effective_date). A
new record for the customer is then inserted with the updated information, marked as current. This
method ensures that the full history of changes to customer information is maintained in the table,
allowing for time-based analysis of customer data.
Reference: Databricks documentation on implementing SCDs using Delta Lake and the MERGE
statement (https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge).

6.A table named user_ltv is being used to create a view that will be used by data analysts on various
teams. Users in the workspace are configured into groups, which are used for setting up data access
using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
An analyst who is not a member of the marketing group executes the following query:
SELECT * FROM email_ltv
Which statement describes the results returned by this query?
A. Three columns will be returned, but one column will be named "redacted" and contain only null
values.
B. Only the email and ltv columns will be returned; the email column will contain all null values.
C. The email and ltv columns will be returned with the values in user_ltv.
D. The email, age, and ltv columns will be returned with the values in user_ltv.
E. Only the email and ltv columns will be returned; the email column will contain the string
"REDACTED" in each row.
Answer: E
Explanation:
The code creates a view called email_ltv that selects the email and ltv columns from a table called
user_ltv, which has the following schema: email STRING, age INT, ltv INT. The code also uses the
CASE WHEN expression to replace the email values with the string “REDACTED” if the user is not a
member of the marketing group. The user who executes the query is not a member of the marketing
group, so they will only see the email and ltv columns, and the email column will contain the string
“REDACTED” in each row.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Lakehouse” section;
Databricks Documentation, under “CASE expression” section.
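The view definition itself is not reproduced in this extract; a plausible reconstruction consistent with the explanation above, assuming Databricks' is_member() function and the marketing group name, would be:

# Redact the email column for users outside the marketing group.
spark.sql("""
  CREATE OR REPLACE VIEW email_ltv AS
  SELECT
    CASE WHEN is_member('marketing') THEN email ELSE 'REDACTED' END AS email,
    ltv
  FROM user_ltv
""")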

7.A Delta Lake table was created with the below query:
Realizing that the original query had a typographical error, the below code was executed:
ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store
Which result will occur after running the second command?
A. The table reference in the metastore is updated and no data is changed.
B. The table name change is recorded in the Delta transaction log.
C. All related files and metadata are dropped and recreated in a single ACID transaction.
D. The table reference in the metastore is updated and all data files are moved.
E. A new Delta transaction log is created for the renamed table.
Answer: A
Explanation:
The query uses the CREATE TABLE USING DELTA syntax to create a Delta Lake table from an
existing Parquet file stored in DBFS. The query also uses the LOCATION keyword to specify the path
to the Parquet file as /mnt/finance_eda_bucket/tx_sales.parquet. By using the LOCATION keyword,
the query creates an external table, which is a table that is stored outside of the default warehouse
directory and whose metadata is not managed by Databricks. An external table can be created from
an existing directory in a cloud storage system, such as DBFS or S3, that contains data files in a
supported format, such as Parquet or CSV.
The result that will occur after running the second command is that the table reference in the
metastore is updated and no data is changed. The metastore is a service that stores metadata about
tables, such as their schema, location, properties, and partitions. The metastore allows users to
access tables using SQL commands or Spark APIs without knowing their physical location or format.
When renaming an external table using the ALTER TABLE RENAME TO command, only the table
reference in the metastore is updated with the new name; no data files or directories are moved or
changed in the storage system. The table will still point to the same location and use the same format
as before. However, if renaming a managed table, which is a table whose metadata and data are both
managed by Databricks, both the table reference in the metastore and the data files in the default
warehouse directory are moved and renamed accordingly.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Delta Lake” section;
Databricks Documentation, under “ALTER TABLE RENAME TO” section; Databricks Documentation,
under “Metastore” section; Databricks Documentation, under “Managed and external tables”
section.

8.A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate
new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
A) def new_records():
B) return spark.readStream.table("bronze")
C) return spark.readStream.load("bronze")
D) return spark.read.option("readChangeFeed", "true").table("bronze")
E)

A. Option A
B. Option B
C. Option C
D. Option D
E. Option E
Answer: E
Explanation:
https://docs.databricks.com/en/delta/delta-change-data-feed.html
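The code shown in option E is not reproduced in this extract; as a hedged sketch of the kind of streaming read the linked change-data-feed documentation describes (the table name follows the question, and change data feed is assumed to be enabled on it):

def new_records():
    # Stream only changes that have not yet been processed downstream,
    # using the Delta change data feed on the bronze table.
    return (spark.readStream
            .option("readChangeFeed", "true")
            .table("bronze"))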

9.A data architect has heard about Delta Lake's built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full history of all valid street addresses as they appear in the customers table.
The architect is interested in implementing a Type 1 table, overwriting existing records with new
values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the
project feels that a Type 2 table will provide better performance and scalability.
Which piece of information is critical to this decision?
A. Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning
solution.
B. Delta Lake time travel cannot be used to query previous versions of these tables because Type 1
changes modify data files in place.
C. Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term
versioning.
D. Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update.
Answer: A
Explanation:
Delta Lake's time travel feature allows users to access previous versions of a table, providing a
powerful tool for auditing and versioning. However, using time travel as a long-term versioning
solution for auditing purposes can be less optimal in terms of cost and performance, especially as the
volume of data and the number of versions grow. For maintaining a full history of valid street
addresses as they appear in a customers table, using a Type 2 table (where each update creates a
new record with versioning) might provide better scalability and performance by avoiding the
overhead associated with accessing older versions of a large table. While Type 1 tables, where
existing records are overwritten with new values, seem simpler and can leverage time travel for
auditing, the critical piece of information is that time travel might not scale well in cost or latency for
long-term versioning needs, making a Type 2 approach more viable for performance and scalability.
Reference: Databricks Documentation on Delta Lake's Time Travel: Delta Lake Time Travel
Databricks Blog on Managing Slowly Changing Dimensions in Delta Lake: Managing SCDs in Delta
Lake
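For context, a minimal illustration of querying history through time travel (the version number and table name are illustrative); each such read reconstructs an older snapshot, which is exactly the cost and latency concern at long retention horizons.

# Query an earlier snapshot of the table by version (TIMESTAMP AS OF also works).
previous_customers = spark.sql("SELECT * FROM customers VERSION AS OF 5")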

10.The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing on both the code and the data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible.
A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have Admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.
Which statement captures best practices for this situation?
A. Because access to production data will always be verified using passthrough credentials it is safe
to mount data to any Databricks development environment.
B. All developer, testing and production code and data should exist in a single unified workspace;
creating separate environments for testing and development further reduces risks.
C. In environments where interactive code will be executed, production data should only be
accessible with read permissions; creating isolated databases for each environment further reduces
risks.
D. Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
Answer: C
Explanation:
The best practice in such scenarios is to ensure that production data is handled securely and with
proper access controls. By granting only read access to production data in development and testing
environments, it mitigates the risk of unintended data modification. Additionally, maintaining isolated
databases for different environments helps to avoid accidental impacts on production data and
systems.
Reference: Databricks best practices for securing data:
https://docs.databricks.com/security/index.html
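A hedged sketch of the read-only pattern, using Unity Catalog-style grants; the schema name prod_sales and the group name are assumptions for illustration.

# Grant development users read-only access to production data; no write privileges.
spark.sql("GRANT USE SCHEMA ON SCHEMA prod_sales TO `data-engineers`")
spark.sql("GRANT SELECT ON SCHEMA prod_sales TO `data-engineers`")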

11.In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in that directory, and automatically evolve the schema of the table when new fields are detected.
The function is displayed below with a blank:
Which response correctly fills in the blank to meet the specified requirements?
A. Option A
B. Option B
C. Option C
D. Option D
E. Option E
Answer: B
Explanation:
Option B correctly fills in the blank to meet the specified requirements. Option B uses the
“cloudFiles.schemaLocation” option, which is required for the schema detection and evolution
functionality of Databricks Auto Loader. Additionally, option B uses the “mergeSchema” option, which
is required for the schema evolution functionality of Databricks Auto Loader. Finally, option B uses the
“writeStream” method, which is required for the incremental processing of JSON files as they arrive
in a source directory. The other options are incorrect because they either omit the required options,
use the wrong method, or use the wrong format.
Reference: Configure schema inference and evolution in Auto Loader:
https://docs.databricks.com/en/ingestion/auto-loader/schema.html
Write streaming data: https://docs.databricks.com/spark/latest/structured-streaming/writing-streaming-data.html
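The answer options themselves are not reproduced in this extract; a hedged sketch of the kind of helper described, with placeholder paths and table name, looks like this:

def ingest_json(source_dir, schema_dir, checkpoint_dir, target_table):
    # Auto Loader infers the schema into schema_dir and evolves it as new
    # fields appear; mergeSchema lets the Delta sink accept the evolved schema.
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", schema_dir)
            .load(source_dir)
            .writeStream
            .option("checkpointLocation", checkpoint_dir)
            .option("mergeSchema", "true")
            .toTable(target_table))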

12.A Databricks SQL dashboard has been configured to monitor the total number of records present
in a collection of Delta Lake tables using the following query pattern:
SELECT COUNT(*) FROM table
Which of the following describes how results are generated each time the dashboard is updated?
A. The total count of rows is calculated by scanning all data files
B. The total count of rows will be returned from cached results unless REFRESH is run
C. The total count of records is calculated from the Delta transaction logs
D. The total count of records is calculated from the parquet file metadata
E. The total count of records is calculated from the Hive metastore
Answer: C
Explanation:
Delta Lake maintains file-level statistics, including record counts, in the transaction log, so a full-table COUNT(*) can be answered from this metadata without scanning the underlying data files.
https://delta.io/blog/2023-04-19-faster-aggregations-metadata/

13.The Databricks workspace administrator has configured interactive clusters for each of the data
engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each
user should be able to execute workloads against their assigned clusters at any time of the day.
Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?
A. "Can Manage" privileges on the required cluster
B. Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the required
cluster
C. Cluster creation allowed. "Can Attach To" privileges on the required cluster
D. "Can Restart" privileges on the required cluster
E. Cluster creation allowed. "Can Restart" privileges on the required cluster
Answer: D
Explanation:
"Can Restart" includes the ability to attach to the cluster as well as to start or restart it after auto-termination, making it the minimal privilege that satisfies the requirement.
https://learn.microsoft.com/en-us/azure/databricks/security/auth-authz/access-control/cluster-acl
https://docs.databricks.com/en/security/auth-authz/access-control/cluster-acl.html

14.The following table consists of items found in user carts within an e-commerce website.

The following MERGE statement is used to update this table using an updates view, with schema evolution enabled on this table.

How would the following update be handled?


A. The update is moved to a separate 'restored' column because it is missing a column expected in the target schema.
B. The new restored field is added to the target schema, and dynamically read as NULL for existing
unmatched records.
C. The update throws an error because changes to existing columns in the target schema are not
supported.
D. The new nested field is added to the target schema, and files underlying existing records are
updated to include NULL values for the new field.
Answer: D
Explanation:
With schema evolution enabled in Databricks Delta tables, when a new field is added to a record
through a MERGE operation, Databricks automatically modifies the table schema to include the new
field. In existing records where this new field is not present, Databricks will insert NULL values for that
field. This ensures that the schema remains consistent across all records in the table, with the new
field being present in every record, even if it is NULL for records that did not originally include it.
Reference: Databricks documentation on schema evolution in Delta Lake:
https://docs.databricks.com/delta/delta-batch.html#schema-evolution
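A minimal sketch of enabling automatic schema evolution for MERGE; the table, view, and key names are hypothetical placeholders.

# With autoMerge enabled, a new field in the updates view is added to the
# target schema; existing rows surface NULL for the new field.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
spark.sql("""
  MERGE INTO carts AS t
  USING updates AS u
    ON t.cart_id = u.cart_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")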

15.A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.
Which consideration will impact the decisions made by the engineer while migrating this workload?
A. All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.
B. Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-
parallel writes.
C. Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's
upsert functionality.
D. Committing to multiple tables simultaneously requires taking out multiple table locks and can lead
to a state of deadlock.
Answer: A
Explanation:
In Databricks and Delta Lake, transactions are indeed ACID-compliant, but this compliance is limited
to single table transactions. Delta Lake does not inherently enforce foreign key constraints, which are
a staple in relational database systems for maintaining referential integrity between tables. This
means that when migrating workloads from a relational database system to Databricks Lakehouse,
engineers need to reconsider how to maintain data integrity and relationships that were previously
enforced by foreign key constraints. Unlike traditional relational databases where foreign key
constraints help in maintaining the consistency across tables, in Databricks Lakehouse, the data
engineer has to manage data consistency and integrity at the application level or through careful
design of ETL processes.
Reference: Databricks Documentation on Delta Lake: Delta Lake Guide
Databricks Documentation on ACID Transactions in Delta Lake: ACID Transactions in Delta Lake

16.Which statement describes Delta Lake optimized writes?


A. A shuffle occurs prior to writing to try to group data together resulting in fewer files instead of each
executor writing multiple files based on directory partitions.
B. Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
C. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
D. Before a job cluster terminates, OPTIMIZE is executed on all tables modified during the most
recent job.
Answer: A
Explanation:
Delta Lake optimized writes involve a shuffle operation before writing out data to the Delta table.
The shuffle operation groups data by partition keys, which can lead to a reduction in the number of
output files and potentially larger files, instead of multiple smaller files. This approach can significantly
reduce the total number of files in the table, improve read performance by reducing the metadata
overhead, and optimize the table storage layout, especially for workloads with many small files.
Reference: Databricks documentation on Delta Lake performance tuning:
https://docs.databricks.com/delta/optimizations/auto-optimize.html
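For reference, a minimal sketch of enabling optimized writes at the table level; the table name is hypothetical.

# The pre-write shuffle groups data by partition so each partition receives
# fewer, larger files.
spark.sql("""
  ALTER TABLE prod.sales
  SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")
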
17.Which of the following technologies can be used to identify key areas of text when parsing Spark
Driver log4j output?
A. Regex
B. Julia
C. pyspark.ml.feature
D. Scala Datasets
E. C++
Answer: A
Explanation:
Regex, or regular expressions, are a powerful way of matching patterns in text. They can be used to
identify key areas of text when parsing Spark Driver log4j output, such as the log level, the
timestamp, the thread name, the class name, the method name, and the message. Regex can be
applied in various languages and frameworks, such as Scala, Python, Java, Spark SQL, and Databricks notebooks.
Reference:
https://docs.databricks.com/notebooks/notebooks-use.html#use-regular-expressions
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html#using-regular-expressions-in-udfs
https://docs.databricks.com/spark/latest/sparkr/functions/regexp_extract.html
https://docs.databricks.com/spark/latest/sparkr/functions/regexp_replace.html
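A hedged example of applying a regular expression to driver log lines; the log path and the log4j timestamp pattern are assumptions about the cluster's logging configuration.

from pyspark.sql.functions import regexp_extract

logs = spark.read.text("/databricks/driver/logs/log4j-active.log")  # assumed path
parsed = logs.select(
    # Capture the log level and the message following the "yy/MM/dd HH:mm:ss" prefix.
    regexp_extract("value", r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (\w+)", 1).alias("level"),
    regexp_extract("value", r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \w+ (.*)$", 1).alias("message"),
)
parsed.show(truncate=False)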

18.All records from an Apache Kafka producer are being ingested into a single Delta Lake table with
the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only
retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII
information, it would like to retain these records indefinitely.
Which of the following solutions meets the requirements?
A. All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to
maintain a history of non-PII information.
B. Data should be partitioned by the registration field, allowing ACLs and delete statements to be set
for the PII directory.
C. Because the value field is stored as binary data, this information is not considered PII and no
special precautions should be taken.
D. Separate object storage containers should be specified based on the partition field, allowing
isolation at the storage level.
E. Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage
partition boundaries.
Answer: E
Explanation:
Partitioning the data by the topic field allows the company to apply different access control policies
and retention policies for different topics. For example, the company can use the Table Access
Control feature to grant or revoke permissions to the registration topic based on user roles or groups.
The company can also use the DELETE command to remove records from the registration topic that
are older than 14 days, while keeping the records from other topics indefinitely. Partitioning by the
topic field also improves the performance of queries that filter by the topic field, as they can skip
reading irrelevant partitions.
Reference: Table Access Control: https://docs.databricks.com/security/access-control/table-acls/index.html
DELETE: https://docs.databricks.com/delta/delta-update.html#delete-from-a-table
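A hedged sketch of the retention half of the requirement, assuming a table named kafka_events partitioned by topic, with the Kafka timestamp column in epoch milliseconds.

# Remove PII records older than 14 days; records from non-PII topics are untouched.
spark.sql("""
  DELETE FROM kafka_events
  WHERE topic = 'registration'
    AND timestamp < unix_timestamp(current_timestamp() - INTERVAL 14 DAYS) * 1000
""")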

19.Which configuration parameter directly affects the size of a spark-partition upon ingestion of data
into Spark?
A. spark.sql.files.maxPartitionBytes
B. spark.sql.autoBroadcastJoinThreshold
C. spark.sql.files.openCostInBytes
D. spark.sql.adaptive.coalescePartitions.minPartitionNum
E. spark.sql.adaptive.advisoryPartitionSizeInBytes
Answer: A
Explanation:
This is the correct answer because spark.sql.files.maxPartitionBytes is a configuration parameter that
directly affects the size of a spark-partition upon ingestion of data into Spark. This parameter
configures the maximum number of bytes to pack into a single partition when reading files from file-
based sources such as Parquet, JSON and ORC. The default value is 128 MB, which means each
partition will be roughly 128 MB in size, unless there are too many small files or only one large file.
Verified Reference: [Databricks Certified Data Engineer Professional], under “Spark Configuration”
section; Databricks Documentation, under “Available Properties - spark.sql.files.maxPartitionBytes”
section.
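For illustration, the parameter can be tuned per session; the 64 MB value shown is arbitrary.

# Smaller input partitions increase read parallelism at the cost of more tasks.
spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB")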

Get the Databricks Certified Data Engineer Professional exam dumps full version.
