eBAY QA 1

The document provides a series of questions and answers related to SQL commands for data engineering tasks, focusing on database creation, table management, and data extraction techniques. It emphasizes the importance of using correct SQL syntax and understanding the implications of different commands, such as creating managed tables and handling duplicates in Delta tables. Key takeaways include the necessity of specifying locations when creating databases, the advantages of Delta format for data updates, and the use of SELECT DISTINCT for removing duplicates.


Q12

A data engineer needs to create a database called customer360 at the location /customer/customer360. The data engineer is unsure if one of their colleagues has already created the database.

Which of the following commands should the data engineer run to complete this task?

A. CREATE DATABASE customer360 LOCATION '/customer/customer360';

B. CREATE DATABASE IF NOT EXISTS customer360;

C. CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';

D. CREATE DATABASE IF NOT EXISTS customer360 DELTA LOCATION '/customer/customer360';

E. CREATE DATABASE customer360 DELTA LOCATION '/customer/customer360';

The correct answer is:

C. CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';

Explanation:

• CREATE DATABASE IF NOT EXISTS ensures that the command won't fail if the database has already been created by a colleague.

• The LOCATION '/customer/customer360' clause ensures that the database is physically stored at the specified path in the file system.

• Option C is the only one that:

o Checks for an existing database (to avoid an error),

o Specifies the correct custom location.

Why the other options are incorrect:

• A: Will fail if the database already exists.

• B: Doesn’t specify the location.

• D and E: The DELTA keyword is not valid syntax in standard CREATE DATABASE
statements (you specify Delta format at the table level, not database level).
So, the best and correct answer is:
C. CREATE DATABASE IF NOT EXISTS customer360 LOCATION
'/customer/customer360';
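
A quick way to sanity-check the result is to run the command and then inspect where the database actually lives; the sketch below assumes a Databricks/Spark SQL session and uses only standard commands.

CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';

-- Confirm the location; this succeeds whether or not the database already existed:
DESCRIBE DATABASE EXTENDED customer360;

One caveat worth knowing: if a colleague already created customer360 at a different path, IF NOT EXISTS leaves the existing database untouched, because the LOCATION clause only takes effect when the database is actually created.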

Q13

A junior data engineer needs to create a Spark SQL table my_table for which Spark
manages both the data and the metadata. The metadata and data should also be stored
in the Databricks Filesystem (DBFS).

Which of the following commands should a senior data engineer share with the junior
data engineer to complete this task?

A. CREATE TABLE my_table (id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "storage-path");

B. CREATE MANAGED TABLE my_table (id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "storage-path");

C. CREATE MANAGED TABLE my_table (id STRING, value STRING);

D. CREATE TABLE my_table (id STRING, value STRING) USING DBFS;

E. CREATE TABLE my_table (id STRING, value STRING);

The correct answer is:

E. CREATE TABLE my_table (id STRING, value STRING);

Explanation:

This command creates a managed table by default in Spark SQL and Databricks.

In a managed table, Spark manages both the data and metadata, which is exactly
what's required.

Data and metadata are stored in DBFS by default when no external LOCATION or
OPTIONS(PATH) is provided.

The USING clause and OPTIONS(PATH) are typically used for external tables, where you
manage the storage location manually.

So, simply using CREATE TABLE without a LOCATION clause or OPTIONS(PATH) makes it
a managed table stored in DBFS by default on Databricks.
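
As a small illustration of the managed-table behavior, the sketch below creates the table and inspects its metadata; DESCRIBE TABLE EXTENDED is standard Spark SQL, and the default warehouse path shown in the comment is the typical Databricks location for managed tables.

CREATE TABLE my_table (id STRING, value STRING);

-- The output includes a Type row reporting MANAGED and a Location row pointing at the
-- metastore's default DBFS path (e.g. dbfs:/user/hive/warehouse/my_table):
DESCRIBE TABLE EXTENDED my_table;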

Q15

A data engineering team has created a series of tables using Parquet data stored in an external system. The team is noticing that after appending new rows to the data in the external system, their queries within Databricks are not returning the new rows. They identify the caching of the previous data as the cause of this issue.

Which of the following approaches will ensure that the data returned by queries is always up-to-date?

A. The tables should be converted to the Delta format

B. The tables should be stored in a cloud-based external system

C. The tables should be refreshed in the writing cluster before the next query is run

D. The tables should be altered to include metadata to not cache

E. The tables should be updated before the next query is run

The correct answer is:

A. The tables should be converted to the Delta format

Explanation:

• The issue described — queries not returning updated data after new rows are appended in an external system — is typically due to caching or the non-transactional nature of Parquet files.

• Parquet tables are not automatically updated in Databricks when the underlying files change, especially if the table was registered as a cached or managed table.

Why Delta format solves this:

• Delta Lake provides ACID transactions, schema enforcement, and automatic metadata management.

• When using Delta tables in Databricks, changes to the underlying data are automatically reflected in queries, especially if you avoid explicitly caching the tables.

• Delta format supports features like time travel, MERGE, and optimized upserts,
making it much more robust for frequent data updates.

Why the other options are incorrect:


• B: Storing data in the cloud (e.g., S3 or ADLS) doesn't inherently solve the caching/refresh problem.

• C: Manually refreshing is not scalable or reliable.

• D: There is no metadata flag in table schema to "not cache"; caching behavior must be managed explicitly using CACHE or UNCACHE.

• E: Updating the table manually still doesn't ensure up-to-date results unless caching and metadata refresh are handled correctly.

Final Answer:

A. The tables should be converted to the Delta format
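
A minimal sketch of the conversion, assuming the Parquet data sits at an illustrative path and may also be registered as a metastore table; CONVERT TO DELTA is Databricks SQL.

-- Convert Parquet files in place to Delta (the path is illustrative):
CONVERT TO DELTA parquet.`/mnt/external/sales`;

-- Or, if the Parquet table is already registered in the metastore:
CONVERT TO DELTA sales;

Once the tables are Delta, appends made through Delta are reflected in subsequent queries without manual refreshes, since every query reads the latest committed snapshot from the transaction log.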

Q16

A table customerLocations exists with the following schema:

id STRING,
date STRING,
city STRING,
country STRING

A senior data engineer wants to create a new table from this table using the following command:

CREATE TABLE customersPerCountry AS
SELECT country, COUNT(*) AS customers
FROM customerLocations
GROUP BY country;

A junior data engineer asks why the schema is not being declared for the new table. Which of the following responses explains why declaring the schema is not necessary?

A. CREATE TABLE AS SELECT statements adopt schema details from the source table and query.

B. CREATE TABLE AS SELECT statements infer the schema by scanning the data.

C. CREATE TABLE AS SELECT statements result in tables where schemas are optional.

D. CREATE TABLE AS SELECT statements assign all columns the type STRING.

E. CREATE TABLE AS SELECT statements result in tables that do not support schemas.

The correct answer is:

A. CREATE TABLE AS SELECT statements adopt schema details from the source table and query.

Explanation:

A CREATE TABLE AS SELECT (CTAS) statement creates a new table based on the result
of a query.

The schema of the new table is automatically derived from the columns and their types
in the SELECT clause.

Therefore, manually declaring the schema is not necessary — it's inferred directly from
the source table's schema and the transformations/aggregations applied in the query.

Why not the others?

B is misleading — CTAS doesn't "scan the data" to infer schema like schema inference
on raw files; it uses the schema from the query result.

C is vague — schemas aren't optional; they're just inferred.

D is incorrect — CTAS does not assign all columns the STRING type.

E is false — CTAS-created tables absolutely do support schemas.
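
A short sketch of the point: run the CTAS from the question and inspect the schema that was derived for it; DESCRIBE is standard Spark SQL, and the BIGINT noted in the comment is the result type Spark assigns to COUNT(*).

CREATE TABLE customersPerCountry AS
SELECT country, COUNT(*) AS customers
FROM customerLocations
GROUP BY country;

-- The schema comes from the query result, not from a declaration:
--   country    STRING  (carried over from customerLocations)
--   customers  BIGINT  (the result type of COUNT(*))
DESCRIBE customersPerCountry;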

Q17

A data engineer is overwriting data in a table by deleting the table and recreating the table. Another data engineer suggests that this is inefficient and the table should simply be overwritten instead.

Which of the following reasons to overwrite the table instead of deleting and recreating the table is incorrect?

A. Overwriting a table is efficient because no files need to be deleted.

B. Overwriting a table results in a clean table history for logging and audit purposes.

C. Overwriting a table maintains the old version of the table for Time Travel.

D. Overwriting a table is an atomic operation and will not leave the table in an unfinished state.

E. Overwriting a table allows for concurrent queries to be completed while in progress.

The correct answer is:

B. Overwriting a table results in a clean table history for logging and audit purposes.

Explanation:

This statement is incorrect because overwriting a table does not "clean" the table
history — in fact, Delta Lake preserves the table history, including overwrite
operations, for logging, audit, and Time Travel purposes. That means:

• The overwrite is recorded in the transaction log, and

• You can still access previous versions of the table (unless they've been
vacuumed or the retention period has passed).

Why the other options are correct:

• A. True: Overwriting a table writes new files and marks old ones as deleted — it's more efficient than deleting and recreating the entire table structure.

• C. True: Delta Lake maintains the old version of the table, enabling Time Travel.

• D. True: Overwriting is an atomic operation in Delta Lake; either the new data fully replaces the old data, or the operation fails cleanly.

• E. True: Delta Lake supports concurrent reads; queries in progress during an overwrite read from the old snapshot.

Final Answer:

B. Overwriting a table results in a clean table history for logging and audit
purposes.
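
A minimal sketch of overwriting a Delta table in place and confirming that the history survives; the table names are illustrative, and CREATE OR REPLACE TABLE, DESCRIBE HISTORY, and VERSION AS OF are Databricks Delta SQL.

-- Atomically replace the table's contents instead of dropping and recreating it
-- (sales and staging_sales are illustrative names):
CREATE OR REPLACE TABLE sales AS
SELECT * FROM staging_sales;

-- The overwrite is recorded in the transaction log, so the history is preserved:
DESCRIBE HISTORY sales;

-- Time Travel back to a pre-overwrite version (the version number is illustrative):
SELECT * FROM sales VERSION AS OF 1;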

Q18

Which of the following commands will return records from an existing Delta table my_table where duplicates have been removed?

A. DROP DUPLICATES FROM my_table;

B. SELECT * FROM my_table WHERE duplicate = False;

C. SELECT DISTINCT * FROM my_table;

D. MERGE INTO my_table a USING new_records b ON a.id = b.id WHEN NOT MATCHED THEN INSERT *;

E. MERGE INTO my_table a USING new_records b;

The correct answer is:

C. SELECT DISTINCT * FROM my_table;

Explanation:

• If you want to form a set of records from an existing Delta table with
duplicates removed, the simplest and correct way is to use:


SELECT DISTINCT * FROM my_table;

• SELECT DISTINCT removes duplicate rows based on all columns, which aligns
with the question.

Why the other options are incorrect:

• A. DROP DUPLICATES FROM my_table;


Invalid SQL syntax — there's no such SQL command as DROP DUPLICATES.
• B. SELECT * FROM my_table WHERE duplicate = False;
Assumes a duplicate column exists and tracks duplicate status, which is not
standard or guaranteed.

• D. MERGE INTO my_table a USING new_records b ON a.id = b.id WHEN NOT MATCHED THEN INSERT *;
This is for merging new data into an existing table, not for removing duplicates from existing data.

• E. MERGE INTO my_table a USING new_records b;


Incomplete syntax and still not related to de-duplication of existing table
records.

Final Answer:

C. SELECT DISTINCT * FROM my_table;
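
If the goal is to persist the de-duplicated rows rather than just query them, the SELECT DISTINCT can be wrapped in a write; this sketch assumes Delta tables and uses an illustrative target name my_table_deduped.

-- Write the distinct rows to a new table (my_table_deduped is an illustrative name):
CREATE OR REPLACE TABLE my_table_deduped AS
SELECT DISTINCT * FROM my_table;

-- Or overwrite my_table itself; Delta's snapshot isolation lets the SELECT read the
-- pre-overwrite version of the same table it is overwriting:
INSERT OVERWRITE TABLE my_table
SELECT DISTINCT * FROM my_table;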

Q21

A data engineer has ingested a JSON file into a table named "raw_table" with a schema
containing transaction_id (STRING) and payload (ARRAY). The data engineer wants to
efficiently extract the date of each transaction into a new table with the schema
transaction_id (STRING) and date (TIMESTAMP). Which of the following commands
should the data engineer run to complete this task?

1. SELECT transaction_id, explode(payload) FROM raw_table;
2. SELECT transaction_id, payload.date FROM raw_table;
3. SELECT transaction_id, date FROM raw_table;
4. SELECT transaction_id, payload[date] FROM raw_table;
5. SELECT transaction_id, date from payload FROM raw_table;

Given:

• Table: raw_table

• Schema:

o transaction_id (STRING)

o payload (ARRAY of STRUCTs), where each struct includes:

▪ customer_id (STRING)

▪ date (TIMESTAMP)

▪ store_id (STRING)
Goal:

Create a table with this schema:

• transaction_id (STRING)

• date (TIMESTAMP)

That means we want:

• To extract the date from each struct inside the payload array.

• To flatten the array (payload) so each struct becomes its own row.

Correct SQL Concept:

You need to explode or unnest the payload array, then select transaction_id and the
date field from each struct.

In Spark SQL or Hive:


SELECT transaction_id, p.date

FROM raw_table

LATERAL VIEW explode(payload) exploded AS p;

In BigQuery:


SELECT transaction_id, p.date

FROM raw_table, UNNEST(payload) AS p;

Now evaluate the options:

1. SELECT transaction_id, explode(payload) FROM raw_table;

o Explodes the array, returning each struct.

o But doesn't select the date field specifically.


o Close, but incomplete — you'd still need to access payload.date.

o Closest correct among the options given.

2. SELECT transaction_id, payload.date FROM raw_table;

o Doesn't produce the target schema. In Spark SQL, referencing .date on an array of structs returns an ARRAY<TIMESTAMP> per row rather than one TIMESTAMP per record, and the array is never flattened.

3. SELECT transaction_id, date FROM raw_table;

o Invalid. date is not a top-level column.

4. SELECT transaction_id, payload[date] FROM raw_table;

o Invalid syntax. Array elements can’t be accessed like this.

5. SELECT transaction_id, date from payload FROM raw_table;

o Invalid SQL syntax.

Correct Answer:

SELECT transaction_id, explode(payload) FROM raw_table;

Even though it's not fully extracting date, it's the only valid syntax among the options
that starts the correct process.
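
Putting the pieces together, a full extraction into the target schema might look like the sketch below; transaction_dates and the exploded alias are illustrative names, and the LATERAL VIEW form is standard Spark SQL.

-- One row per struct in payload, keeping only the fields the target schema needs:
CREATE TABLE transaction_dates AS
SELECT transaction_id, p.date
FROM raw_table
LATERAL VIEW explode(payload) exploded AS p;

-- Resulting schema: transaction_id STRING, date TIMESTAMP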
