dp-203 Dedb75bd432f

Certy IQ
Premium exam material

Get certification quickly with the CertyIQ Premium exam material.
Everything you need to prepare, learn & pass your certification exam easily. Lifetime free updates
First attempt guaranteed success.
https://www.CertyIQ.com
Microsoft
(DP-203)
Data Engineering on Microsoft Azure
Total: 317 Questions

Link: https://certyiq.com/papers?provider=microsoft&exam=dp-203
Question: 1 CertyIQ
You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following
Transact-SQL statement.
You need to alter the table to meet the following requirements:

✑ Ensure that users can identify the current manager of employees.
✑ Support creating an employee reporting hierarchy for your entire company.
✑ Provide fast lookup of the managers' attributes such as name and job title.
Which column should you add to the table?
A. [ManagerEmployeeID] [smallint] NULL

B. [ManagerEmployeeKey] [smallint] NULL
C. [ManagerEmployeeKey] [int] NULL
D. [ManagerName] [varchar](200) NULL
Answer: C
Explanation:
We need an extra column to identify the Manager. Use the data type as the EmployeeKey column, an int
column.
Reference:
https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular
Question: 2 CertyIQ
You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named
mytestdb.
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.
CREATE TABLE mytestdb.myParquetTable(
EmployeeID int,
EmployeeName string,
EmployeeStartDate date)
USING Parquet -
You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.
One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.
SELECT EmployeeID -
FROM mytestdb.dbo.myParquetTable
WHERE EmployeeName = 'Alice';
What will be returned by the query?
A. 24
B. an error
C. a null value
Answer: B
Explanation:
But not because of the lowercase. The case has nothing to do with the error.
If you look attentively, you will notice that we create table mytestdb.myParquetTable, but the select
statement contains the reference to table mytestdb.dbo.myParquetTable (!!! - dbo).
Here is the error message I got:
Error: spark_catalog requires a single-part namespace, but got [mytestdb, dbo].
Question: 3 CertyIQ
DRAG DROP -
You have a table named SalesFact in an enterprise data warehouse in Azure Synapse Analytics. SalesFact contains
sales data from the past 36 months and has the following characteristics:
✑ Is partitioned by month
✑ Contains one billion rows
✑ Has clustered columnstore index
At the beginning of each month, you need to remove data from SalesFact that is older than 36 months as quickly
as possible.
Which three actions should you perform in sequence in a stored procedure? To answer, move the appropriate
actions from the list of actions to the answer area and arrange them in the correct order.
Select and Place:
Answer:
Explanation:
Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.
Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.
SQL Data Warehouse supports partition splitting, merging, and switching. To switch partitions between two
tables, you must ensure that the partitions align on their respective boundaries and that the table definitions
match.
Loading data into partitions with partition switching is a convenient way stage new data in a table that is not
visible to users the switch in the new data.
Step 3: Drop the SalesFact_Work table.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition
Question: 4 CertyIQ
You have files and folders in Azure Data Lake Storage Gen2 for an Azure Synapse workspace as shown in the
following exhibit.
You create an external table named ExtTable that has LOCATION='/topfolder/'.
When you query ExtTable by using an Azure Synapse Analytics serverless SQL pool, which files are returned?
A. File2.csv and File3.csv only

B. File1.csv and File4.csv only
C. File1.csv, File2.csv, File3.csv, and File4.csv
D. File1.csv only
Answer: B
Explanation:
In case of a serverless pool a wildcard should be added to the location.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?
tabs=hadoop#arguments-create-external-table
Question: 5 CertyIQ
HOTSPOT -
You are planning the deployment of Azure Data Lake Storage Gen2.
You have the following two reports that will access the data lake:
✑ Report1: Reads three columns from a file that contains 50 columns.
✑ Report2: Queries a single record based on a timestamp.
You need to recommend in which format to store the data in the data lake to support the reports. The solution must
minimize read times.
What should you recommend for each report? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Answer:
Explanation:
1: Parquet - column-oriented binary file format

2: AVRO - Row based format, and has logical type timestamp
https://youtu.be/UrWthx8T3UY
Question: 6 CertyIQ
You are designing the folder structure for an Azure Data Lake Storage Gen2 container.
Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics
serverless SQL pools. The data will be secured by subject area. Most queries will include data from the current
year or current month.
Which folder structure should you recommend to support fast queries and simplified folder security?
A. / SubjectArea / DataSource / DD / MM / YYYY / FileData _ YYYY _ MM _ DD .csv

B. / DD / MM / YYYY / SubjectArea / DataSource / FileData _ YYYY _ MM _ DD .csv
C. / YYYY / MM / DD / SubjectArea / DataSource / FileData _ YYYY _ MM _ DD .csv
D. / SubjectArea / DataSource / YYYY / MM / DD / FileData _ YYYY _ MM _ DD .csv
Answer: D
Explanation:
There's an important reason to put the date at the end of the directory structure. If you want to lock down
certain regions or subject matters to users/groups, then you can easily do so with the POSIX permissions.
Otherwise, if there was a need to restrict a certain security group to viewing just the UK data or certain
planes, with the date structure in front a separate permission would be required for numerous directories
under every hour directory. Additionally, having the date structure in front would exponentially increase the
number of directories as time went on.
Note: In IoT workloads, there can be a great deal of data being landed in the data store that spans across
numerous products, devices, organizations, and customers. It's important to pre-plan the directory layout for
organization, security, and efficient processing of the data for down-stream consumers. A general template to
consider might be the following layout:
Region / SubjectMatter(s) / yyyy / mm / dd / hh /
Question: 7 CertyIQ
HOTSPOT -
You need to output files from Azure Data Factory.
Which file format should you use for each type of output? To answer, select the appropriate options in the answer
area.
Hot Area:
Answer:
Explanation:
Box 1: Parquet -
Parquet stores data in columns, while Avro stores data in a row-based format. By their very nature, column-
oriented data stores are optimized for read-heavy analytical workloads, while row-based databases are best
for write-heavy transactional workloads.
Box 2: Avro -
An Avro schema is created using JSON format.
AVRO supports timestamps.
Note: Azure Data Factory supports the following file formats (not GZip or TXT).
Avro format -
✑ Binary format
✑ Delimited text format
✑ Excel format
✑ JSON format
✑ ORC format
✑ Parquet format
✑ XML format
Reference:
https://www.datanami.com/2018/05/16/big-data-file-formats-demystified
Question: 8 CertyIQ
HOTSPOT -
You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.
Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains
the same data attributes and data from a subsidiary of your company.
You need to move the files to a different folder and transform the data to meet the following requirements:
✑ Provide the fastest possible query times.
✑ Automatically infer the schema from the underlying files.
How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer
area.
Hot Area:
Answer:
Explanation:
1. Merge Files
2. Parquet
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
Question: 9 CertyIQ
HOTSPOT -
You have a data model that you plan to implement in a data warehouse in Azure Synapse Analytics as shown in the
following exhibit.
All the dimension tables will be less than 2 GB after compression, and the fact table will be approximately 6 TB.
The dimension tables will be relatively static with very few data inserts and updates.
Which type of table should you use for each table? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Box 1: Replicated -
Replicated tables are ideal for small star-schema dimension tables, because the fact table is often distributed
on a column that is not compatible with the connected dimension tables. If this case applies to your schema,
consider changing small dimension tables currently implemented as round-robin to replicated.
Box 2: Replicated -
Box 3: Replicated -
Box 4: Hash-distributed -
For Fact tables use hash-distribution with clustered columnstore index. Performance improves when two hash
tables are joined on the same distribution column.
Reference:
https://azure.microsoft.com/en-us/updates/reduce-data-movement-and-make-your-queries-more-efficient-wi
th-the-general-availability-of-replicated-tables/ https://azure.microsoft.com/en-us/blog/replicated-tables-no
w-generally-available-in-azure-sql-data-warehouse/
Question: 10 CertyIQ
HOTSPOT -
You have an Azure Data Lake Storage Gen2 container.
Data is ingested into the container, and then transformed by a data integration application. The data is NOT
modified after that. Users can read files in the container but cannot modify the files.
You need to design a data archiving solution that meets the following requirements:
✑ New data is accessed frequently and must be available as quickly as possible.
✑ Data that is older than five years is accessed infrequently but must be available within one second when
requested.
✑ Data that is older than seven years is NOT accessed. After seven years, the data must be persisted at the lowest
cost possible.
✑ Costs must be minimized while maintaining the required availability.
How should you manage the data? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point
Hot Area:
Answer:
Explanation:
Box 1: Move to cool storage -
Box 2: Move to archive storage -

Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible
latency requirements, on the order of hours.
The following table shows a comparison of premium performance block blob storage, and the hot, cool, and
archive access tiers.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers
DRAG DROP -
You need to create a partitioned table in an Azure Synapse Analytics dedicated SQL pool.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct
targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between
panes or scroll to view content.
Select and Place:
Answer:
Explanation:
Box 1: DISTRIBUTION -
Table distribution options include DISTRIBUTION = HASH ( distribution_column_name ), assigns each row to
one distribution by hashing the value stored in distribution_column_name.
Box 2: PARTITION -
Table partition options. Syntax:
PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,...n] ] ))
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse
?
You need to design an Azure Synapse Analytics dedicated SQL pool that meets the following requirements:
✑ Can return an employee record from a given point in time.
✑ Maintains the latest employee information.
✑ Minimizes query complexity.
How should you model the employee data?
A. as a temporal table
B. as a SQL graph table
C. as a degenerate dimension table
D. as a Type 2 slowly changing dimension (SCD) table
Answer: D
Explanation:
A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so
the data warehouse load process detects and manages changes in a dimension table. In this case, the
dimension table must use a surrogate key to provide a unique reference to a version of the dimension member.
It also includes columns that define the date range validity of the version (for example, StartDate and
EndDate) and possibly a flag column (for example,
IsCurrent) to easily filter by current dimension members.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analyt
ics-pipelines/3-choose-between-dimension-types
You have an enterprise-wide Azure Data Lake Storage Gen2 account. The data lake is accessible only through an
Azure virtual network named VNET1.
You are building a SQL pool in Azure Synapse that will use data from the data lake.
Your company has a sales team. All the members of the sales team are in an Azure Active Directory group named
Sales. POSIX controls are used to assign the
Sales group access to the files in the data lake.
You plan to load data to the SQL pool every hour.
You need to ensure that the SQL pool can load the sales data from the data lake.
Which three actions should you perform? Each correct answer presents part of the solution.
NOTE: Each area selection is worth one point.
A. Add the managed identity to the Sales group.

B. Use the managed identity as the credentials for the data load process.
C. Create a shared access signature (SAS).
D. Add your Azure Active Directory (Azure AD) account to the Sales group.
E. Use the shared access signature (SAS) as the credentials for the data load process.
F. Create a managed identity.
Answer: ABF
Explanation:
The managed identity grants permissions to the dedicated SQL pools in the workspace.
Note: Managed identity for Azure resources is a feature of Azure Active Directory. The feature provides Azure
services with an automatically managed identity in
Azure AD -
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-identity
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool that contains the users shown in the following table.
User1 executes a query on the database, and the query returns the results shown in the following exhibit.
User1 is the only user who has access to the unmasked data.
Use the drop-down menus to select the answer choice that completes each statement based on the information
presented in the graphic.
Hot Area:
Answer:
Explanation:
Box 1: 0 -
The YearlyIncome column is of the money data type.
The Default masking function: Full masking according to the data types of the designated fields
✑ Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney,
tinyint, float, real).
Box 2: the values stored in the database
Users with administrator privileges are always excluded from masking, and see the original data without any
mask.
Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview
You have an enterprise data warehouse in Azure Synapse Analytics.
Using PolyBase, you create an external table named [Ext].[Items] to query Parquet files stored in Azure Data Lake
Storage Gen2 without importing the data to the data warehouse.
The external table has three columns.
You discover that the Parquet files have a fourth column named ItemID.
Which command should you run to add the ItemID column to the external table?
A.
B.
C.
D.
Answer: C
Explanation:
Incorrect Answers:
A, D: Only these Data Definition Language (DDL) statements are allowed on external tables:
✑ CREATE TABLE and DROP TABLE
✑ CREATE STATISTICS and DROP STATISTICS
✑ CREATE VIEW and DROP VIEW
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql
HOTSPOT -
You have two Azure Storage accounts named Storage1 and Storage2. Each account holds one container and has
the hierarchical namespace enabled. The system has files that contain data stored in the Apache Parquet format.
You need to copy folders and files from Storage1 to Storage2 by using a Data Factory copy activity. The solution
must meet the following requirements:
✑ No transformations must be performed.
✑ The original folder structure must be retained.
✑ Minimize time required to perform the copy activity.
How should you configure the copy activity? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Binary - copy files as is in fastest way.
PreserveHierarchy - for saving folder structure.

You have an Azure Data Lake Storage Gen2 container that contains 100 TB of data.
You need to ensure that the data in the container is available for read workloads in a secondary region if an outage
occurs in the primary region. The solution must minimize costs.
Which type of data redundancy should you use?
A. geo-redundant storage (GRS)

B. read-access geo-redundant storage (RA-GRS)
C. zone-redundant storage (ZRS)
D. locally-redundant storage (LRS)
Answer: A
Explanation:
1. While it is true that the customer/Microsoft has to initiate the failover, this is not elaborated in any sense in
the question. What is the point of GRS if you cannot read from it after a failover? It provides the service
needed, at the lowest cost.This would be different if there were keywords like "available immediately without
downtime" or "automatically" but there are none, so well, if a region fails, you fail over, and read from
secondary region. Bottom line: A. GRS
2. Answer is A GRS
You plan to implement an Azure Data Lake Gen 2 storage account.
You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The
solution must minimize costs.
Which type of replication should you use for the storage account?
A. geo-redundant storage (GRS)

B. geo-zone-redundant storage (GZRS)
C. locally-redundant storage (LRS)
D. zone-redundant storage (ZRS)
Answer: D
Explanation:
Zone-redundant storage (ZRS) copies your data synchronously across three Azure availability zones in the
primary region.
Incorrect Answers:
C: Locally redundant storage (LRS) copies your data synchronously three times within a single physical
location in the primary region. LRS is the least expensive replication option, but is not recommended for
applications requiring high availability or durability
Reference:
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy
HOTSPOT -
You have a SQL pool in Azure Synapse.
You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be
loaded daily. The table will be truncated before each daily load.
You need to create the staging table. The solution must minimize how long it takes to load the data to the staging
table.
How should you configure the table? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Round -robin
heap
none
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table
contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.
SELECT -
SupplierKey, StockItemKey, IsOrderFinalized, COUNT(*)
FROM FactPurchase -
WHERE DateKey >= 20210101 -
AND DateKey <= 20210131 -

GROUP By SupplierKey, StockItemKey, IsOrderFinalized
Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on IsOrderFinalized
Answer: B
Explanation:
Hash-distributed tables improve query performance on large fact tables.
To balance the parallel processing, select a distribution column that:
✑ Has many unique values. The column can have duplicate values. All rows with the same value are assigned
to the same distribution. Since there are 60 distributions, some distributions can have > 1 unique values while
others may end with zero values.
✑ Does not have NULLs, or has only a few NULLs.
✑ Is not a date column.
Incorrect Answers:
C: Round-robin tables are useful for improving loading speed.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-
distribute
HOTSPOT -
From a website analytics system, you receive data extracts about user interactions such as downloads, link clicks,
form submissions, and video plays.
The data contains the following columns.
You need to design a star schema to support analytical queries of the data. The star schema will contain four
tables including a date dimension.
To which table should you add each column? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Box 1: DimEvent -
Box 2: DimChannel -
Box 3: FactEvents -
Fact tables store observations or events, and can be sales orders, stock balances, exchange rates,
temperatures, etc
Reference:
https://docs.microsoft.com/en-us/power-bi/guidance/star-schema
Note: This question is part of a series of questions that present the same scenario. Each question in the series
contains a unique solution that might meet the stated goals. Some question sets might have more than one correct
solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not
appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical
values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You convert the files to compressed delimited text files.
Does this meet the goal?
A. Yes
B. No
Answer: A
Explanation:
All file formats have different performance characteristics. For the fastest load, use compressed delimited
text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
Solution: You copy the files to a table that has a columnstore index.
A. Yes
B. No
Answer: B
Explanation:
Instead convert the files to compressed delimited text files.
Reference:
Solution: You modify the files to ensure that each row is more than 1 MB.
A. Yes
B. No
Answer: B
Explanation:
Instead convert the files to compressed delimited text files.
Reference:
You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool.
Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for
use in inventory reports. The inventory reports will use the data and additional WHERE parameters depending on
the report. The reports will be produced once daily.
You need to implement a solution to make the dataset available for the reports. The solution must minimize query
times.
What should you implement?
A. an ordered clustered columnstore index

B. a materialized view
C. result set caching
D. a replicated table
Answer: B
Explanation:
Materialized views for dedicated SQL pools in Azure Synapse provide a low maintenance method for complex
analytical queries to get fast performance without any query change.
Incorrect Answers:
C: One daily execution does not make use of result cache caching.
Note: When result set caching is enabled, dedicated SQL pool automatically caches query results in the user
database for repetitive use. This allows subsequent query executions to get results directly from the
persisted cache so recomputation is not needed. Result set caching improves query performance and reduces
compute resource usage. In addition, queries using cached results set do not use any concurrency slots and
thus do not count against existing concurrency limits.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-materiali
zed-views https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tunin
g-result-set-caching
You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.
You plan to create a database named DB1 in Pool1.
You need to ensure that when tables are created in DB1, the tables are available automatically as external tables
to the built-in serverless SQL pool.
Which format should you use for the tables in DB1?
A. CSV
B. ORC
C. JSON
D. Parquet
Answer: AD
Explanation:
Both A and D are correct
"For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is
created in a serverless SQL pool database. As such, you can shut down your Spark pools and still query Spark
external tables from serverless SQL pool."
You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure
Data Lake Storage Gen2. The developers who will implement the stream processing solution use Java.
Which service should you recommend using to process the streaming data?
A. Azure Event Hubs

B. Azure Data Factory
C. Azure Stream Analytics
D. Azure Databricks
Answer: D
Explanation:
The following tables summarize the key differences in capabilities for stream processing technologies in
Azure.
General capabilities -
Integration capabilities -
Reference:
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing
You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files
will vary based on the number of events that occur per hour.
File sizes range from 4 KB to 5 GB.
You need to ensure that the files stored in the container are optimized for batch processing.
What should you do?
A. Convert the files to JSON

B. Convert the files to Avro
C. Compress the files
D. Merge the files
Answer: D
Explanation:
1. If you store your data as many small files, this can negatively affect performance. In general, organize your
data into larger sized files for better performance (256 MB to 100 GB in size). https://docs.microsoft.com/en-
us/azure/storage/blobs/data-lake-storage-best-practices#optimize-for-data-ingest
2. Correct Answer: DIf you store your data as many small files, this can negatively affect performance. In
general, organize your data into larger sized files for better performance (256 MB to 100 GB in
size).https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#optimize-for-
data-ingesthttps://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#file-size
HOTSPOT -
You store files in an Azure Data Lake Storage Gen2 container. The container has the storage policy shown in the
following exhibit.
Hot Area:
Answer:
Explanation:
Box 1: moved to cool storage -
The ManagementPolicyBaseBlob.TierToCool property gets or sets the function to tier blobs to cool storage.
Support blobs currently at Hot tier.
Box 2: container1/contoso.csv -
As defined by prefixMatch.
prefixMatch: An array of strings for prefixes to be matched. Each rule can define up to 10 case-senstive
prefixes. A prefix string must start with a container name.
Reference:
https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.management.storage.fluent.models.managemen
tpolicybaseblob.tiertocool
You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will
have a clustered columnstore index and will include the following columns:
✑ TransactionType: 40 million rows per transaction type
✑ CustomerSegment: 4 million per customer segment
✑ TransactionMonth: 65 million rows per month
AccountType: 500 million per account type
You have the following query requirements:
✑ Analysts will most commonly analyze transactions for a given month.
✑ Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or
account type
You need to recommend a partition strategy for the table to minimize query times.
On which column should you recommend partitioning the table?
A. CustomerSegment
B. AccountType
C. TransactionType
D. TransactionMonth
Answer: D
Explanation:
For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per
distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each
table into 60 distributed databases.
Example: Any partitioning added to a table is in addition to the distributions created behind the scenes. Using
this example, if the sales fact table contained 36 monthly partitions, and given that a dedicated SQL pool has
60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all
months are populated. If a table contains fewer than the recommended minimum number of rows per
partition, consider using fewer partitions in order to increase the number of rows per partition.
HOTSPOT -
You have an Azure Data Lake Storage Gen2 account named account1 that stores logs as shown in the following
table.
You do not expect that the logs will be accessed during the retention periods.
You need to recommend a solution for account1 that meets the following requirements:
✑ Automatically deletes the logs at the end of each retention period
✑ Minimizes storage costs
What should you include in the recommendation? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Box 1: Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier
For infrastructure logs: Cool tier - An online tier optimized for storing data that is infrequently accessed or
modified. Data in the cool tier should be stored for a minimum of 30 days. The cool tier has lower storage
costs and higher access costs compared to the hot tier.
For application logs: Archive tier - An offline tier optimized for storing data that is rarely accessed, and that
has flexible latency requirements, on the order of hours.
Data in the archive tier should be stored for a minimum of 180 days.
Box 2: Azure Blob storage lifecycle management rules
Blob storage lifecycle management offers a rule-based policy that you can use to transition your data to the
desired access tier when your specified conditions are met. You can also use lifecycle management to expire
data at the end of its life.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview
You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in
Azure Data Lake Storage, and then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics.
You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and
PolyBase against the files encounter the fewest possible errors. The solution must ensure that the files can be
queried quickly and that the data type information is retained.
What should you recommend?
A. JSON
B. Parquet
C. CSV
D. Avro
Answer: B
Explanation:
Need Parquet to support both Databricks and PolyBase.
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table
named dbo.Sales and a staging table named stg.Sales that has the matching table and partition definitions.
You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in
stg.Sales. The solution must minimize load times.
What should you do?
A. Insert the data from stg.Sales into dbo.Sales.

B. Switch the first partition from dbo.Sales to stg.Sales.
C. Switch the first partition from stg.Sales to dbo.Sales.
D. Update dbo.Sales from stg.Sales.
Answer: C
Explanation:
Target partition should be empty and answer is not obvious.
In default way data comes to stg and dbo partition is empty or you clean it, after that switching from source
(stg) to target (dbo).
You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated
SQL pool.
You plan to keep a record of changes to the available fields.
The supplier data contains the following columns.
Which three additional columns should you add to the data to create a Type 2 SCD? Each correct answer presents
part of the solution.
A. surrogate primary key

B. effective start date
C. business key
D. last modified date
E. effective end date
F. foreign key
Answer: ABE
Explanation:
A type 2 SCD requires a surrogate key to uniquely identify each record when versioning.
A business key is already part of this table - SupplierSystemID. The column is derived from the source data.
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-
analytics-pipelines/3-choose-between-dimension-types under SCD Type 2 “ the dimension table must use a
surrogate key to provide a unique reference to a version of the dimension member.”
HOTSPOT -
You have a Microsoft SQL Server database that uses a third normal form schema.
You plan to migrate the data in the database to a star schema in an Azure Synapse Analytics dedicated SQL pool.
You need to design the dimension tables. The solution must optimize read operations.
What should you include in the solution? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Box 1: Denormalize to a second normal form
Denormalization is the process of transforming higher normal forms to lower normal forms via storing the join
of higher normal form relations as a base relation.
Denormalization increases the performance in data retrieval at cost of bringing update anomalies to a
database.
Box 2: New identity columns -

The collapsing relations strategy can be used in this step to collapse classification entities into component
entities to obtain flat dimension tables with single-part keys that connect directly to the fact table. The
single-part key is a surrogate key generated to ensure it remains unique over time.
Example:
Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated
from the table data. Data modelers like to create surrogate keys on their tables when they design data
warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without
affecting load performance.
Reference:
https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/ htt
ps://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-iden
tity
HOTSPOT -
You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following
columns:
✑ ProductID
✑ ItemPrice
✑ LineTotal
✑ Quantity
✑ StoreID
✑ Minute
✑ Month
✑ Hour
Year -
✑ Day
You need to store the data to support hourly incremental load pipelines that will vary for each Store ID. The
solution must minimize storage costs.
How should you complete the code? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Box 1: partitionBy -
We should overwrite at the partition level.
Example:
df.write.partitionBy("y","m","d")
.mode(SaveMode.Append)
.parquet("/data/hive/warehouse/db_name.db/" + tableName)
Box 2: ("StoreID", "Year", "Month", "Day", "Hour", "StoreID")
Box 3: parquet("/Purchases")
Reference:
https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-par
titions-with-no-new-data
You are designing a partition strategy for a fact table in an Azure Synapse Analytics dedicated SQL pool. The table
has the following specifications:
✑ Contain sales data for 20,000 products.
Use hash distribution on a column named ProductID.
✑ Contain 2.4 billion records for the years 2019 and 2020.
Which number of partition ranges provides optimal compression and performance for the clustered columnstore
index?
A. 40
B. 240
C. 400
D. 2,400
Answer: A
Explanation:
Each partition should have around 1 millions records. Dedication SQL pools already have 60 partitions.
We have the formula: Records/(Partitions*60)= 1 million
Partitions= Records/(1 million * 60)
Partitions= 2.4 x 1,000,000,000/(1,000,000 * 60) = 40
Note: Having too many partitions can reduce the effectiveness of clustered columnstore indexes if each
partition has fewer than 1 million rows. Dedicated SQL pools automatically partition your data into 60
databases. So, if you create a table with 100 partitions, the result will be 6000 partitions.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
HOTSPOT -
You are creating dimensions for a data warehouse in an Azure Synapse Analytics dedicated SQL pool.
You create a table by using the Transact-SQL statement shown in the following exhibit.
Hot Area:
Answer:
Explanation:
Box 1: Type 2 -
Incorrect Answers:
A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension
table data is overwritten.
Box 2: A Surrogate key
product key is a surrogate key as it is an identity column
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table
contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.
SELECT -
SupplierKey, StockItemKey, COUNT(*)
FROM FactPurchase -
WHERE DateKey >= 20210101 -
AND DateKey <= 20210131 -

GROUP By SupplierKey, StockItemKey
Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on DateKey
Answer: B
Explanation:
Hash-distributed tables improve query performance on large fact tables, and are the focus of this article.
Round-robin tables are useful for improving loading speed.
Incorrect:
Not D: Do not use a date column. . All data for the same date lands in the same distribution. If several users are
all filtering on the same date, then only 1 of the 60 distributions do all the processing work.
Reference:
distribute
You are implementing a batch dataset in the Parquet format.
Data files will be produced be using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will
be consumed by an Azure Synapse Analytics serverless SQL pool.
You need to minimize storage costs for the solution.
What should you do?
A. Use Snappy compression for the files.

B. Use OPENROWSET to query the Parquet files.
C. Create an external table that contains a subset of columns from the Parquet files.
D. Store all data as string in the Parquet files.
Answer: A
Explanation:
The answer is A. Consider the compression codec to use when writing to Parquet files. When reading from
Parquet files, Data Factories automatically determine the compression codec based on the file metadata.
Supported types are "none", "gzip", "snappy" (default), and "lzo".
Also you must consider the question is asking for storage cost and not operational (querying included) cost.
https://learn.microsoft.com/en-us/azure/data-factory/format-
DRAG DROP -
You need to build a solution to ensure that users can query specific files in an Azure Data Lake Storage Gen2
account from an Azure Synapse Analytics serverless SQL pool.
Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of
actions to the answer area and arrange them in the correct order.
NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you
select.
Select and Place:
Answer:
Explanation:
Step 1: Create an external data source
You can create external tables in Synapse SQL pools via the following steps:
1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that
should be used to access the storage.
2. CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.
3. CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.
Step 2: Create an external file format object
Creating an external file format is a prerequisite for creating an external table.
Step 3: Create an external table
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
You are designing a data mart for the human resources (HR) department at your company. The data mart will
contain employee information and employee transactions.
From a source system, you have a flat extract that has the following fields:
✑ EmployeeID
FirstName -
✑ LastName
✑ Recipient
✑ GrossAmount
✑ TransactionID
✑ GovernmentID
✑ NetAmountPaid
✑ TransactionDate
You need to design a star schema data model in an Azure Synapse Analytics dedicated SQL pool for the data mart.
Which two tables should you create? Each correct answer presents part of the solution.
A. a dimension table for Transaction

B. a dimension table for EmployeeTransaction
C. a dimension table for Employee
D. a fact table for Employee
E. a fact table for Transaction
Answer: CE
Explanation:
C: Dimension tables contain attribute data that might change but usually changes infrequently. For example, a
customer's name and address are stored in a dimension table and updated only when the customer's profile
changes. To minimize the size of a large fact table, the customer's name and address don't need to be in every
row of a fact table. Instead, the fact table and the dimension table can share a customer ID. A query can join
the two tables to associate a customer's profile and transactions.
E: Fact tables contain quantitative data that are commonly generated in a transactional system, and then
loaded into the dedicated SQL pool. For example, a retail business generates sales transactions every day,
and then loads the data into a dedicated SQL pool fact table for analysis.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-o
verview
You are designing a dimension table for a data warehouse. The table will track the value of the dimension
attributes over time and preserve the history of the data by adding new rows as the data changes.
Which type of slowly changing dimension (SCD) should you use?
A. Type 0
B. Type 1
C. Type 2
D. Type 3
Answer: C
Explanation:
Incorrect Answers:
B: A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the
dimension table data is overwritten.
D: A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table
includes a column for the current value of a member plus either the original or previous value of the member.
So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to
track each change like in a Type 2 SCD.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analyt
ics-pipelines/3-choose-between-dimension-types
DRAG DROP -
You have data stored in thousands of CSV files in Azure Data Lake Storage Gen2. Each file has a header row
followed by a properly formatted carriage return (/ r) and line feed (/n).
You are implementing a pattern that batch loads the files daily into a dedicated SQL pool in Azure Synapse
Analytics by using PolyBase.
You need to skip the header row when you import the files into the data warehouse. Before building the loading
pattern, you need to prepare the required database objects in Azure Synapse Analytics.
Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of
actions to the answer area and arrange them in the correct order.
NOTE: Each correct selection is worth one point
Select and Place:
Answer:
Explanation:
1) create database scoped credentials
2) create external source
3) create file format
4) create external table (it not supports CTAS)
HOTSPOT -
You are building an Azure Synapse Analytics dedicated SQL pool that will contain a fact table for transactions
from the first half of the year 2020.
You need to ensure that the table meets the following requirements:
✑ Minimizes the processing time to delete data that is older than 10 years
✑ Minimizes the I/O for queries that use year-to-date values
How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer
area.
Hot Area:
Answer:
Explanation:
Box 1: PARTITION -
RANGE RIGHT FOR VALUES is used with PARTITION.
Part 2: [TransactionDateID]
Partition on the date column.
Example: Creating a RANGE RIGHT partition function on a datetime column
The following partition function partitions a table or index into 12 partitions, one for each month of a year's
worth of values in a datetime column.
CREATE PARTITION FUNCTION [myDateRangePF1] (datetime)
AS RANGE RIGHT FOR VALUES ('20030201', '20030301', '20030401',
'20030501', '20030601', '20030701', '20030801',
'20030901', '20031001', '20031101', '20031201');
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql
You are performing exploratory analysis of the bus fare data in an Azure Data Lake Storage Gen2 account by using
an Azure Synapse Analytics serverless SQL pool.
You execute the Transact-SQL query shown in the following exhibit.
What do the query results include?
A. Only CSV files in the tripdata_2020 subfolder.

B. All files that have file names that beginning with "tripdata_2020".
C. All CSV files that have file names that contain "tripdata_2020".
D. Only CSV that have file names that beginning with "tripdata_2020".
Answer: D
DRAG DROP -
You use PySpark in Azure Databricks to parse the following JSON input.
You need to output the data in the following tabular format.
How should you complete the PySpark code? To answer, drag the appropriate values to the correct targets. Each
value may be used once, more than once, or not at all. You may need to drag the spit bar between panes or scroll to
view content.
Select and Place:
Answer:
Explanation:
Box 1: select -
Box 2: explode -
Bop 3: alias -
pyspark.sql.Column.alias returns this column aliased with a new name or names (in the case of expressions
that return more than one column, such as explode).
Reference:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.alias.html https://docs.mic
rosoft.com/en-us/azure/databricks/sql/language-manual/functions/explode
HOTSPOT -
You are designing an application that will store petabytes of medical imaging data.
When the data is first created, the data will be accessed frequently during the first week. After one month, the
data must be accessible within 30 seconds, but files will be accessed infrequently. After one year, the data will be
accessed infrequently but must be accessible within five minutes.
You need to select a storage strategy for the data. The solution must minimize costs.
Which storage tier should you use for each time frame? To answer, select the appropriate options in the answer
area.
Hot Area:
Answer:
Explanation:
Box 1: Hot -
Hot tier - An online tier optimized for storing data that is accessed or modified frequently. The Hot tier has the
highest storage costs, but the lowest access costs.
Box 2: Cool -
Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the Cool
tier should be stored for a minimum of 30 days. The
Cool tier has lower storage costs and higher access costs compared to the Hot tier.
Box 3: Cool -
Not Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible
latency requirements, on the order of hours. Data in the
Archive tier should be stored for a minimum of 180 days.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview https://www.altaro.com/hyper-v/
azure-archive-storage/
You have an Azure Synapse Analytics Apache Spark pool named Pool1.
You plan to load JSON files from an Azure Data Lake Storage Gen2 container into the tables in Pool1. The structure
and data types vary by file.
You need to load the files into the tables. The solution must maintain the source data types.
What should you do?
A. Use a Conditional Split transformation in an Azure Synapse data flow.

B. Use a Get Metadata activity in Azure Data Factory.
C. Load the data by using the OPENROWSET Transact-SQL command in an Azure Synapse Analytics serverless
SQL pool.
D. Load the data by using PySpark.
Answer: D
Explanation:
it's about Apache Spark pool, not serverless SQL pool.
You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. Workspace1 contains an
all-purpose cluster named cluster1.
You need to reduce the time it takes for cluster1 to start and scale up. The solution must minimize costs.
What should you do first?
A. Configure a global init script for workspace1.

B. Create a cluster policy in workspace1.
C. Upgrade workspace1 to the Premium pricing tier.
D. Create a pool in workspace1.
Answer: D
Explanation:
You can use Databricks Pools to Speed up your Data Pipelines and Scale Clusters Quickly.
Databricks Pools, a managed cache of virtual machine instances that enables clusters to start and scale 4
times faster.
Reference:
https://databricks.com/blog/2019/11/11/databricks-pools-speed-up-data-pipelines.html
HOTSPOT -
You are building an Azure Stream Analytics job that queries reference data from a product catalog file. The file is
updated daily.
The reference data input details for the file are shown in the Input exhibit. (Click the Input tab.)
The storage account container view is shown in the Refdata exhibit. (Click the Refdata tab.)
You need to configure the Stream Analytics job to pick up the new reference data.
What should you configure? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Box 1: 1. dat / time /product.csv
More detailed things should be put at the last.
Box 2: YYYY-MM-DD -
Note: Date Format [optional]: If you have used date within the Path Pattern that you specified, then you can
select the date format in which your blobs are organized from the drop-down of supported formats.
Example: YYYY/MM/DD, MM/DD/YYYY, etc.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data
HOTSPOT -
You have the following Azure Stream Analytics query.
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
Hot Area:
Answer:
Explanation:
Box 1: YES
Box 2: Yes -
When joining two streams of data explicitly repartitioned, these streams must have the same partition key and
partition count.
Box 3: Yes -
Streaming Units (SUs) represents the computing resources that are allocated to execute a Stream Analytics
job. The higher the number of SUs, the more CPU and memory resources are allocated for your job.
In general, the best practice is to start with 6 SUs for queries that don't use PARTITION BY.
Here there are 10 partitions, so 6x10 = 60 SUs is good.
Note: Remember, Streaming Unit (SU) count, which is the unit of scale for Azure Stream Analytics, must be
adjusted so the number of physical resources available to the job can fit the partitioned flow. In general, six
SUs is a good number to assign to each partition. In case there are insufficient resources assigned to the job,
the system will only apply the repartition if it benefits the job.
Reference:
https://azure.microsoft.com/en-in/blog/maximize-throughput-with-repartitioning-in-azure-stream-analytics/
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-streaming-unit-consumption
HOTSPOT -
You are building a database in an Azure Synapse Analytics serverless SQL pool.
You have data stored in Parquet files in an Azure Data Lake Storege Gen2 container.
Records are structured as shown in the following sample.
"id": 123,
"address_housenumber": "19c",
"address_line": "Memory Lane",
"applicant1_name": "Jane",
"applicant2_name": "Dev"
The records contain two applicants at most.

You need to build a table that includes only the address fields.
area.
Hot Area:
Answer:
Explanation:
Box 1: CREATE EXTERNAL TABLE -
An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External
tables are used to read data from files or write data to files in Azure Storage. With Synapse SQL, you can use
external tables to read external data using dedicated SQL pool or serverless SQL pool.
Syntax:
CREATE EXTERNAL TABLE database_name.schema_name.table_name | schema_name.table_name |
table_name
( <column_definition> [ ,...n ] )
WITH (
LOCATION = 'folder_or_filepath',
DATA_SOURCE = external_data_source_name,
FILE_FORMAT = external_file_format_name
Box 2. OPENROWSET -
When using serverless SQL pool, CETAS is used to create an external table and export query results to Azure
Storage Blob or Azure Data Lake Storage Gen2.
Example:
AS -
SELECT decennialTime, stateName, SUM(population) AS population
FROM -
OPENROWSET(BULK
'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=*/*.parq
FORMAT='PARQUET') AS [r]
GROUP BY decennialTime, stateName
GO -
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and an Azure Data Lake Storage Gen2
account named Account1.
You plan to access the files in Account1 by using an external table.
You need to create a data source in Pool1 that you can reference when you create the external table.
area.
Hot Area:
Answer:
Explanation:
Box 1: 1. dfs
(for Azure Data Lake Storage Gen2)
Box 2: HADOOP -
You have an Azure subscription that contains an Azure Blob Storage account named storage1 and an Azure
Synapse Analytics dedicated SQL pool named
Pool1.
You need to store data in storage1. The data will be read by Pool1. The solution must meet the following
requirements:
Enable Pool1 to skip columns and rows that are unnecessary in a query.
✑ Automatically create column statistics.
✑ Minimize the size of files.
Which type of file should you use?
A. JSON
B. Parquet
C. Avro
D. CSV
Answer: B
Explanation:
Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics
manually until automatic creation of CSV files statistics is supported.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-statistics
DRAG DROP -
You plan to create a table in an Azure Synapse Analytics dedicated SQL pool.
Data in the table will be retained for five years. Once a year, data that is older than five years will be deleted.
You need to ensure that the data is distributed evenly across partitions. The solution must minimize the amount of
time required to delete old data.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct
targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between
panes or scroll to view content.
Select and Place:
Answer:
Explanation:
Box 1: HASH -
Box 2: Order Date Key -
In most cases, table partitions are created on a date column.
A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management.
For example, rather than execute a DELETE statement to delete all rows in a table where the order_date was
in October of 2001, you could partition your data early. Then you can switch out the partition with data for an
empty partition from another table.
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
HOTSPOT -
You have an Azure Data Lake Storage Gen2 service.
You need to design a data archiving solution that meets the following requirements:
✑ Data that is older than five years is accessed infrequently but must be available within one second when
requested.
✑ Data that is older than seven years is NOT accessed.
✑ Costs must be minimized while maintaining the required availability.
How should you manage the data? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Box 1: Move to cool storage -
Box 2: Move to archive storage -
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible
latency requirements, on the order of hours.
The following table shows a comparison of premium performance block blob storage, and the hot, cool, and
archive access tiers.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers
HOTSPOT -
You plan to create an Azure Data Lake Storage Gen2 account.
You need to recommend a storage solution that meets the following requirements:
✑ Provides the highest degree of data resiliency
✑ Ensures that content remains available for writes if a primary data center fails
What should you include in the recommendation? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Ensures that content remains available for writes if a primary data center fails'. RA-GRS and RAGZRS provide
read access only after failover. The correct answer is ZRS as t=stated in the link below "Microsoft
recommends using ZRS in the primary region for Azure Data Lake Storage Gen2 workloads.
" https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy?
toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json
You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse
Analytics dedicated SQL pool.
You have a table that was created by using the following Transact-SQL statement.
Which two columns should you add to the table? Each correct answer presents part of the solution.
A.
B.
C.
D.
E.
Answer: BE
Explanation:
A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a
column for the current value of a member plus either the original or previous value of the member. So Type 3
uses additional columns to track one key instance of history, rather than storing additional rows to track each
change like in a Type 2 SCD.
This type of tracking may be used for one or two columns in a dimension table. It is not common to use it for
many members of the same table. It is often used in combination with Type 1 or Type 2 members.
Reference:
https://k21academy.com/microsoft-azure/azure-data-engineer-dp203-q-a-day-2-live-session-review/
DRAG DROP -
You have an Azure subscription.
You plan to build a data warehouse in an Azure Synapse Analytics dedicated SQL pool named pool1 that will
contain staging tables and a dimensional model.
Pool1 will contain the following tables.
You need to design the table storage for pool1. The solution must meet the following requirements:
✑ Maximize the performance of data loading operations to Staging.WebSessions.
✑ Minimize query times for reporting queries against the dimensional model.
Which type of table distribution should you use for each table? To answer, drag the appropriate table distribution
types to the correct tables. Each table distribution type may be used once, more than once, or not at all. You may
need to drag the split bar between panes or scroll to view content.
Select and Place:
Answer:
Explanation:
Box 1: Replicated -
The best table storage option for a small table is to replicate it across all the Compute nodes.
Box 2: Hash -
Hash-distribution improves query performance on large fact tables.
Box 3: Round-robin -
Round-robin distribution is useful for improving loading speed.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-di
stribute
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool.
You need to create a table named FactInternetSales that will be a large fact table in a dimensional model.
FactInternetSales will contain 100 million rows and two columns named SalesAmount and OrderQuantity. Queries
executed on FactInternetSales will aggregate the values in SalesAmount and OrderQuantity from the last year for
a specific product. The solution must minimize the data size and query execution time.
How should you complete the code? To answer, select the appropriate options in the answer area.
Hot Area:
Answer:
Explanation:
Box 1: (CLUSTERED COLUMNSTORE INDEX
CLUSTERED COLUMNSTORE INDEX -

Columnstore indexes are the standard for storing and querying large data warehousing fact tables. This index
uses column-based data storage and query processing to achieve gains up to 10 times the query performance
in your data warehouse over traditional row-oriented storage. You can also achieve gains up to
10 times the data compression over the uncompressed data size. Beginning with SQL Server 2016 (13.x) SP1,
columnstore indexes enable operational analytics: the ability to run performant real-time analytics on a
transactional workload.
Note: Clustered columnstore index
A clustered columnstore index is the physical storage for the entire table.
To reduce fragmentation of the column segments and improve performance, the columnstore index might
store some data temporarily into a clustered index called a deltastore and a B-tree list of IDs for deleted rows.
The deltastore operations are handled behind the scenes. To return the correct query results, the clustered
columnstore index combines query results from both the columnstore and the deltastore.
Box 2: HASH([ProductKey])
A hash distributed table distributes rows based on the value in the distribution column. A hash distributed
table is designed to achieve high performance for queries on large tables.
Choose a distribution column with data that distributes evenly
Incorrect:
* Not HASH([OrderDateKey]). Is not a date column. All data for the same date lands in the same distribution. If
several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work
* A replicated table has a full copy of the table available on every Compute node. Queries run fast on
replicated tables since joins on replicated tables don't require data movement. Replication requires extra
storage, though, and isn't practical for large tables.
* A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly.
Loading data into a round-robin table is fast. Keep in mind that queries can require more data movement than
the other distribution methods.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview https://do
cs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview ht
tps://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-dist
ribute
You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1. Table1 contains the
following:
✑ One billion rows
✑ A clustered columnstore index
✑ A hash-distributed column named Product Key
✑ A column named Sales Date that is of the date data type and cannot be null
Thirty million rows will be added to Table1 each month.
You need to partition Table1 based on the Sales Date column. The solution must optimize query performance and
data loading.
How often should you create a partition?
A. once per month

B. once per year
C. once per day
D. once per week
Answer: B
Explanation:
Need a minimum 1 million rows per distribution. Each table is 60 distributions. 30 millions rows is added each
month. Need 2 months to get a minimum of 1 million rows per distribution in a new partition.
Note: When creating partitions on clustered columnstore tables, it is important to consider how many rows
belong to each partition. For optimal compression and performance of clustered columnstore tables, a
minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated
SQL pool already divides each table into 60 distributions.
Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this
example, if the sales fact table contained 36 monthly partitions, and given that a dedicated SQL pool has 60
distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all
months are populated. If a table contains fewer than the recommended minimum number of rows per
partition, consider using fewer partitions in order to increase the number of rows per partition.
Reference:
partition
You have an Azure Databricks workspace that contains a Delta Lake dimension table named Table1.
Table1 is a Type 2 slowly changing dimension (SCD) table.
You need to apply updates from a source table to Table1.
Which Apache Spark SQL operation should you use?
A. CREATE
B. UPDATE
C. ALTER
D. MERGE
Answer: D
Explanation:
The Delta provides the ability to infer the schema for data input which further reduces the effort required in
managing the schema changes. The Slowly Changing
Data(SCD) Type 2 records all the changes made to each key in the dimensional table. These operations require
updating the existing rows to mark the previous values of the keys as old and then inserting new rows as the
latest values. Also, Given a source table with the updates and the target table with dimensional data,
SCD Type 2 can be expressed with the merge.
Example:
// Implementing SCD Type 2 operation using merge function
customersTable
.as("customers")
.merge(
stagedUpdates.as("staged_updates"),
"customers.customerId = mergeKey")
.whenMatched("customers.current = true AND customers.address <> staged_updates.address")
.updateExpr(Map(
"current" -> "false",
"endDate" -> "staged_updates.effectiveDate"))
.whenNotMatched()
.insertExpr(Map(
"customerid" -> "staged_updates.customerId",
"address" -> "staged_updates.address",
"current" -> "true",
"effectiveDate" -> "staged_updates.effectiveDate",
"endDate" -> "null"))
.execute()
Reference:
https://www.projectpro.io/recipes/what-is-slowly-changing-data-scd-type-2-operation-delta-table-databricks
You are designing an Azure Data Lake Storage solution that will transform raw JSON files for use in an analytical
workload.
You need to recommend a format for the transformed files. The solution must meet the following requirements:
✑ Contain information about the data types of each column in the files.
✑ Support querying a subset of columns in the files.
✑ Support read-heavy analytical workloads.
✑ Minimize the file size.
What should you recommend?
A. JSON
B. CSV
C. Apache Avro
D. Apache Parquet
Answer: D
Explanation:
Parquet, an open-source file format for Hadoop, stores nested data structures in a flat columnar format.
Compared to a traditional approach where data is stored in a row-oriented approach, Parquet file format is
more efficient in terms of storage and performance.
It is especially good for queries that read particular columns from a wide (with many columns) table since only
needed columns are read, and IO is minimized.
Incorrect:
Not C:
The Avro format is the ideal candidate for storing data in a data lake landing zone because:
1. Data from the landing zone is usually read as a whole for further processing by downstream systems (the
row-based format is more efficient in this case).
2. Downstream systems can easily retrieve table schemas from Avro files (there is no need to store the
schemas separately in an external meta store).
3. Any source schema change is easily handled (schema evolution).
Reference:
https://www.clairvoyant.ai/blog/big-data-file-formats
Solution: You modify the files to ensure that each row is less than 1 MB.
A. Yes
B. No
Answer: A
Explanation:
Polybase loads rows that are smaller than 1 MB.
Note on Polybase Load: PolyBase is a technology that accesses external data stored in Azure Blob storage or
Azure Data Lake Store via the T-SQL language.
Extract, Load, and Transform (ELT)
Extract, Load, and Transform (ELT) is a process by which data is extracted from a source system, loaded into a
data warehouse, and then transformed.
The basic steps for implementing a PolyBase ELT for dedicated SQL pool are:
Extract the source data into text files.
Land the data into Azure Blob storage or Azure Data Lake Store.
Prepare the data for loading.
Load the data into dedicated SQL pool staging tables using PolyBase.
Transform the data.
Insert the data into production tables.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-service-
capacity-limits https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview
You plan to create a dimension table in Azure Synapse Analytics that will be less than 1 GB.
You need to create the table to meet the following requirements:
✑ Provide the fastest query time.
✑ Minimize data movement during queries.
Which type of table should you use?
A. replicated
B. hash distributed
C. heap
D. round-robin
Answer: A
Explanation:
A replicated table has a full copy of the table accessible on each Compute node. Replicating a table removes
the need to transfer data among Compute nodes before a join or aggregation. Since the table has multiple
copies, replicated tables work best when the table size is less than 2 GB compressed. 2 GB is not a hard limit.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-guidance-for-replicat
ed-tables
You are designing a dimension table in an Azure Synapse Analytics dedicated SQL pool.
You need to create a surrogate key for the table. The solution must provide the fastest query performance.
What should you use for the surrogate key?
A. a GUID column
B. a sequence object
C. an IDENTITY column
Answer: C
Explanation:
Use IDENTITY to create surrogate keys using dedicated SQL pool in AzureSynapse Analytics.
Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated
from the table data. Data modelers like to create surrogate keys on their tables when they design data
warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without
affecting load performance.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-id
entity
HOTSPOT
-
You have an Azure Data Lake Storage Gen2 account that contains a container named container1. You have an
Azure Synapse Analytics serverless SQL pool that contains a native external table named dbo.Table1. The source
data for dbo.Table1 is stored in container1. The folder structure of container1 is shown in the following exhibit.
The external data source is defined by using the following statement.
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
Answer:
Explanation:
1.Yes,
2.Yes:
"Unlike Hadoop external tables, native external tables don't return subfolders unless you specify /** at the
end of path" which is the case here.
3. No:
"Both Hadoop and native external tables will skip the files with the names that begin with an underline (_) or a
period (.)", refers to files, not directories, so the last file with the underscore will be exluded.
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?
tabs=hadoop#arguments-create-external-table
You need to create a fact table named Table1 that will store sales data from the last three years. The solution must
be optimized for the following query operations:
• Show order counts by week.

• Calculate sales totals by region.
• Calculate sales totals by product.
• Find all the orders from a given month.
Which data should you use to partition Table1?
A. product
B. month
C. week
D. region
Answer: B
Explanation:
When designing a fact table in a data warehouse, it is important to consider the types of queries that will be
run against it. In this case, the queries that need to be optimized include: show order counts by week,
calculate sales totals by region, calculate sales totals by product, and find all the orders from a given month.
Partitioning the table by month would be the best option in this scenario as it would allow for efficient
querying of data by month, which is necessary for the query operations described above. For example, it
would be easy to find all the orders from a given month by only searching the partition for that specific month.
You are designing the folder structure for an Azure Data Lake Storage Gen2 account.
You identify the following usage patterns:
• Users will query data by using Azure Synapse Analytics serverless SQL pools and Azure Synapse Analytics
serverless Apache Spark pools.
• Most queries will include a filter on the current year or week.
• Data will be secured by data source.
You need to recommend a folder structure that meets the following requirements:
• Supports the usage patterns

• Simplifies folder security
• Minimizes query times
Which folder structure should you recommend?
A. \DataSource\SubjectArea\YYYY\WW\FileData_YYYY_MM_DD.parquet
B. \DataSource\SubjectArea\YYYY-WW\FileData_YYYY_MM_DD.parquet
C. DataSource\SubjectArea\WW\YYYY\FileData_YYYY_MM_DD.parquet
D. \YYYY\WW\DataSource\SubjectArea\FileData_YYYY_MM_DD.parquet
E. WW\YYYY\SubjectArea\DataSource\FileData_YYYY_MM_DD.parquet
Answer: A
Explanation:
The reason is that this folder structure allows for the data to be organized by data source and subject area,
which can help with securing the data by data source. Additionally, it organizes the data by year and week,
which can minimize query times for the queries that include a filter on the current year or week. And also the
file name format is consistent with the folder structure, which makes it easy to understand where the data
comes from.
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a table named table1.
You load 5 TB of data into table1.
You need to ensure that columnstore compression is maximized for table1.
Which statement should you execute?
A. DBCC INDEXDEFRAG (pool1, table1)

B. DBCC DBREINDEX (table1)
C. ALTER INDEX ALL on table1 REORGANIZE
D. ALTER INDEX ALL on table1 REBUILD
Answer: D
Explanation:
ALTER INDEX ALL on table1 REBUILD
This statement will rebuild all indexes on table1, which can help to maximize columnstore compression. The
other options are not appropriate for this task.
DBCC INDEXDEFRAG (pool1, table1) is for defragmenting the indexes and DBCC DBREINDEX (table1) is for
recreating the indexes. ALTER INDEX ALL on table1 REORGANIZE is for reorganizing the indexes.
You have an Azure Synapse Analytics dedicated SQL pool named pool1.
You plan to implement a star schema in pool and create a new table named DimCustomer by using the following
code.
You need to ensure that DimCustomer has the necessary columns to support a Type 2 slowly changing dimension
(SCD).
Which two columns should you add? Each correct answer presents part of the solution.
A.[HistoricalSalesPerson] [nvarchar] (256) NOT NULL

B.[EffectiveEndDate] [datetime] NOT NULL
C.[PreviousModifiedDate] [datetime] NOT NULL
D.[RowID] [bigint] NOT NULL
E.[EffectiveStartDate] [datetime] NOT NULL
Answer: BE
Explanation:
Surrogate is already there as customerkey column
B and E. I don't think RowID is not needed, as there is already a surrogate key that exists with the
CustomerKey column.
HOTSPOT
-
You have an Azure subscription that contains an Azure Synapse Analytics dedicated SQL pool.
You plan to deploy a solution that will analyze sales data and include the following:
•A table named Country that will contain 195 rows

•A table named Sales that will contain 100 million rows
•A query to identify total sales by country and customer from the past 30 days
You need to create the tables. The solution must maximize query performance.
How should you complete the script? To answer, select the appropriate options in the answer area.

Answer:
Explanation:
1. Hash(CustomerID) 2. Replicate
You have an Azure subscription that contains an Azure Data Lake Storage Gen2 account named account1 and an
Azure Synapse Analytics workspace named workspace1.
You need to create an external table in a serverless SQL pool in workspace1. The external table will reference CSV
files stored in account1. The solution must maximize performance.
How should you configure the external table?
A.Use a native external table and authenticate by using a shared access signature (SAS).
B.Use a native external table and authenticate by using a storage account key.
C.Use an Apache Hadoop external table and authenticate by using a shared access signature (SAS).
D.Use an Apache Hadoop external table and authenticate by using a service principal in Microsoft Azure Active
Directory (Azure AD), part of Microsoft Entra.
Answer: A
Explanation:
Serverless SQL Pools cannot use Hadoop, Only Native. Access Key Auth is never best practice therefore
leaving only A as a viable answer.
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop
HOTSPOT
-
You have an Azure Synapse Analytics serverless SQL pool that contains a database named db1. The data model for
db1 is shown in the following exhibit.
presented in the exhibit.
Answer:
Explanation:
Correct answer should be join DimGeography and DimCustomer and 5 tables.You also need to combine
ProductLine and Product in order for the schema to be considered a star schema. This would result in 5
remaining tables: DimCustomer (DimCustomer JOIN DimGeography), DimStore, Date, Product (Product JOIN
ProductLine) and FactOrders.
You have an Azure Databricks workspace and an Azure Data Lake Storage Gen2 account named storage1.
New files are uploaded daily to storage1.
You need to recommend a solution that configures storage1 as a structured streaming source. The solution must
meet the following requirements:
•Incrementally process new files as they are uploaded to storage1.

•Minimize implementation and maintenance effort.
•Minimize the cost of processing millions of files.
•Support schema inference and schema drift.
Which should you include in the recommendation?
A.COPY INTO
B.Azure Data Factory
C.Auto Loader
D.Apache Spark FileStreamSource
Answer: C
Explanation:
Auto Loader provides a Structured Streaming source called cloudFiles. Plus, it supports schema drift. Hence,
Auto Loader is the correct answer.
https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/
You have an Azure subscription that contains the resources shown in the following table.
You need to read the TSV files by using ad-hoc queries and the OPENROWSET function. The solution must assign a
name and override the inferred data type of each column.
What should you include in the OPENROWSET function?
A.the WITH clause

B.the ROWSET_OPTIONS bulk option
C.the DATAFILETYPE bulk option
D.the DATA_SOURCE parameter
Answer: A
Explanation:
Option A seems correct answer as With clause helps to overwrite data types and assign names for columns
In the Question "The solution must assign a name and override the inferred data type of each column", so we
must need a WITH Clause to define the column names and data types.
You plan to create a fact table named Table1 that will contain a clustered columnstore index.
You need to optimize data compression and query performance for Table1.
What is the minimum number of rows that Table1 should contain before you create partitions?
A.100,000
B.600,000
C.1 million
D.60 million
Answer: D
Explanation:
I've seen in the comments the explanation that this question has something to do with distribution and I don't
think this is the case here. It's just that for a partition to have optimal compression, it has to be of at least 1
million rows, and since the idea of having a partition is to divide the data into smaller chunks, you need at least
2 partitions. Therefore, since there's no '2 mil' option, the only option left is '60M'.
Clustered columnstore has the best compression with 1M rows. So it should be 1M * 60 = 60 million rows
You have an Azure Synapse Analytics dedicated SQL pool that contains a table named DimSalesPerson.
DimSalesPerson contains the following columns:
•RepSourceID
•SalesRepID
•FirstName
•LastName
•StartDate
•EndDate
•Region
You are developing an Azure Synapse Analytics pipeline that includes a mapping data flow named Dataflow1.
Dataflow1 will read sales team data from an external source and use a Type 2 slowly changing dimension (SCD)
when loading the data into DimSalesPerson.
You need to update the last name of a salesperson in DimSalesPerson.
Which two actions should you perform? Each correct answer presents part of the solution.
A.Update three columns of an existing row.

B.Update two columns of an existing row.
C.Insert an extra row.
D.Update one column of an existing row.
Answer: CD
Explanation:
SCD Type 2 will have historical changes hence we will have new row and we need to update the existing row's
end date. Hence - CD
https://www.sqlshack.com/implementing-slowly-changing-dimensions-scds-in-data-warehouses/
HOTSPOT
-
You plan to use an Azure Data Lake Storage Gen2 account to implement a Data Lake development environment
that meets the following requirements:
•Read and write access to data must be maintained if an availability zone becomes unavailable.
•Data that was last modified more than two years ago must be deleted automatically.
•Costs must be minimized.
What should you configure? To answer, select the appropriate options in the answer area.

Answer:
Explanation:
Zone-redundant storage (ZRS) & Lifecycle Policy

Thank you
Thank you for being so interested in the premium exam material.
I'm glad to hear that you found it informative and helpful.
But Wait
I wanted to let you know that there is more content available in the full version.
The full paper contains additional sections and information that you may find helpful,
and I encourage you to download it to get a more comprehensive and detailed view of
all the subject matter.
Download Full Version Now
Total: 317 Questions

Link: https://certyiq.com/papers?provider=microsoft&exam=dp-203

dp-203 Dedb75bd432f

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

dp-203 Dedb75bd432f

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

dp-203 Dedb75bd432f

Uploaded by

Copyright:

Available Formats

Certy IQ

Premium exam material

Data Engineering on Microsoft Azure

Total: 317 Questions

You need to alter the table to meet the following requirements:

A. [ManagerEmployeeID] [smallint] NULL

Here is the error message I got:

Error: spark_catalog requires a single-part namespace, but got [mytestdb, dbo].

A. File2.csv and File3.csv only

In case of a serverless pool a wildcard should be added to the location.

1: Parquet - column-oriented binary file format

A. / SubjectArea / DataSource / DD / MM / YYYY / FileData _ YYYY _ MM _ DD .csv

Box 2: Move to archive storage -

A. Add the managed identity to the Sales group.

Binary - copy files as is in fastest way.

PreserveHierarchy - for saving folder structure.

A. geo-redundant storage (GRS)

A. geo-redundant storage (GRS)

WHERE DateKey >= 20210101 -

AND DateKey <= 20210131 -

Hash-distributed tables improve query performance on large fact tables.

To balance the parallel processing, select a distribution column that:

✑ Does not have NULLs, or has only a few NULLs.

✑ Is not a date column.

A. an ordered clustered columnstore index

Both A and D are correct

A. Azure Event Hubs

A. Convert the files to JSON

A. Insert the data from stg.Sales into dbo.Sales.

Target partition should be empty and answer is not obvious.

A. surrogate primary key

Box 2: New identity columns -

We have the formula: Records/(Partitions*60)= 1 million

Partitions= Records/(1 million * 60)

Partitions= 2.4 x 1,000,000,000/(1,000,000 * 60) = 40

IsCurrent) to easily filter by current dimension members.

product key is a surrogate key as it is an identity column

WHERE DateKey >= 20210101 -

AND DateKey <= 20210131 -

A. Use Snappy compression for the files.

Supported types are "none", "gzip", "snappy" (default), and "lzo".

A. a dimension table for Transaction

2) create external source

3) create file format

4) create external table (it not supports CTAS)

A. Only CSV files in the tripdata_2020 subfolder.

A. Use a Conditional Split transformation in an Azure Synapse data flow.

it's about Apache Spark pool, not serverless SQL pool.

A. Configure a global init script for workspace1.

Box 1: 1. dat / time /product.csv

More detailed things should be put at the last.

Example: YYYY/MM/DD, MM/DD/YYYY, etc.

Here there are 10 partitions, so 6x10 = 60 SUs is good.

The records contain two applicants at most.

(for Azure Data Lake Storage Gen2)

In most cases, table partitions are created on a date column.

Box 1: Move to cool storage -

Box 2: Move to archive storage -