DP-203: Data Engineering on Microsoft Azure


Question 111

You have an enterprise data warehouse in Azure Synapse Analytics.
Using PolyBase, you create an external table named [Ext].[Items] to query Parquet files stored in Azure Data Lake Storage Gen2 without importing the data to the data warehouse.
The external table has three columns.
You discover that the Parquet files have a fourth column named ItemID.

Which command should you run to add the ItemID column to the external table?




The ALTER TABLE statement is not supported on external tables; you must drop the table and re-create it with the new column definition.
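A minimal sketch of the drop-and-recreate pattern. All names except [Ext].[Items] and ItemID are illustrative assumptions, including the existing columns and the pre-existing external data source and file format objects:

```sql
-- ALTER TABLE ... ADD is not supported on external tables, so drop the
-- definition and re-create it with the fourth column included.
DROP EXTERNAL TABLE [Ext].[Items];

CREATE EXTERNAL TABLE [Ext].[Items]
(
    [ItemName]  NVARCHAR(100),   -- assumed existing column
    [ItemPrice] DECIMAL(10, 2),  -- assumed existing column
    [ItemDate]  DATE,            -- assumed existing column
    [ItemID]    INT              -- the newly discovered fourth column
)
WITH
(
    LOCATION    = '/items/',          -- assumed folder in the data lake
    DATA_SOURCE = AzureDataLakeStore, -- assumed external data source name
    FILE_FORMAT = ParquetFormat       -- assumed Parquet file format object
);
```

Because the data stays in the data lake, dropping the external table removes only metadata; no data is lost.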

Reference:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/external-sql-tables
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql

Question 112

You have an Azure Synapse Analytics dedicated SQL pool that contains the users shown in the following table.


User1 executes a query on the database, and the query returns the results shown in the following exhibit.


User1 is the only user who has access to the unmasked data.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.




Box 1: 0
The YearlyIncome column is of the money data type.
The default masking function applies full masking according to the data type of the designated field: a zero value is used for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).

Box 2: the values stored in the database
Users with administrator privileges are always excluded from masking and see the original data without any mask.
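As a hedged sketch of how this masking is typically configured (the table name dbo.DimCustomer is an assumption; YearlyIncome and User1 come from the question):

```sql
-- Apply the default mask to YearlyIncome; numeric types are displayed as 0.
ALTER TABLE dbo.DimCustomer
ALTER COLUMN YearlyIncome ADD MASKED WITH (FUNCTION = 'default()');

-- Grant UNMASK so that User1 sees the values stored in the database.
GRANT UNMASK TO User1;
```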

Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview

Question 113

You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.

You need to alter the table to meet the following requirements:
● Ensure that users can identify the current manager of employees.
● Support creating an employee reporting hierarchy for your entire company.
● Provide fast lookup of the managers' attributes such as name and job title.

Which column should you add to the table?
[ManagerEmployeeID] [smallint] NULL
[ManagerEmployeeKey] [smallint] NULL
[ManagerEmployeeKey] [int] NULL
[ManagerName] [varchar](200) NULL




Answer is [ManagerEmployeeKey] [int] NULL

We need an extra column to identify the manager. Use the same data type as the EmployeeKey column, which is int.
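The change can be sketched as a single ALTER TABLE (the table name dbo.DimEmployee is an assumption):

```sql
-- Self-referencing key: ManagerEmployeeKey stores the EmployeeKey of the
-- employee's manager, supporting a reporting hierarchy for the company.
ALTER TABLE dbo.DimEmployee
ADD [ManagerEmployeeKey] [int] NULL;
```

A self-join on EmployeeKey = ManagerEmployeeKey then provides fast lookup of the manager's attributes such as name and job title.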

Reference:
https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular

Question 114

You have a SQL pool in Azure Synapse.
You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be truncated before each daily load.
You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table.

How should you configure the table?




Answer is Round-robin distribution, Heap, No partitions

Round-robin - the simplest distribution model; not ideal for querying, but fast to load.

Heap - a heap is simply a table without a clustered index. Adding a clustered index to a staging table that is truncated before every load wastes compute resources; no clustered index = heap.

No partitions - partitioning by date is useful when the staging destination already holds data, because you can hide the new partition while inserting (to keep users from hitting it), complete the load, and then unhide the partition.
However, the question states that "the table will be truncated before each daily load," so this is a true staging table with no users accessing it, no existing data, and no reason for a date partition.
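Under these assumptions, the staging table DDL might look like the following sketch (table and column definitions are illustrative):

```sql
-- Round-robin distribution + heap: the fastest combination for bulk loading
-- a staging table that is truncated before every daily load.
CREATE TABLE stg.SalesLoad
(
    SaleID   INT           NOT NULL,  -- assumed column
    SaleDate DATE          NOT NULL,  -- assumed column
    Amount   DECIMAL(18,2) NOT NULL   -- assumed column
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);
```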

Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/data-loading-best-practices
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview

Question 115

You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool.
Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for use in inventory reports. The inventory reports will use the data and additional WHERE parameters depending on the report. The reports will be produced once daily.
You need to implement a solution to make the dataset available for the reports. The solution must minimize query times.

What should you implement?
an ordered clustered columnstore index
a materialized view
result set caching
a replicated table




Answer is a materialized view

Materialized views for dedicated SQL pools in Azure Synapse provide a low maintenance method for complex analytical queries to get fast performance without any query change.
Incorrect Answers:
C: One daily execution does not make use of result set caching.
Note: When result set caching is enabled, dedicated SQL pool automatically caches query results in the user database for repetitive use. This allows subsequent query executions to get results directly from the persisted cache so recomputation is not needed. Result set caching improves query performance and reduces compute resource usage. In addition, queries using cached results set do not use any concurrency slots and thus do not count against existing concurrency limits.
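A sketch of a materialized view for this scenario, with hypothetical table, column, and view names (dedicated SQL pool requires a distribution option and supports aggregated SELECTs; COUNT_BIG(*) is included alongside the GROUP BY):

```sql
-- Pre-computes the complex JOIN/CASE transformation once; the daily reports
-- then query the view with their own WHERE clauses at low latency.
CREATE MATERIALIZED VIEW dbo.mvInventorySummary
WITH (DISTRIBUTION = HASH(ProductKey))
AS
SELECT
    f.ProductKey,
    f.WarehouseKey,
    SUM(CASE WHEN f.Status = 'OnHand' THEN f.Quantity ELSE 0 END) AS OnHandQty,
    COUNT_BIG(*) AS TransactionCount
FROM dbo.FactInventory AS f
JOIN dbo.DimProduct AS p
    ON f.ProductKey = p.ProductKey
GROUP BY f.ProductKey, f.WarehouseKey;
```

The optimizer can also rewrite the analysts' original query to use the materialized view automatically, without any query change.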

Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-materialized-views
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-result-set-caching

Question 116

You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.
You plan to create a database named DB1 in Pool1.
You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool.
Which formats should you use for the tables in DB1?
CSV
ORC
JSON
Parquet




Answers are CSV and Parquet; both are correct.

Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each database existing in serverless Apache Spark pools.
For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool database.
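The Spark side can be sketched in Spark SQL (the table name and columns are assumptions). Any Spark table stored as Parquet or CSV in Azure Storage is then exposed automatically as an external table in the matching serverless SQL pool database:

```sql
-- Run in a notebook attached to Pool1 (Spark SQL).
CREATE DATABASE IF NOT EXISTS DB1;

-- Parquet-backed table: synchronized to the serverless SQL pool.
CREATE TABLE DB1.Items
(
    ItemID   INT,
    ItemName STRING
)
USING PARQUET;
```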

Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables

Question 117

You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore index and will include the following columns:
● TransactionType: 40 million rows per transaction type
● CustomerSegment: 4 million rows per customer segment
● TransactionMonth: 65 million rows per month
● AccountType: 500 million rows per account type

You have the following query requirements:
● Analysts will most commonly analyze transactions for a given month.
● Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type
You need to recommend a partition strategy for the table to minimize query times.

On which column should you recommend partitioning the table?
CustomerSegment
AccountType
TransactionType
TransactionMonth




Answer is TransactionMonth

For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributions.
Example: Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition.
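Applied here: 65 million rows per month divided across 60 distributions gives roughly 1.08 million rows per partition per distribution, just meeting the guideline. A sketch of the table, assuming TransactionMonth is stored as an integer such as 202401 (the column list, distribution column, and boundary values are assumptions):

```sql
-- Monthly partitions align with the most common analysis pattern and
-- keep each partition above the 1-million-rows-per-distribution guideline.
CREATE TABLE dbo.FactTransactions
(
    TransactionMonth INT           NOT NULL,
    TransactionType  INT           NOT NULL,
    CustomerSegment  INT           NOT NULL,
    AccountType      INT           NOT NULL,
    Amount           DECIMAL(18,2) NOT NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(TransactionType),
    PARTITION (TransactionMonth RANGE RIGHT
               FOR VALUES (202401, 202402, 202403))
);
```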

Reference:
https://www.linkedin.com/pulse/partitioning-distribution-azure-synapse-analytics-swapnil-mule

Question 118

You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table named dbo.Sales and a staging table named stg.Sales that has the matching table and partition definitions.
You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize load times.
What should you do?
Insert the data from stg.Sales into dbo.Sales.
Switch the first partition from dbo.Sales to stg.Sales.
Switch the first partition from stg.Sales to dbo.Sales.
Update dbo.Sales from stg.Sales.




Answer is Switch the first partition from stg.Sales to dbo.Sales.

Because the goal is to overwrite the content of dbo.Sales with the content of stg.Sales, the syntax is SWITCH source TO target: switch the partition from stg.Sales (source) to dbo.Sales (target).
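The switch is a metadata-only operation, which is why it minimizes load time. A sketch, assuming partition number 1 is the first partition (TRUNCATE_TARGET empties the destination partition as part of the switch):

```sql
-- Replace the first partition of dbo.Sales with the first partition of
-- stg.Sales; no data movement occurs, only metadata changes.
ALTER TABLE stg.Sales SWITCH PARTITION 1 TO dbo.Sales PARTITION 1
WITH (TRUNCATE_TARGET = ON);
```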

Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool

Question 119

You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated SQL pool.
You plan to keep a record of changes to the available fields.
The supplier data contains the following columns.

Which three additional columns should you add to the data to create a Type 2 SCD?
surrogate primary key
effective start date
business key
last modified date
effective end date
foreign key




Answers are surrogate primary key, effective start date, effective end date

A Type 2 SCD requires a surrogate key to uniquely identify each record when versioning. See https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types under SCD Type 2: "the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member." A business key is already part of this table (SupplierSystemID); that column is derived from the source data.
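A sketch of the resulting dimension table, keeping the existing SupplierSystemID business key and adding the three Type 2 columns (all other column names and types are assumptions):

```sql
CREATE TABLE dbo.DimSupplier
(
    SupplierKey        INT IDENTITY(1,1) NOT NULL, -- surrogate primary key
    SupplierSystemID   INT               NOT NULL, -- existing business key
    SupplierName       NVARCHAR(100)     NULL,     -- assumed tracked field
    EffectiveStartDate DATE              NOT NULL, -- version valid from
    EffectiveEndDate   DATE              NULL      -- NULL = current version
);
```

When a tracked field changes, the current row's EffectiveEndDate is set and a new row is inserted with a fresh surrogate key and EffectiveStartDate.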

Reference:
https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation

Question 120

You have a Microsoft SQL Server database that uses a third normal form schema.
You plan to migrate the data in the database to a star schema in an Azure Synapse Analytics dedicated SQL pool.
You need to design the dimension tables. The solution must optimize read operations.
What should you include in the solution?




Box 1: Denormalize to second normal form
Denormalization is the process of transforming higher normal forms into lower normal forms by storing the join of higher-normal-form relations as a base relation.
Denormalization increases data-retrieval performance at the cost of introducing update anomalies into the database.

Box 2: New identity columns
The collapsing relations strategy can be used in this step to collapse classification entities into component entities to obtain flat dimension tables with single-part keys that connect directly to the fact table. The single-part key is a surrogate key generated to ensure it remains unique over time.
Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance.
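A sketch of a denormalized dimension table using the IDENTITY property for its surrogate key (the table, columns, and the REPLICATE distribution choice are illustrative assumptions):

```sql
-- Flat, denormalized dimension with a single-part surrogate key that
-- connects directly to the fact table.
CREATE TABLE dbo.DimProduct
(
    ProductKey   INT IDENTITY(1,1) NOT NULL, -- surrogate key, not from data
    ProductID    INT               NOT NULL, -- business key from the source
    ProductName  NVARCHAR(100)     NULL,
    CategoryName NVARCHAR(100)     NULL      -- collapsed classification entity
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);
```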

Reference:
https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
