DP-203: Data Engineering on Microsoft Azure


Question 191

A company purchases IoT devices to monitor manufacturing machinery. The company uses an Azure IoT Hub to communicate with the IoT devices.
The company must be able to monitor the devices in real-time.
You need to design the solution.
What should you recommend?
Azure Data Factory instance using Azure Portal
Azure Data Factory instance using Azure PowerShell
Azure Stream Analytics cloud job using Azure Portal
Azure Data Factory instance using Microsoft Visual Studio




Answer is Azure Stream Analytics cloud job using Azure Portal

In a real-world scenario, you could have hundreds of these sensors generating events as a stream. Ideally, a gateway device would run code to push these events to Azure Event Hubs or Azure IoT Hubs. Your Stream Analytics job would ingest these events from Event Hubs and run real-time analytics queries against the streams.
Create a Stream Analytics job:
In the Azure portal, select + Create a resource from the left navigation menu. Then, select Stream Analytics job from Analytics.
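
For illustration, a Stream Analytics cloud job is defined mainly by a SQL-like query that reads from an input and writes to an output. A minimal sketch, assuming the job has an IoT Hub input aliased as IoTHubInput and an output aliased as MonitoringOutput (both names are hypothetical, as are the Temperature and Vibration telemetry fields):

-- Stream each device event to the monitoring output as it arrives.
-- IoTHubInput and MonitoringOutput are placeholder aliases configured on the job;
-- Temperature and Vibration stand in for whatever telemetry the devices send.
SELECT
    IoTHub.ConnectionDeviceId AS DeviceId,
    EventEnqueuedUtcTime AS EventTime,
    Temperature,
    Vibration
INTO
    MonitoringOutput
FROM
    IoTHubInput TIMESTAMP BY EventEnqueuedUtcTime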

Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-get-started-with-azure-stream-analytics-to-process-data-from-iot-devices

Question 192

You are designing an anomaly detection solution for streaming data from an Azure IoT hub. The solution must meet the following requirements:
- Send the output to Azure Synapse.
- Identify spikes and dips in time series data.
- Minimize development and configuration effort.

What should you include in the solution?
Azure Databricks
Azure Stream Analytics
Azure SQL Database




Answer is Azure Stream Analytics

You can identify anomalies by routing data via IoT Hub to a built-in ML model in Azure Stream Analytics.
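
To show what minimal development effort looks like here, Stream Analytics exposes spike-and-dip detection as a built-in function of its SQL-like query language. A sketch, assuming an IoT Hub input aliased as iothubinput, a Synapse output aliased as synapseoutput, and a numeric telemetry field named temperature (all three names are hypothetical):

WITH AnomalyDetectionStep AS
(
    SELECT
        EventEnqueuedUtcTime AS EventTime,
        CAST(temperature AS float) AS Temperature,
        -- Built-in ML model: 95% confidence, 120-event history, 2-minute sliding window
        AnomalyDetection_SpikeAndDip(CAST(temperature AS float), 95, 120, 'spikesanddips')
            OVER (LIMIT DURATION(second, 120)) AS SpikeAndDipScores
    FROM iothubinput
)
SELECT
    EventTime,
    Temperature,
    CAST(GetRecordPropertyValue(SpikeAndDipScores, 'Score') AS float) AS SpikeAndDipScore,
    CAST(GetRecordPropertyValue(SpikeAndDipScores, 'IsAnomaly') AS bigint) AS IsSpikeAndDipAnomaly
INTO synapseoutput
FROM AnomalyDetectionStep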

Reference:
https://docs.microsoft.com/en-us/learn/modules/data-anomaly-detection-using-azure-iot-hub/

Question 193

You are a data engineer. You are designing a Hadoop Distributed File System (HDFS) architecture. You plan to use Microsoft Azure Data Lake as a data storage repository.
You must provision the repository with a resilient data schema. You need to ensure the resiliency of the Azure Data Lake Storage. For each of the three boxes in the architecture diagram, choose A (DataNode) or B (NameNode). What should you use?
A - A - A
A - A - B
A - B - B
B - A - A
B - A - B
B - B - A
B - B - B




Answer is B - A - A

Box 1: NameNode
An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

Box 2: DataNode
The DataNodes are responsible for serving read and write requests from the file system's clients.

Box 3: DataNode
The DataNodes perform block creation, deletion, and replication upon instruction from the NameNode.

Note: HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

References:
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#NameNode+and+DataNodes

Question 194

You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.
CREATE TABLE mytestdb.myParquetTable(
   EmployeeID int,
   EmployeeName string,
   EmployeeStartDate date)
USING Parquet
You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.

One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.
SELECT EmployeeID
FROM mytestdb.dbo.myParquetTable
WHERE name = 'Alice';
What will be returned by the query?
24
an error
a null value




Answer is an error

Once a database has been created by a Spark job, you can create tables in it with Spark that use Parquet as the storage format. Table names will be converted to lower case and need to be queried using the lower case name. These tables will immediately become available for querying by any of the Azure Synapse workspace Spark pools. The Spark created, managed, and external tables are also made available as external tables with the same name in the corresponding synchronized database in serverless SQL pool. In the scenario above, the query filters on a column called name, which does not exist in the table (the Spark DDL defines EmployeeName), and it references the table with its original mixed-case name rather than the lower-case name, so it returns an error.
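
Put differently, the data is reachable from the serverless SQL pool, but only under the lower-cased table name and the columns declared in the Spark DDL. A query of roughly this shape (assuming the inserted row really does contain an employee named Alice) would succeed where the original one errors out:

-- Reference the synchronized table with its lower-case name and
-- filter on the column that actually exists (EmployeeName, not name).
SELECT EmployeeID
FROM mytestdb.dbo.myparquettable
WHERE EmployeeName = 'Alice';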

Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table

Question 195

You are planning the deployment of Azure Data Lake Storage Gen2.

You have the following two reports that will access the data lake:
- Report1: Reads three columns from a file that contains 50 columns.
- Report2: Queries a single record based on a timestamp.

You need to recommend in which format to store the data in the data lake to support the reports. The solution must minimize read times.

What should you recommend for each report?




Answer is Report1: Parquet, Report2: Avro

Report1: Parquet - a column-oriented binary format, so only the three required columns need to be read from the 50-column file.
Report2: Avro - a row-based format with a logical timestamp type, well suited to retrieving a single complete record by timestamp.
Row-oriented text formats such as CSV or TSV would require scanning entire rows and files, so they do not minimize read times for either report.
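
As a quick illustration of why the column-oriented layout helps Report1, a Synapse serverless SQL pool can read Parquet files with OPENROWSET and only scans the projected columns. The storage path and column names below are placeholders:

-- Only the three projected columns are read from the 50-column Parquet files.
-- The storage account, container, and column names are hypothetical.
SELECT SensorId, Reading, ReadingTime
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/files/report1/*.parquet',
    FORMAT = 'PARQUET'
) AS r;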

Reference:
https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Destinations/ADLS-G2-D.html
https://youtu.be/UrWthx8T3UY

Question 196

An application will use Microsoft Azure Cosmos DB as its data solution. The application will use the Cassandra API to support a column-based database type that uses containers to store items.
You need to provision Azure Cosmos DB.

Which container name and item name should you use?
Each correct answer presents part of the solution.
collection
rows
graph
entities
table




Answer is table (container name) and rows (item name)

Rows: Depending on which API you use, an Azure Cosmos item can represent either a document in a collection, a row in a table, or a node or edge in a graph. The mapping of API-specific entities to an Azure Cosmos item is:
- SQL API: Document
- Cassandra API: Row
- Azure Cosmos DB API for MongoDB: Document
- Gremlin API: Node or edge
- Table API: Item

Table: An Azure Cosmos container is specialized into API-specific entities as follows:
- SQL API: Container
- Cassandra API: Table
- Azure Cosmos DB API for MongoDB: Collection
- Gremlin API: Graph
- Table API: Table
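
To make the mapping concrete, here is a minimal CQL sketch against the Cassandra API: the CREATE TABLE statement creates what Cosmos DB surfaces as a container, and each inserted row becomes an Azure Cosmos item. The keyspace, table, and column names are purely illustrative.

-- Keyspace, table, and column names are illustrative only.
CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- The table is exposed by Cosmos DB as a container.
CREATE TABLE IF NOT EXISTS telemetry.readings (
    device_id    text,
    reading_time timestamp,
    temperature  double,
    PRIMARY KEY (device_id, reading_time)
);

-- Each inserted row is an Azure Cosmos item.
INSERT INTO telemetry.readings (device_id, reading_time, temperature)
VALUES ('device-01', toTimestamp(now()), 21.5);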


Reference:
https://docs.microsoft.com/en-us/azure/cosmos-db/databases-containers-items

Question 197

You are developing a data engineering solution for a company. The solution will store a large set of key-value pair data by using Microsoft Azure Cosmos DB.
The solution has the following requirements:

You need to provision Azure Cosmos DB.
Configure account-level throughput.
Provision an Azure Cosmos DB account with the Azure Table API. Enable geo-redundancy.
Configure table-level throughput.
Replicate the data globally by manually adding regions to the Azure Cosmos DB account.
Provision an Azure Cosmos DB account with the Azure Table API. Enable multi-region writes.




Answer is Provision an Azure Cosmos DB account with the Azure Table API. Enable multi-region writes.

Scale read and write throughput globally: you can enable every region to be writable and elastically scale reads and writes all around the world. The throughput that your application configures on an Azure Cosmos database or a container is guaranteed to be delivered across all regions associated with your Azure Cosmos account. The provisioned throughput is guaranteed by financially backed SLAs.

References:
https://docs.microsoft.com/en-us/azure/cosmos-db/distribute-data-globally

Question 198

You want to ensure that there is 99.999% availability for the reading and writing of all your data. How can this be achieved?
By configuring reads and writes of data in a single region.
By configuring reads and writes of data for multi-region accounts with multi region writes.
By configuring reads and writes of data for multi-region accounts with a single region writes.




Answer is "By configuring reads and writes of data for multi-region accounts with multi region writes."

By configuring reads and writes of data for multi-region accounts with multi region writes, you can achieve 99.999% availability

Question 199

What are the three main advantages to using Cosmos DB?
Cosmos DB offers global distribution capabilities out of the box.
Cosmos DB provides a minimum of 99.99% availability.
Cosmos DB response times of read/write operations are typically in the order of 10s of milliseconds.
All of the above.




Answer is All of the above.

All of the above. Cosmos DB offers global distribution capabilities out of the box, provides a minimum of 99.99% availability, and delivers read/write response times that are typically in the order of tens of milliseconds.

Question 200

You are a data engineer wanting to make the data that is currently stored in a Table Storage account located in the West US region available globally. Which Cosmos DB model should you migrate to?
Gremlin API
Cassandra API
Table API
Mongo DB API




Answer is Table API

The Table API Cosmos DB model will enable you to provide global availability of your Table Storage account data. The Gremlin API is used to store graph databases, the Cassandra API is used to store data from Cassandra databases, and the MongoDB API is used to store data from MongoDB databases.
