Professional Data Engineer on Google Cloud Platform


Question 221

You need to create a near real-time inventory dashboard that reads the main inventory tables in your BigQuery data warehouse. Historical inventory data is stored as inventory balances by item and location. You have several thousand updates to inventory every hour. You want to maximize performance of the dashboard and ensure that the data is accurate.

What should you do?
Leverage BigQuery UPDATE statements to update the inventory balances as they are changing.
Partition the inventory balance table by item to reduce the amount of data scanned with each inventory update.
Use BigQuery streaming to stream changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.
Use the BigQuery bulk loader to batch load inventory changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.




Answer is Use BigQuery streaming to stream changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.

A: Not correct. BigQuery DML is limited by a quota of roughly 1,500 table modifications per table per day, so issuing an UPDATE for every inventory change does not scale to thousands of updates per hour.
B: Not correct. Partitioning the balance table by item does not remove the same per-table update limit, and it does nothing to make the dashboard near real-time.
C: Correct. Streamed data is available for real-time analysis within a few seconds of the first streaming insert into the daily inventory movement table. For the near real-time dashboard, balances are calculated in a view that joins the daily inventory movement table to the historical inventory balance table, and a nightly process folds the day's movements into the historical balance table.
D: Not correct. A nightly batch load does not meet the near real-time requirement.
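
To make option C concrete, here is a minimal Python sketch, assuming hypothetical project, dataset, table, and column names: it streams a change into the daily movement table and defines the view that adds streamed movements to the nightly-refreshed balances.

from google.cloud import bigquery

client = bigquery.Client()

# Stream each inventory change into the daily movement table as it happens.
errors = client.insert_rows_json(
    "my_project.inventory.daily_inventory_movement",
    [{"item": "SKU-123", "location": "WH-7", "qty_delta": -4,
      "event_time": "2024-05-01T12:00:00Z"}],
)
assert not errors, errors

# View used by the dashboard: nightly balances plus today's streamed movements.
client.query("""
CREATE OR REPLACE VIEW `my_project.inventory.current_balance` AS
SELECT
  b.item,
  b.location,
  b.balance + IFNULL(SUM(m.qty_delta), 0) AS current_balance
FROM `my_project.inventory.historical_inventory_balance` AS b
LEFT JOIN `my_project.inventory.daily_inventory_movement` AS m
  ON m.item = b.item AND m.location = b.location
GROUP BY b.item, b.location, b.balance
""").result()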

Question 222

You have data stored in BigQuery. The data in the BigQuery dataset must be highly available. You need to define a storage, backup, and recovery strategy for this data that minimizes cost.

How should you configure the BigQuery table?
Set the BigQuery dataset to be regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
Set the BigQuery dataset to be regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
Set the BigQuery dataset to be multi-regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
Set the BigQuery dataset to be multi-regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.




Answer is Set the BigQuery dataset to be multi-regional. In the event of an emergency, use a point-in-time snapshot to recover the data.

Highly available = multi-regional:
https://cloud.google.com/bigquery/docs/locations

Recovery strategy of this data that minimizes cost = point-in-time snapshot:
https://cloud.google.com/solutions/bigquery-data-warehouse#backup-and-recovery
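
As a rough Python sketch (dataset, table names, and location are hypothetical), the configuration and the recovery step could look like the following; BigQuery time travel provides the point-in-time snapshot.

from google.cloud import bigquery

client = bigquery.Client()

# Create the dataset in a multi-region location for high availability.
dataset = bigquery.Dataset("my_project.sales")
dataset.location = "EU"  # multi-region
client.create_dataset(dataset, exists_ok=True)

# Recovery: restore the table as it existed one hour ago using time travel.
client.query("""
CREATE OR REPLACE TABLE `my_project.sales.orders_recovered` AS
SELECT *
FROM `my_project.sales.orders`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
""").result()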

Question 223

You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work in progress on your clusters.

What should you do?
Increase the cluster size with more non-preemptible workers.
Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a script to preserve work.
Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.




Answer is Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.

Graceful decommissioning ensures that a worker finishes its in-progress work before YARN removes it from the cluster, so no work is lost when the cluster is resized.

Reference:
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters
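
A hedged Python sketch follows (project, region, and cluster names are hypothetical): adding secondary workers is an ordinary cluster update, and when those workers are later removed, the graceful-decommission timeout lets YARN finish in-progress work before the nodes go away.

from google.cloud import dataproc_v1

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

# Scale the secondary (preemptible) worker pool back down with a graceful
# decommission timeout so running YARN work finishes before nodes are removed.
operation = client.update_cluster(
    request={
        "project_id": "my-project",
        "region": "us-central1",
        "cluster_name": "my-cluster",
        "cluster": {"config": {"secondary_worker_config": {"num_instances": 0}}},
        "update_mask": {"paths": ["config.secondary_worker_config.num_instances"]},
        "graceful_decommission_timeout": {"seconds": 3600},
    }
)
operation.result()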

Question 224

You have historical data covering the last three years in BigQuery and a data pipeline that delivers new data to BigQuery daily. You have noticed that when the Data Science team runs a query filtered on a date column and limited to 30-90 days of data, the query scans the entire table. You also noticed that your bill is increasing more quickly than you expected. You want to resolve the issue as cost-effectively as possible while maintaining the ability to conduct SQL queries.

What should you do?
Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE Type.
Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.
Modify your pipeline to maintain the last 30-90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.
Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.




Answer is Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE Type.

With partitioning, queries that filter on the date column scan only the matching partitions, so selecting 30-90 days of data becomes faster and cheaper. Storage cost also drops, because partitions not modified in the last 90 days qualify for long-term storage rates.
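
A minimal sketch of the DDL approach in Python, assuming hypothetical project, dataset, table, and column names:

from google.cloud import bigquery

client = bigquery.Client()

# Re-create the table partitioned on the DATE of a TIMESTAMP column.
client.query("""
CREATE TABLE `my_project.warehouse.events_partitioned`
PARTITION BY DATE(event_timestamp) AS
SELECT * FROM `my_project.warehouse.events`
""").result()

# Queries that filter on the partitioning column now scan only the matching
# partitions instead of the full three-year history.
client.query("""
SELECT COUNT(*)
FROM `my_project.warehouse.events_partitioned`
WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
""").result()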

Question 225

You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?
Deploy small Kafka clusters in your data centers to buffer events.
Have the data acquisition devices publish data to Cloud Pub/Sub.
Establish a Cloud Interconnect between all remote data centers and Google.
Write a Cloud Dataflow pipeline that aggregates all data in session windows.




Answer is Have the data acquisition devices publish data to Cloud Pub/Sub.

The most cost-effective option is Cloud Pub/Sub. Having the devices publish directly to Pub/Sub takes the unreliable leased lines out of the delivery path, and Pub/Sub absorbs unpredictable latency without additional infrastructure such as Kafka clusters or Cloud Interconnect.
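
A minimal Python sketch of a device publishing an event directly to Pub/Sub (project, topic, and payload fields are hypothetical):

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "vehicle-events")

# Pub/Sub buffers the message until the processing side consumes it, so the
# unreliable links no longer sit between collection and processing.
event = {"vehicle_id": "V-42", "event": "door_open", "ts": "2024-05-01T12:00:00Z"}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"),
                           vehicle_id="V-42")
print(future.result())  # message ID once the publish is acknowledged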

Question 226

You operate a database that stores stock trades and an application that retrieves average stock price for a given company over an adjustable window of time. The data is stored in Cloud Bigtable where the datetime of the stock trade is the beginning of the row key. Your application has thousands of concurrent users, and you notice that performance is starting to degrade as more stocks are added.

What should you do to improve the performance of your application?
Change the row key syntax in your Cloud Bigtable table to begin with the stock symbol.
Change the row key syntax in your Cloud Bigtable table to begin with a random number per second.
Change the data pipeline to use BigQuery for storing stock trades, and update your application.
Use Cloud Dataflow to write summary of each day's stock trades to an Avro file on Cloud Storage. Update your application to read from Cloud Storage and Cloud Bigtable to compute the responses.




Answer is Change the row key syntax in your Cloud Bigtable table to begin with the stock symbol.

Having EXCHANGE and SYMBOL in the leading positions in the row key will naturally distribute activity.

Reference:
https://cloud.google.com/bigtable/docs/schema-design-time-series
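
A minimal Python sketch of writing a trade with a symbol-leading row key (instance, table, exchange, and column names are hypothetical):

import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("trading-instance").table("stock-trades")

# Leading with exchange and symbol keeps one stock's trades contiguous for
# range scans and spreads writes across tablets instead of hot-spotting on
# the timestamp.
trade_time = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S%f")
row = table.direct_row(f"NASDAQ#GOOG#{trade_time}".encode("utf-8"))
row.set_cell("trade", "price", b"172.35")
row.set_cell("trade", "volume", b"100")
row.commit()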

Question 227

As your organization expands its usage of GCP, many teams have started to create their own projects. Projects are further multiplied to accommodate different stages of deployments and target audiences. Each project requires unique access control configurations. The central IT team needs to have access to all projects.
Furthermore, data from Cloud Storage buckets and BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access control management by minimizing the number of policies.

Which two steps should you take? (Choose two.)
Use Cloud Deployment Manager to automate access provision.
Introduce resource hierarchy to leverage access control policy inheritance.
Create distinct groups for various teams, and specify groups in Cloud IAM policies.
Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets.
For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.




Answers are:
B. Introduce resource hierarchy to leverage access control policy inheritance.
C. Create distinct groups for various teams, and specify groups in Cloud IAM policies.


Google recommends granting access through the resource hierarchy, so that policies set at the folder or organization level are inherited by the projects beneath them, and assigning roles to groups of users with similar responsibilities rather than to individual users.
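
As a rough Python sketch of combining inheritance with groups (folder ID, group, and role are hypothetical), granting the role once at the folder level lets every project underneath inherit it:

from googleapiclient import discovery

crm = discovery.build("cloudresourcemanager", "v2")
folder = "folders/123456789012"

# Bind a role to a group at the folder level; projects under the folder
# inherit the binding, so no per-project policy is needed.
policy = crm.folders().getIamPolicy(resource=folder, body={}).execute()
policy.setdefault("bindings", []).append({
    "role": "roles/bigquery.dataViewer",
    "members": ["group:analytics-team@example.com"],
})
crm.folders().setIamPolicy(resource=folder, body={"policy": policy}).execute()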

Question 228

You are running a pipeline in Cloud Dataflow that receives messages from a Cloud Pub/Sub topic and writes the results to a BigQuery dataset in the EU. Currently, your pipeline is located in europe-west4 and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods, your pipeline is struggling to process records in a timely fashion, when all 3 workers are at maximum CPU utilization.

Which two actions can you take to increase performance of your pipeline? (Choose two.)
Increase the number of max workers
Use a larger instance type for your Cloud Dataflow workers
Change the zone of your Cloud Dataflow pipeline to run in us-central1
Create a temporary table in Cloud Bigtable that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Bigtable to BigQuery
Create a temporary table in Cloud Spanner that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Spanner to BigQuery




Answers are:
A. Increase the number of max workers
B. Use a larger instance type for your Cloud Dataflow workers


With autoscaling enabled, the Dataflow service does not allow user control of the exact number of worker instances allocated to your job. You can still cap the number of workers by specifying the --max_num_workers option when you run your pipeline. In this question the cap is 3 workers, so raising that cap lets the pipeline scale out during peak periods.

For batch jobs, the default machine type is n1-standard-1. For streaming jobs, the default machine type for Streaming Engine-enabled jobs is n1-standard-2 and the default machine type for non-Streaming Engine jobs is n1-standard-4. When using the default machine types, the Dataflow service can therefore allocate up to 4000 cores per job. If you need more cores for your job, you can select a larger machine type.
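
A minimal sketch of both changes using the Python SDK (project, bucket, and the chosen values are hypothetical):

from apache_beam.options.pipeline_options import PipelineOptions

# Raise the autoscaling cap above 3 workers and use a larger machine type,
# keeping the job in the same EU region as the BigQuery dataset.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west4",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
    max_num_workers=10,            # was effectively capped at 3
    machine_type="n1-standard-4",  # larger than the current n1-standard-1
)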

Question 229

You have a data pipeline with a Cloud Dataflow job that aggregates and writes time series metrics to Cloud Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data.

Which two actions should you take? (Choose two.)
Configure your Cloud Dataflow pipeline to use local execution
Increase the maximum number of Cloud Dataflow workers by setting maxNumWorkers in PipelineOptions
Increase the number of nodes in the Cloud Bigtable cluster
Modify your Cloud Dataflow pipeline to use the Flatten transform before writing to Cloud Bigtable
Modify your Cloud Dataflow pipeline to use the CoGroupByKey transform before writing to Cloud Bigtable




Answers are:
B. Increase the maximum number of Cloud Dataflow workers by setting maxNumWorkers in PipelineOptions
C. Increase the number of nodes in the Cloud Bigtable cluster


maxNumWorkers is the maximum number of Compute Engine instances to be made available to your pipeline during execution. Note that this can be higher than the initial number of workers (specified by num_workers) to allow your job to scale up, automatically or otherwise.

Adding nodes to the original cluster: you can add 3 nodes to the cluster, for a total of 6 nodes. The write throughput for the instance doubles, but the instance's data is available in only one zone.

Reference:
https://cloud.google.com/bigtable/docs/performance#performance-write-throughput
https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-pipeline-options
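
For the Bigtable side, a minimal Python sketch of resizing the cluster (instance and cluster names are hypothetical); the maxNumWorkers change is the same PipelineOptions setting shown in the previous question:

from google.cloud import bigtable

# Doubling the cluster from 3 to 6 nodes roughly doubles write throughput
# for the time-series table backing the dashboard.
client = bigtable.Client(project="my-project", admin=True)
cluster = client.instance("metrics-instance").cluster("metrics-cluster")
cluster.reload()
cluster.serve_nodes = 6
cluster.update()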

Question 230

You need to create a new transaction table in Cloud Spanner that stores product sales data. You are deciding what to use as a primary key. From a performance perspective, which strategy should you choose?
The current epoch time
A concatenation of the product name and the current epoch time
A random universally unique identifier number (version 4 UUID)
The original order identification number from the sales system, which is a monotonically increasing integer




Answer is A random universally unique identifier number (version 4 UUID)

Reference:
https://cloud.google.com/spanner/docs/schema-and-data-model#choosing_a_primary_key
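
A minimal Python sketch of inserting a row keyed by a version 4 UUID (instance, database, table, and column names are hypothetical):

import uuid
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("sales-instance").database("sales-db")

# A random UUID avoids monotonically increasing keys, which would funnel all
# inserts into a single Spanner split and create a write hotspot.
with database.batch() as batch:
    batch.insert(
        table="ProductSales",
        columns=("SaleId", "Product", "Amount", "SaleTime"),
        values=[(str(uuid.uuid4()), "widget", 19.99, spanner.COMMIT_TIMESTAMP)],
    )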
