Professional Data Engineer on Google Cloud Platform

Question 251

Which of these is NOT a way to customize the software on Dataproc cluster instances?
Set initialization actions
Modify configuration files using cluster properties
Configure the cluster using Cloud Deployment Manager
Log into the master node and make changes from there




Answer is Configure the cluster using Cloud Deployment Manager

You can access the master node of the cluster by clicking the SSH button next to it in the Cloud Console.
You can use the --properties flag of the gcloud dataproc clusters create command in the Google Cloud SDK to modify many common configuration files when creating a cluster.
When creating a Cloud Dataproc cluster, you can specify initialization actions: executables or scripts that Cloud Dataproc runs on all nodes of the cluster immediately after it is set up.
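
For illustration, both supported mechanisms can also be set through the Dataproc Python client when creating a cluster; the project, region, property value, and initialization script below are placeholders, not values from the question.

    from google.cloud import dataproc_v1

    # The regional endpoint must match the cluster's region.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",
        "cluster_name": "example-cluster",
        "config": {
            # Cluster properties: modify common config files (here spark-defaults.conf).
            "software_config": {"properties": {"spark:spark.executor.memory": "4g"}},
            # Initialization action: a script run on every node right after setup.
            "initialization_actions": [
                {"executable_file": "gs://my-bucket/install-extra-libs.sh"}
            ],
        },
    }

    operation = client.create_cluster(
        request={"project_id": "my-project", "region": "us-central1", "cluster": cluster}
    )
    operation.result()  # blocks until the cluster is created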

Reference:
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties

Question 252

You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action on these anomalous events as they occur. Your custom
HTTPS endpoint keeps receiving an inordinate number of duplicate messages.

What is the most likely cause of these duplicate messages?
The message body for the sensor event is too large.
Your custom endpoint has an out-of-date SSL certificate.
The Cloud Pub/Sub topic has too many messages published to it.
Your custom endpoint is not acknowledging messages within the acknowledgement deadline.




Answer is Your custom endpoint is not acknowledging messages within the acknowledgement deadline.

Because the custom endpoint is not acknowledging the messages within the acknowledgement deadline, Pub/Sub treats them as undelivered and keeps resending them.
An out-of-date SSL certificate is unlikely to be the cause: a managed certificate for HTTPS connections is automatically issued and renewed when you map a service to a custom domain.
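
For example, the acknowledgement deadline is configured on the subscription itself. A minimal sketch with the Pub/Sub Python client, using hypothetical project, topic, subscription, and endpoint names:

    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    topic_path = subscriber.topic_path("my-project", "sensor-anomalies")
    subscription_path = subscriber.subscription_path("my-project", "anomaly-push-sub")

    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "push_config": {"push_endpoint": "https://example.com/pubsub/push"},
            # Give the endpoint more time to return a success (2xx) response
            # before Pub/Sub considers the message unacknowledged and redelivers it.
            "ack_deadline_seconds": 60,
        }
    )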

Reference:
https://cloud.google.com/run/docs/mapping-custom-domains

Question 253

Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data.

How should you deduplicate the data most efficiently?
Assign global unique identifiers (GUID) to each data entry.
Compute the hash value of each data entry, and compare it with all historical data.
Store each data entry as the primary key in a separate database and apply an index.
Maintain a database table to store the hash value and other metadata for each data entry.




Answer is Assign global unique identifiers (GUID) to each data entry.

Answer "D" is not as efficient or error-proof due to two reasons
1. You need to calculate hash at sender as well as at receiver end to do the comparison. Waste of computing power.
2. Even if we discount the computing power, we should note that the system is sending inventory information. Two messages sent at different can denote same inventory level (and thus have same hash). Adding sender time stamp to hash will defeat the purpose of using hash as now retried messages will have different timestamp and a different hash.
if timestamp is used as message creation timestamp than that can also be used as a UUID.
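
A minimal sketch of the GUID approach (field names and the in-memory store are illustrative): the sender attaches the identifier once at creation time, so a re-transmission carries the same ID and the receiver can drop it cheaply.

    import uuid

    def build_message(payload: dict) -> dict:
        # Assigned once at creation time; a re-transmission reuses the same GUID.
        return {"id": str(uuid.uuid4()), **payload}

    seen_ids = set()  # in practice this would be a fast, durable key-value store

    def ingest(message: dict) -> None:
        if message["id"] in seen_ids:
            return  # duplicate from a re-transmission; drop it
        seen_ids.add(message["id"])
        # ... process the inventory payload ...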

Question 254

The marketing team at your organization provides regular updates of a segment of your customer dataset. The marketing team has given you a CSV with 1 million records that must be updated in BigQuery. When you use the UPDATE statement in BigQuery, you receive a quotaExceeded error.

What should you do?
Reduce the number of records updated each day to stay within the BigQuery UPDATE DML statement limit.
Increase the BigQuery UPDATE DML statement limit in the Quota management section of the Google Cloud Platform Console.
Split the source CSV file into smaller CSV files in Cloud Storage to reduce the number of BigQuery UPDATE DML statements per BigQuery job.
Import the new records from the CSV file into a new BigQuery table. Create a BigQuery job that merges the new records with the existing records and writes the results to a new BigQuery table.




Answer is Import the new records from the CSV file into a new BigQuery table. Create a BigQuery job that merges the new records with the existing records and writes the results to a new BigQuery table.

BigQuery DML statements have no quota limits.
However, DML statements are counted toward the maximum number of table operations per day and partition modifications per day. DML statements will not fail due to these limits.
In addition, DML statements are subject to the maximum rate of table metadata update operations. If you exceed this limit, retry the operation using exponential backoff between retries.
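
In practice, the recommended approach amounts to loading the CSV into a staging table and applying a single MERGE job. A sketch with the BigQuery Python client, using hypothetical project, dataset, and column names:

    from google.cloud import bigquery

    client = bigquery.Client()

    merge_sql = """
    MERGE `my-project.crm.customers` AS t
    USING `my-project.crm.customer_updates` AS s   -- staging table loaded from the CSV
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
      UPDATE SET t.segment = s.segment
    WHEN NOT MATCHED THEN
      INSERT (customer_id, segment) VALUES (s.customer_id, s.segment)
    """

    # One DML statement covers all 1 million records in a single job.
    client.query(merge_sql).result()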

Reference:
https://cloud.google.com/bigquery/quotas#data-manipulation-language-statements

Question 255

You have data pipelines running on BigQuery, Dataflow, and Dataproc. You need to perform health checks and monitor their behavior, and then notify the team managing the pipelines if they fail. You also need to be able to work across multiple projects. Your preference is to use managed products or features of the platform.

What should you do?
Export the information to Cloud Monitoring, and set up an Alerting policy
Run a Virtual Machine in Compute Engine with Airflow, and export the information to Cloud Monitoring
Export the logs to BigQuery, and set up App Engine to read that information and send emails if you find a failure in the logs
Develop an App Engine application to consume logs using GCP API calls, and send emails if you find a failure in the logs




Answer is Export the information to Cloud Monitoring, and set up an Alerting policy

Cloud Monitoring (formerly known as Stackdriver) is a fully managed monitoring service provided by GCP, which can collect metrics, logs, and other telemetry data from various GCP services, including BigQuery, Dataflow, and Dataproc.

Alerting Policies: Cloud Monitoring allows you to define alerting policies based on specific conditions or thresholds, such as pipeline failures, latency spikes, or other custom metrics. When these conditions are met, Cloud Monitoring can trigger notifications (e.g., emails) to alert the team managing the pipelines.

Cross-Project Monitoring: Cloud Monitoring supports monitoring resources across multiple GCP projects, making it suitable for your requirement to monitor pipelines in multiple projects.

Managed Solution: Cloud Monitoring is a managed service, reducing the operational overhead compared to running your own virtual machine instances or building custom solutions.
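
As a rough sketch of such an alerting policy created with the Cloud Monitoring Python client (the Dataflow failure metric filter, project, and notification channel ID are illustrative placeholders, not the only way to express the conditions):

    import datetime
    from google.cloud import monitoring_v3

    client = monitoring_v3.AlertPolicyServiceClient()
    # A metrics scope on this project can cover resources in multiple projects.
    project_name = "projects/my-monitoring-project"

    policy = monitoring_v3.AlertPolicy(
        display_name="Data pipeline failure",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[
            monitoring_v3.AlertPolicy.Condition(
                display_name="Any failed Dataflow job",
                condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                    filter='resource.type = "dataflow_job" AND '
                           'metric.type = "dataflow.googleapis.com/job/is_failed"',
                    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                    threshold_value=0,
                    duration=datetime.timedelta(minutes=1),
                ),
            )
        ],
        # Notification channel (e.g. email to the pipeline team); the ID is a placeholder.
        notification_channels=[
            "projects/my-monitoring-project/notificationChannels/CHANNEL_ID"
        ],
    )

    client.create_alert_policy(name=project_name, alert_policy=policy)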

Question 256

Your company currently runs a large on-premises cluster using Spark, Hive, and HDFS in a colocation facility. The cluster is designed to accommodate peak usage on the system; however, many jobs are batch in nature, and usage of the cluster fluctuates quite dramatically. Your company is eager to move to the cloud to reduce the overhead associated with on-premises infrastructure and maintenance and to benefit from the cost savings. They are also hoping to modernize their existing infrastructure to use more serverless offerings in order to take advantage of the cloud. Because of the timing of their contract renewal with the colocation facility, they have only 2 months for their initial migration.

How would you recommend they approach their upcoming migration strategy so they can maximize their cost savings in the cloud while still executing the migration in time?
Migrate the workloads to Dataproc plus HDFS; modernize later.
Migrate the workloads to Dataproc plus Cloud Storage; modernize later.
Migrate the Spark workload to Dataproc plus HDFS, and modernize the Hive workload for BigQuery.
Modernize the Spark workload for Dataflow and the Hive workload for BigQuery.




Answer is Migrate the workloads to Dataproc plus Cloud Storage; modernize later.

Based on the time constraint of 2 months and the goal of maximizing cost savings, this option is recommended. The key reasons are:
- Dataproc provides a fast, native migration path from on-premises Spark and Hive to the cloud, which makes the 2-month timeline achievable.
- Using Cloud Storage instead of HDFS decouples storage from compute, so clusters do not need to stay running to hold the data; that yields cost savings for fluctuating workloads.
- Further optimization and modernization to serverless offerings (Dataflow, BigQuery) can happen incrementally later, without time pressure.

Option A still requires managing HDFS.
Options C and D require full modernization of the workloads within 2 months, which is likely infeasible.

Therefore, migrating to Dataproc with Cloud Storage fast tracks the migration within 2 months while realizing immediate cost savings, enabling the flexibility to iteratively modernize and optimize the workloads over time.
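
To illustrate why this is a lift-and-shift rather than a rewrite: with the Cloud Storage connector that ships with Dataproc, existing Spark code typically only needs its hdfs:// paths swapped for gs:// paths. The bucket and paths below are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("inventory-report").getOrCreate()

    # Same Spark job as on-premises; only the storage URIs change from hdfs:// to gs://.
    df = spark.read.parquet("gs://my-migrated-warehouse/inventory/")
    (df.groupBy("store_id")
       .count()
       .write.mode("overwrite")
       .parquet("gs://my-migrated-warehouse/reports/inventory_counts/"))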

Reference:
https://cloud.google.com/bigquery/docs/migration/hive#data_migration
https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#overview

Question 257

You are migrating a table to BigQuery and are deciding on the data model. Your table stores information related to purchases made across several store locations and includes information like the time of the transaction, items purchased, the store ID, and the city and state in which the store is located. You frequently query this table to see how many of each item were sold over the past 30 days and to look at purchasing trends by state, city, and individual store.

How would you model this table for the best query performance?
Partition by transaction time; cluster by state first, then city, then store ID.
Partition by transaction time; cluster by store ID first, then city, then state.
Top-level cluster by state first, then city, then store ID.
Top-level cluster by store ID first, then city, then state.




Answer is Partition by transaction time; cluster by state first, then city, then store ID.

A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
You can partition BigQuery tables by:
- Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.

Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. Clustered tables can improve query performance and reduce query costs.
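
For example, the chosen model could be declared with DDL like the following, submitted here through the BigQuery Python client; the project, dataset, and column names are assumptions:

    from google.cloud import bigquery

    client = bigquery.Client()

    ddl = """
    CREATE TABLE `my-project.sales.purchases` (
      transaction_time TIMESTAMP,
      item_id STRING,
      store_id STRING,
      city STRING,
      state STRING
    )
    PARTITION BY DATE(transaction_time)   -- 30-day filters prune to ~30 daily partitions
    CLUSTER BY state, city, store_id      -- broadest filter column first, narrowest last
    """

    client.query(ddl).result()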

Reference:
https://cloud.google.com/bigquery/docs/partitioned-tables
https://cloud.google.com/bigquery/docs/clustered-tables

Question 258

Your company is implementing a data warehouse using BigQuery, and you have been tasked with designing the data model.
You move your on-premises sales data warehouse with a star data schema to BigQuery but notice performance issues when querying the data of the past 30 days.

Based on Google's recommended practices, what should you do to speed up the query without increasing storage costs?
Denormalize the data.
Shard the data by customer ID.
Materialize the dimensional data in views.
Partition the data by transaction date.




Answer is Partition the data by transaction date.

BigQuery supports partitioned tables, where the data is divided into smaller, manageable portions based on a chosen column (e.g., transaction date). By partitioning the data on the transaction date, BigQuery can query only the partitions that contain data for the past 30 days, reducing the amount of data that needs to be scanned.

Partitioning does not increase storage costs. It organizes the existing data in a more structured manner, allowing for better query performance without any additional storage expense.
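
For instance, once the table is partitioned on the transaction date, a 30-day report such as the sketch below (hypothetical table and column names) scans only the recent partitions:

    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
    SELECT SUM(sale_amount) AS revenue_last_30_days
    FROM `my-project.dw.sales_fact`
    WHERE transaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    """

    # The filter on the partitioning column limits the bytes scanned to
    # roughly 30 daily partitions instead of the whole fact table.
    for row in client.query(sql).result():
        print(row.revenue_last_30_days)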

Reference:
https://cloud.google.com/bigquery/docs/partitioned-tables

Question 259

You have uploaded 5 years of log data to Cloud Storage. A user reported that some data points in the log data are outside of their expected ranges, which indicates errors.
You need to address this issue and be able to run the process again in the future while keeping the original data for compliance reasons.

What should you do?
Import the data from Cloud Storage into BigQuery. Create a new BigQuery table, and skip the rows with errors.
Create a Compute Engine instance and create a new copy of the data in Cloud Storage. Skip the rows with errors.
Create a Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.
Create a Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to the same dataset in Cloud Storage.




Answer is Create a Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.

Option A would remove data which may be needed for compliance reasons. Keeping the original data is preferred.
Option B makes a copy of the data but still removes potentially useful records. Additional storage costs would be incurred as well.
Option C uses Dataflow to clean the data by setting out of range values while keeping the original data intact. The fixed records are written to a new location for further analysis. This meets the requirements.
Option D writes the fixed data back to the original location, overwriting the original data. This would violate the compliance needs to keep the original data untouched.

So option C leverages Dataflow to properly clean the data while preserving the original data for compliance, at reasonable operational costs. This best achieves the stated requirements.
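
A minimal Apache Beam sketch of option C; the bucket paths, field name, valid range, and default value are all assumptions:

    import json
    import apache_beam as beam

    EXPECTED_MIN, EXPECTED_MAX = 0.0, 100.0  # hypothetical valid range
    DEFAULT_VALUE = 0.0                      # hypothetical replacement default

    def fix_out_of_range(line: str) -> str:
        record = json.loads(line)
        value = record.get("value", DEFAULT_VALUE)
        if not EXPECTED_MIN <= value <= EXPECTED_MAX:
            record["value"] = DEFAULT_VALUE
        return json.dumps(record)

    # Add DataflowRunner pipeline options to run this as a Dataflow job.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "ReadOriginalLogs" >> beam.io.ReadFromText("gs://my-logs/raw/*.json")
            | "FixOutOfRange" >> beam.Map(fix_out_of_range)
            # Write to a new location so the original data stays untouched for compliance.
            | "WriteCleaned" >> beam.io.WriteToText("gs://my-logs/cleaned/part")
        )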

Question 260

You are testing a Dataflow pipeline to ingest and transform text files.
The files are compressed gzip, errors are written to a dead-letter queue, and you are using SideInputs to join data.
You noticed that the pipeline is taking longer to complete than expected.

What should you do to expedite the Dataflow job?
Switch to compressed Avro files.
Reduce the batch size.
Retry records that throw an error.
Use CoGroupByKey instead of the SideInput.




Answer is Use CoGroupByKey instead of the SideInput.

Here's why this approach is beneficial:
1. Efficiency in Handling Large Datasets: SideInputs are not optimal for large datasets because they require that the entire dataset be available to each worker. This can lead to performance bottlenecks, especially if the dataset is large. CoGroupByKey, on the other hand, is more efficient for joining large datasets because it groups elements by key and allows the pipeline to process each key-group separately.
2. Scalability: CoGroupByKey is more scalable than SideInputs for large-scale data processing. It distributes the workload more evenly across the Dataflow workers, which can significantly improve the performance of your pipeline.
3. Better Resource Utilization: By using CoGroupByKey, the Dataflow job can make better use of its resources, as it doesn't need to replicate the entire dataset to each worker. This results in faster processing times and better overall efficiency.

The other options may not be as effective:
A (Switch to compressed Avro files): While Avro is a good format for certain types of data processing, simply changing the file format from gzip to Avro may not address the underlying issue causing the delay, especially if the problem is related to the way data is being joined or processed.
B (Reduce the batch size): Reducing the batch size could potentially increase overhead and might not significantly improve the processing time, especially if the bottleneck is due to the method of data joining.
C (Retry records that throw an error): Retrying errors could be useful in certain contexts, but it's unlikely to speed up the pipeline if the delay is due to inefficiencies in data processing methods like the use of SideInputs.
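
A minimal sketch of replacing the side-input join with CoGroupByKey; the keys and values are toy data, not the pipeline from the question:

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        events = pipeline | "Events" >> beam.Create([("id1", "event_a"), ("id2", "event_b")])
        lookup = pipeline | "Lookup" >> beam.Create([("id1", "meta_1"), ("id2", "meta_2")])

        # CoGroupByKey joins the two keyed PCollections via a shuffle, so the lookup
        # data is never broadcast in full to every worker as a side input would be.
        joined = (
            {"events": events, "lookup": lookup}
            | "JoinByKey" >> beam.CoGroupByKey()
            | "Flatten" >> beam.MapTuple(
                lambda key, grouped: (key, list(grouped["events"]), list(grouped["lookup"]))
            )
        )
        joined | "Print" >> beam.Map(print)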

Reference:
https://cloud.google.com/architecture/building-production-ready-data-pipelines-using-dataflow-developing-and-testing#choose_correctly_between_side_inputs_or_cogroupbykey_for_joins
