Professional Data Engineer on Google Cloud Platform


Question 111

You have developed three data processing jobs. One executes a Cloud Dataflow pipeline that transforms data uploaded to Cloud Storage and writes results to BigQuery. The second ingests data from on-premises servers and uploads it to Cloud Storage. The third is a Cloud Dataflow pipeline that gets information from third-party data providers and uploads the information to Cloud Storage. You need to be able to schedule and monitor the execution of these three workflows and manually execute them when needed.

What should you do?
Create a Directed Acyclic Graph in Cloud Composer to schedule and monitor the jobs.
Use Stackdriver Monitoring and set up an alert with a Webhook notification to trigger the jobs.
Develop an App Engine application to schedule and request the status of the jobs using GCP API calls.
Set up cron jobs in a Compute Engine instance to schedule and monitor the pipelines using GCP API calls.




Answer is Create a Directed Acyclic Graph in Cloud Composer to schedule and monitor the jobs.

Cloud Composer is used to schedule, monitor, and manually trigger interdependent jobs like these through a Directed Acyclic Graph (DAG).
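
A minimal Airflow DAG sketch for Cloud Composer, assuming the two Dataflow pipelines are launched from templates and the on-premises upload is wrapped in a script; all job, template, and bucket names are hypothetical placeholders:

```python
# Hypothetical DAG: schedule, monitor, and manually trigger the three jobs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

with DAG(
    dag_id="daily_ingest_and_transform",
    schedule_interval="@daily",        # scheduled runs; manual runs via "Trigger DAG"
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Job 2: copy data from on-premises servers to Cloud Storage (placeholder script).
    ingest_onprem = BashOperator(
        task_id="ingest_onprem",
        bash_command="python /home/airflow/gcs/dags/scripts/upload_onprem.py",
    )

    # Job 3: Dataflow pipeline pulling third-party data into Cloud Storage.
    ingest_third_party = DataflowTemplatedJobStartOperator(
        task_id="ingest_third_party",
        template="gs://my-bucket/templates/third_party_ingest",
        location="us-central1",
    )

    # Job 1: Dataflow pipeline transforming Cloud Storage data into BigQuery.
    transform_to_bq = DataflowTemplatedJobStartOperator(
        task_id="transform_to_bq",
        template="gs://my-bucket/templates/gcs_to_bigquery",
        location="us-central1",
    )

    # Run the two ingestion jobs first, then the transformation.
    [ingest_onprem, ingest_third_party] >> transform_to_bq
```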

Question 112

You have Cloud Functions written in Node.js that pull messages from Cloud Pub/Sub and send the data to BigQuery. You observe that the message processing rate on the Pub/Sub topic is orders of magnitude higher than anticipated, but there is no error logged in Stackdriver Log Viewer.

What are the two most likely causes of this problem? (Choose two.)
Publisher throughput quota is too small.
Total outstanding messages exceed the 10-MB maximum.
Error handling in the subscriber code is not handling run-time errors properly.
The subscriber code cannot keep up with the messages.
The subscriber code does not acknowledge the messages that it pulls.




Answers are:
C. Error handling in the subscriber code is not handling run-time errors properly.
E. The subscriber code does not acknowledge the messages that it pulls.


By not acknowledging a pulled message, the subscriber causes Cloud Pub/Sub to redeliver it, so messages accumulate instead of being consumed and removed from the subscription. The same thing happens if the subscriber holds the lease on a message when an error occurs. This reduces the overall processing rate because messages get stuck on the first subscriber. In addition, errors in a Cloud Function do not show up in Stackdriver Log Viewer if they are not handled correctly.
A: The publisher rate is not the problem, because the observed symptom is a higher number of messages, not a lower one.
B: If messages exceed the 10-MB maximum, they cannot be published at all.
D: Cloud Functions scale automatically, so the subscriber code should be able to keep up.
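
For illustration, a minimal pull-subscriber sketch (in Python rather than the Node.js used in the question) that acknowledges a message only after a successful BigQuery insert and logs run-time errors instead of swallowing them; project, subscription, and table names are hypothetical:

```python
# Hypothetical pull subscriber: ack only after a successful BigQuery insert.
import json
import logging

from google.cloud import bigquery, pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
bq_client = bigquery.Client()
subscription_path = subscriber.subscription_path("my-project", "events-sub")
table_id = "my-project.analytics.events"


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    try:
        row = json.loads(message.data.decode("utf-8"))
        errors = bq_client.insert_rows_json(table_id, [row])
        if errors:
            logging.error("BigQuery insert failed: %s", errors)
            message.nack()          # let Pub/Sub redeliver the message
            return
        message.ack()               # without this, the backlog keeps growing
    except Exception:
        logging.exception("Unhandled error while processing message")
        message.nack()


streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()      # block so the subscriber keeps running
```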

Question 113

Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers.

Which cloud-native service should you use to orchestrate the entire pipeline?
Cloud Dataflow
Cloud Composer
Cloud Dataprep
Cloud Dataproc




Answer is Cloud Composer

Hybrid and multi-cloud
Ease your transition to the cloud or maintain a hybrid data environment by orchestrating workflows that cross between on-premises and the public cloud. Create workflows that connect data, processing, and services across clouds to give you a unified data environment.

Reference:
https://cloud.google.com/composer#section-2

Question 114

You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. Which solution should you choose?
Create an authorized view on the BigQuery table to control data access, and provide third-party companies with access to that view.
Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.
Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.
Create a Cloud Dataflow job that reads the data in frequent time intervals, and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.




Answer is Create an authorized view on the BigQuery table to control data access, and provide third-party companies with access to that view.

An authorized view keeps the shared data current and avoids storing (and paying for) an extra copy of the dataset just to share it. B and D are not cost-optimal, and C does not guarantee that the shared data stays up to date.
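
A sketch of the authorized-view setup with the BigQuery Python client, assuming a private source dataset, a separate dataset that holds the shared view, and a third-party group address (all names hypothetical):

```python
# Hypothetical setup: share current data through an authorized view.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1. Create the view in a dataset the third party will be granted access to.
view = bigquery.Table("my-project.shared_views.orders_view")
view.view_query = (
    "SELECT order_id, amount, order_date "
    "FROM `my-project.private_data.orders`"
)
view = client.create_table(view)

# 2. Authorize the view against the source dataset so it can read the table
#    even though the third party cannot query the table directly.
source = client.get_dataset("my-project.private_data")
entries = list(source.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source.access_entries = entries
client.update_dataset(source, ["access_entries"])

# 3. Grant the third party read access to the dataset that holds the view.
shared = client.get_dataset("my-project.shared_views")
entries = list(shared.access_entries)
entries.append(
    bigquery.AccessEntry("READER", "groupByEmail", "partner-analysts@example.com")
)
shared.access_entries = entries
client.update_dataset(shared, ["access_entries"])
```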

Question 115

A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time. This is then loaded into BigQuery. Analysts in your company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to implement a change that would improve query performance in BigQuery.

What should you do?
Implement clustering in BigQuery on the ingest date column.
Implement clustering in BigQuery on the package-tracking ID column.
Tier older data onto Cloud Storage files, and leverage extended tables.
Re-create the table using data partitioning on the package delivery date.




Answer is Implement clustering in BigQuery on the package-tracking ID column.

In general, there are two typical usage patterns for clustering within a data warehouse:

Clustering on columns that have a very high number of distinct values, like userId or transactionId.

Clustering on multiple columns that are frequently used together. When clustering by multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data. You can filter by any prefix of the clustering columns and get the benefits of clustering, like regionId, shopId and productId together; or regionId and shopId; or just regionId.

Reference:
https://cloud.google.com/blog/products/data-analytics/skip-the-maintenance-speed-up-queries-with-bigquerys-clustering
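
A minimal sketch of turning on clustering for the tracking ID with the BigQuery Python client; project, dataset, table, and column names are hypothetical. Only newly written data is clustered after such an update, so re-creating the table with CREATE TABLE ... CLUSTER BY ... AS SELECT may still be needed for the historical partitions:

```python
# Hypothetical change: cluster the tracking table on the package-tracking ID.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

table = client.get_table("my-project.logistics.package_events")
table.clustering_fields = ["tracking_id"]   # high-cardinality column used in filters
client.update_table(table, ["clustering_fields"])
```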

Question 116

You are operating a Cloud Dataflow streaming pipeline. The pipeline aggregates events from a Cloud Pub/Sub subscription source, within a window, and sinks the resulting aggregation to a Cloud Storage bucket. The source has consistent throughput. You want to monitor an alert on behavior of the pipeline with Cloud Stackdriver to ensure that it is processing data. Which Stackdriver alerts should you create?
An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/used_bytes for the destination
An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/used_bytes for the destination
An alert based on a decrease of instance/storage/used_bytes for the source and a rate of change increase of subscription/num_undelivered_messages for the destination
An alert based on an increase of instance/storage/used_bytes for the source and a rate of change decrease of subscription/num_undelivered_messages for the destination




Answer is An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/used_bytes for the destination

If the pipeline stops processing data, unacknowledged messages accumulate on the source subscription (num_undelivered_messages increases) while the destination bucket stops growing (the rate of change of used_bytes decreases), so alerting on both signals verifies that the pipeline is processing data end to end.
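
As a rough illustration, the backlog half of the alert could be created with the Cloud Monitoring Python client roughly as follows (project, threshold, and display names are hypothetical; a companion condition on the destination bucket's storage metric would cover the rate-of-change decrease):

```python
# Hypothetical alert policy: fire when the subscription backlog keeps growing.
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

backlog_condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Pub/Sub backlog increasing",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" '
            'AND resource.type="pubsub_subscription"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=1000,        # a sustained backlog means the pipeline stalled
        duration={"seconds": 300},
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Dataflow pipeline not processing data",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[backlog_condition],
)
client.create_alert_policy(name="projects/my-project", alert_policy=policy)
```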

Question 117

You currently have a single on-premises Kafka cluster in a data center in the us-east region that is responsible for ingesting messages from IoT devices globally. Because large parts of globe have poor internet connectivity, messages sometimes batch at the edge, come in all at once, and cause a spike in load on your Kafka cluster. This is becoming difficult to manage and prohibitively expensive.

What is the Google-recommended cloud native architecture for this scenario?
Edge TPUs as sensor devices for storing and transmitting the messages.
Cloud Dataflow connected to the Kafka cluster to scale the processing of incoming messages.
An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub.
A Kafka cluster virtualized on Compute Engine in us-east with Cloud Load Balancing to connect to the devices around the world.




Answer is An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub.

The cloud-native alternative to Kafka on Google Cloud is Cloud Pub/Sub, and Cloud Pub/Sub paired with Cloud Dataflow for reading and processing the messages is the Google-recommended architecture.
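
A minimal Apache Beam (Python) sketch of the consumer side of that architecture, reading device messages from Cloud Pub/Sub in a streaming Dataflow job and writing them to an existing BigQuery table; topic, bucket, and table names are hypothetical:

```python
# Hypothetical streaming Dataflow job: Pub/Sub in, BigQuery out.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/iot-events")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:iot.device_events",           # table assumed to exist
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```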

Question 118

You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers.
What should you do?
Store and process the entire dataset in BigQuery.
Store and process the entire dataset in Cloud Bigtable.
Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.
Store the warm data as files in Cloud Storage, and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.




Answer is Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.

BigQuery handles the data warehouse-style analytics, while the compressed copy in Cloud Storage exposes the dataset as files for batch analysis tools in other cloud providers.

Question 119

You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs.
How should you organize your data in BigQuery and store your backups?
Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.
Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.
Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.
Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.




Answer is Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.

Store your data in different tables for specific time periods. This method ensures that you will need to restore only a subset of data to a new table, rather than a whole dataset. Exporting the monthly tables to compressed files in Cloud Storage also keeps backup storage costs lower than duplicating the data in BigQuery.

Reference:
https://cloud.google.com/architecture/dr-scenarios-for-data#managed-database-services-on-gcp
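
A hedged sketch of the monthly backup step with the BigQuery Python client, exporting one month's table to compressed CSV files in Cloud Storage (table and bucket names are hypothetical):

```python
# Hypothetical monthly backup: export one month's table as compressed CSV.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)
extract_job = client.extract_table(
    "my-project.analytics.events_202401",
    "gs://my-backup-bucket/events/202401/*.csv.gz",
    job_config=job_config,
)
extract_job.result()  # restore later by loading the files into a new table
```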

Question 120

You are building a new application that you need to collect data from in a scalable way. Data arrives continuously from the application throughout the day, and you expect to generate approximately 150 GB of JSON data per day by the end of the year. Your requirements are:
- Decoupling producer from consumer
- Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely
- Near real-time SQL query
- Maintain at least 2 years of historical data, which will be queried with SQL

Which pipeline should you use to meet these requirements?
Create an application that provides an API. Write a tool to poll the API and write data to Cloud Storage as gzipped JSON files.
Create an application that writes to a Cloud SQL database to store the data. Set up periodic exports of the database to write to Cloud Storage and load into BigQuery.
Create an application that publishes events to Cloud Pub/Sub, and create Spark jobs on Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk.
Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery.




Answer is Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery.

BigQuery satisfies the requirement to query two years of historical data with SQL in near real time, writing Avro files to Cloud Storage provides space- and cost-efficient storage of the raw data indefinitely, and Pub/Sub with Dataflow decouples the producer from the consumer while scaling with the continuous input volume.
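
A minimal sketch of the producer side, showing how the application decouples itself from the consumer by publishing each JSON event to Cloud Pub/Sub (project and topic names are hypothetical):

```python
# Hypothetical producer: publish each JSON event to Pub/Sub.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "app-events")


def publish_event(event: dict) -> None:
    data = json.dumps(event).encode("utf-8")
    future = publisher.publish(topic_path, data)
    future.result()  # block until Pub/Sub acknowledges the publish
```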
