Professional Data Engineer on Google Cloud Platform


Question 1

Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits the training data well. However, when tested against new data, it performs poorly. What method can you employ to address this?
Threading
Serialization
Dropout Methods
Dimensionality Reduction




Answer is Dropout Methods

There are various ways to prevent overfitting in deep neural networks, including:
- Early Stopping
- L1 and L2 Regularization
- Dropout
- Max-Norm Regularization
- Data Augmentation

Dimensionality Reduction is not the best answer here because it is a general preprocessing technique applied to the input features of any machine learning model, whereas dropout is a regularization method designed specifically for neural networks and directly addresses overfitting in a large, deep model.
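As a rough illustration (not tied to the question's actual model), here is a minimal Keras sketch of adding dropout layers to a network; the layer sizes, input shape, loss, and the 0.5 dropout rate are placeholder values:

import tensorflow as tf

# Dropout randomly zeroes a fraction of activations during training,
# which discourages co-adaptation of neurons and reduces overfitting.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dropout(0.5),   # drop 50% of units, training time only
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

At inference time Keras disables the dropout layers automatically, so only training is affected.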

Reference:
https://medium.com/mlreview/a-simple-deep-learning-model-for-stock-price-prediction-using-tensorflow-30505541d877

Question 2

An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?
Use federated data sources, and check data in the SQL query.
Enable BigQuery monitoring in Google Stackdriver and create an alert.
Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.




Answer is Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.

A. Use federated data sources, and check data in the SQL query. - WRONG: this pushes validation into every SQL query instead of handling the bad rows in the pipeline, and gives you no record of which rows were rejected.
B. Enable BigQuery monitoring in Google Stackdriver and create an alert. - WRONG: an alert tells you something went wrong but does nothing to handle the corrupted rows.
C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0. - WRONG: with max_bad_records set to 0 the load job fails on the first malformed row, so bad rows are neither loaded nor captured for analysis (and BigQuery loads are driven with the bq CLI rather than gcloud).
D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis. - CORRECT: Dataflow can validate and transform each row, and writing invalid records to a dead-letter table preserves them for later analysis instead of silently dropping them (see the sketch below).
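A minimal Apache Beam (Python) sketch of the dead-letter pattern; the bucket, table names, schema, and CSV layout below are assumptions for illustration only:

import csv
import apache_beam as beam

class ParseCsvFn(beam.DoFn):
    """Emit parsed rows on the main output; route bad rows to 'dead_letter'."""
    def process(self, line):
        try:
            fields = next(csv.reader([line]))
            yield {'customer_id': fields[0], 'amount': float(fields[1])}
        except Exception:
            # Anything that fails to parse goes to the dead-letter output.
            yield beam.pvalue.TaggedOutput('dead_letter', {'raw_line': line})

with beam.Pipeline() as p:
    parsed = (
        p
        | 'ReadCSV' >> beam.io.ReadFromText('gs://my-bucket/daily-dump/*.csv')
        | 'Parse' >> beam.ParDo(ParseCsvFn()).with_outputs('dead_letter', main='valid'))

    _ = parsed.valid | 'WriteGoodRows' >> beam.io.WriteToBigQuery(
        'my_project:analytics.daily_dump',
        schema='customer_id:STRING,amount:FLOAT')
    _ = parsed.dead_letter | 'WriteDeadLetter' >> beam.io.WriteToBigQuery(
        'my_project:analytics.daily_dump_errors',
        schema='raw_line:STRING')

Keeping the bad rows in a separate table means they can be inspected and reprocessed later instead of silently failing the load.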

Reference:
https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow

Question 3

Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and serves millions of users. How should you design the frontend to respond to a database failure?
Issue a command to restart the database servers.
Retry the query with exponential backoff, up to a cap of 15 minutes.
Retry the query every second until it comes back online to minimize staleness of data.
Reduce the query frequency to once every hour until the database comes back online.




Answer is Retry the query with exponential backoff, up to a cap of 15 minutes.

App Engine applications should manage Cloud SQL database connections carefully. The Cloud SQL connection-management documentation explains the reasoning:

If your application attempts to connect to the database and does not succeed, the database could be temporarily unavailable. In this case, sending too many simultaneous connection requests might waste additional database resources and increase the time needed to recover. Using exponential backoff prevents your application from sending an overwhelming number of connection requests when it can't connect to the database.

This retry only makes sense when first connecting, or when first grabbing a connection from the pool. If errors happen in the middle of a transaction, the application must do the retrying, and it must retry from the beginning of a transaction. So even if your pool is configured properly, the application might still see errors if connections are lost.
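A minimal sketch of capped exponential backoff in Python; run_query stands in for whatever database call the frontend makes, and the exception type is an assumption:

import random
import time

def query_with_backoff(run_query, max_backoff_seconds=15 * 60):
    """Retry run_query with exponential backoff and jitter, capped at 15 minutes."""
    delay = 1
    while True:
        try:
            return run_query()
        except ConnectionError:      # stand-in for the driver's connection error
            time.sleep(delay + random.uniform(0, 1))     # back off, with jitter
            delay = min(delay * 2, max_backoff_seconds)  # double up to the cap

Capping the backoff at 15 minutes matches the app's refresh interval, so a retry is never delayed longer than the data would stay fresh anyway.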

Reference:
https://cloud.google.com/sql/docs/mysql/manage-connections#backoff

Question 4

You are building a new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will be sent only once, but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?
Include ORDER BY DESC on the timestamp column and LIMIT to 1.
Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.
Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.




Answer is Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.

ROW_NUMBER() with PARTITION BY the unique ID (ordered by the event timestamp) assigns 1 to exactly one row in each partition, so filtering on row number = 1 returns each logical record once and discards the duplicates.
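A minimal sketch of this query pattern using the BigQuery Python client; the project, dataset, table, and column names are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Keep only the most recent row per unique_id; duplicates from streaming
# inserts get row numbers greater than 1 and are filtered out.
query = """
    SELECT * EXCEPT(rn)
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY unique_id
                                ORDER BY event_timestamp DESC) AS rn
      FROM `my_project.my_dataset.events`
    )
    WHERE rn = 1
"""

for row in client.query(query).result():
    print(row)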

Reference:
https://www.youtube.com/watch?v=ysArdMImULo&list=PLQMsfKRZZviSLraRoqXulcMKFvIXQkHdA&index=3

Question 5

You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
• No interaction by the user on the site for 1 hour
• Has added more than $30 worth of products to the basket
• Has not completed a transaction

You use Google Cloud Dataflow to process the data and decide if a message should be sent.

How should you design the pipeline?
Use a fixed-time window with a duration of 60 minutes.
Use a sliding time window with a duration of 60 minutes.
Use a session window with a gap time duration of 60 minutes.
Use a global window with a time based trigger with a delay of 60 minutes.




Answer is Use a session window with a gap time duration of 60 minutes.

Dataflow offers three main windowing strategies, each suited to a different kind of use case:
1) Fixed windows
2) Sliding windows
3) Session windows

Fixed windows suit straightforward aggregations and batch-style analysis over regular, non-overlapping intervals.
Sliding windows suit overlapping computations such as moving averages.
Session windows suit per-user activity such as session data, click streams, and real-time gaming analysis.

This question is about per-user activity with an inactivity condition ("no interaction for 1 hour"), so a session window with a 60-minute gap duration is the right choice.
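A minimal Apache Beam (Python) sketch of session windowing; the hard-coded (user_id, timestamp) events and the use of GroupByKey are just for illustration:

import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    _ = (
        p
        | 'CreateEvents' >> beam.Create([('user1', 10.0), ('user1', 20.0), ('user2', 15.0)])
        | 'AddTimestamps' >> beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
        | 'SessionWindow' >> beam.WindowInto(window.Sessions(60 * 60))  # 60-minute gap
        | 'GroupPerUserSession' >> beam.GroupByKey()
        | 'Print' >> beam.Map(print))

Each user's events are grouped into a window that closes only after 60 minutes with no new events, which is exactly the "no interaction for 1 hour" condition in the abandonment rules.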

Reference:
https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines

Question 6

Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster.

What should you do?
Create a Google Cloud Dataflow job to process the data.
Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.




Answer is Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.

Dataproc is the managed service for running existing Hadoop and Spark jobs on GCP, which satisfies the requirement to re-use the jobs while minimizing cluster management. Storing the data in Google Cloud Storage through the Cloud Storage connector keeps it available after the cluster is deleted, whereas HDFS on persistent disks ties the data to the cluster's lifetime. (For very I/O-intensive jobs a small amount of persistent disk can still be attached for scratch HDFS, but the durable copy of the data belongs in Cloud Storage.)
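For example, existing Spark or Hadoop jobs on Dataproc can read and write gs:// paths directly because the Cloud Storage connector is installed by default. A minimal PySpark sketch, with a hypothetical bucket name:

from pyspark.sql import SparkSession

# On Dataproc, gs:// paths are handled by the Cloud Storage connector,
# so the data survives after the cluster is deleted.
spark = SparkSession.builder.appName('gcs-example').getOrCreate()
df = spark.read.csv('gs://my-bucket/input/', header=True)
df.write.parquet('gs://my-bucket/output/')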

Question 7

Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration.

What should you do?
Put the data into Google Cloud Storage.
Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.
Tune the Cloud Dataproc cluster so that there is just enough disk for all data.
Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.




Answer is Put the data into Google Cloud Storage.

A is correct because Google recommends using Cloud Storage instead of HDFS, as it is much more cost effective, especially when jobs aren't running.
B is not correct because this will decrease the compute cost but not the storage cost.
C is not correct because while this will reduce cost somewhat, it will not be as cost effective as using Cloud Storage.
D is not correct because while this will reduce cost somewhat, it will not be as cost effective as using Cloud Storage.

Question 8

You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally. You need to process, store and analyze these very large datasets in real time.

What should you do?
Send the data to Google Cloud Datastore and then export to BigQuery.
Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.
Send the data to Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.
Export logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Cloud Storage, and run an analysis as needed.




Answer is Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.

Cloud Pub/Sub ingests the streaming data from the devices, Cloud Dataflow processes it (Dataflow supports both batch and streaming pipelines), and BigQuery stores the results for real-time analytics. This Pub/Sub to Dataflow to BigQuery pattern is the standard GCP architecture for streaming IoT data.
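A minimal streaming Beam (Python) sketch of the Pub/Sub to Dataflow to BigQuery pattern; the topic, table, schema, and message format are assumptions:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    _ = (
        p
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
            topic='projects/my-project/topics/temperature')
        | 'ParseJson' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            'my_project:iot.temperature_readings',
            schema='device_id:STRING,temp_c:FLOAT,ts:TIMESTAMP',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

Pub/Sub absorbs bursts from the 10,000 devices, Dataflow scales the processing, and BigQuery makes the results queryable in near real time.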

Question 9

You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project.

How should you maintain users' privacy?
Grant the consultant the Viewer role on the project.
Grant the consultant the Cloud Dataflow Developer role on the project.
Create a service account and allow the consultant to log on with it.
Create an anonymized sample of the data for the consultant to work with in a different project.




Answer is Grant the consultant the Cloud Dataflow Developer role on the project.

The Cloud Dataflow Developer role lets the consultant create and manage Dataflow jobs on the project without granting access to view the underlying data, so the users' private data stays protected.
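For reference, a hedged sketch of granting that role programmatically with the Resource Manager API (the project ID and consultant's email are placeholders; gcloud or the Cloud Console would work equally well):

from googleapiclient import discovery

# Read the project's IAM policy, append a binding for the Dataflow Developer
# role, and write the policy back.
service = discovery.build('cloudresourcemanager', 'v1')
policy = service.projects().getIamPolicy(resource='my-project', body={}).execute()
policy['bindings'].append({
    'role': 'roles/dataflow.developer',
    'members': ['user:consultant@example.com'],
})
service.projects().setIamPolicy(
    resource='my-project', body={'policy': policy}).execute()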

Reference:
https://cloud.google.com/dataflow/docs/concepts/access-control#example_role_assignment

Question 10

You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible.

What should you do?
Load the data every 30 minutes into a new partitioned table in BigQuery.
Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery
Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore
Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.




Answer is Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery

Use cases for external data sources include:
- Loading and cleaning your data in one pass by querying the data from an external data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
- Having a small amount of frequently changing data that you join with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated.
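A minimal sketch of defining such a federated (external) table with the BigQuery Python client; the bucket, dataset, table, and schema are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Point a BigQuery table at a CSV in Cloud Storage. Queries always read the
# current file contents, so updating the object in GCS is enough to refresh
# the data; no reload into BigQuery storage is needed.
external_config = bigquery.ExternalConfig('CSV')
external_config.source_uris = ['gs://my-bucket/prices/latest.csv']
external_config.schema = [
    bigquery.SchemaField('good', 'STRING'),
    bigquery.SchemaField('avg_price', 'FLOAT'),
]
external_config.options.skip_leading_rows = 1

table = bigquery.Table('my_project.economics.goods_prices')
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)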

Reference:
https://cloud.google.com/bigquery/external-data-sources
