Professional Data Engineer on Google Cloud Platform

Question 171

You are developing an application on Google Cloud that will automatically generate subject labels for users' blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?
Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.
Call the Cloud Natural Language API from your application. Process the generated Sentiment Analysis as labels.
Build and train a text classification model using TensorFlow. Deploy the model using Cloud Machine Learning Engine. Call the model from your application and process the results as labels.
Build and train a text classification model using TensorFlow. Deploy the model using a Kubernetes Engine cluster. Call the model from your application and process the results as labels.




Answer is Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.

Entity analysis -> Identifies entities within text (documents such as receipts, invoices, and contracts) and labels them by type, such as date, person, contact information, organization, location, event, product, and media.

Sentiment analysis -> Determines the overall opinion, feeling, or attitude expressed in a block of text.
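
For illustration, a minimal Python sketch of the entity-analysis call (the sample text and the salience threshold are assumptions, not part of the question):

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="Our espresso machine review covers burr grinders and milk frothers.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

# Entity analysis returns the entities found in the text along with a
# salience score indicating how central each entity is to the document.
response = client.analyze_entities(document=document)

# Keep the more salient entity names as subject labels for the blog post.
labels = [entity.name for entity in response.entities if entity.salience > 0.05]
print(labels)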

Question 172

You're training a model to predict housing prices based on an available dataset of real estate properties. Your plan is to train a fully connected neural net, and you've discovered that the dataset contains the latitude and longitude of each property. Real estate professionals have told you that the location of a property is highly influential on its price, so you'd like to engineer a feature that incorporates this physical dependency.

What should you do?
Provide latitude and longitude as input vectors to your neural net.
Create a numeric column from a feature cross of latitude and longitude.
Create a feature cross of latitude and longitude, bucketize at the minute level and use L1 regularization during optimization.
Create a feature cross of latitude and longitude, bucketize it at the minute level and use L2 regularization during optimization.




Answer is Create a feature cross of latitude and longitude, bucketize at the minute level and use L1 regularization during optimization.

Use L1 regularization when you need to assign greater importance to more influential features. It shrinks the weights of less important features to 0.
L2 regularization performs better when all input features influence the output and their weights are of roughly equal size.

Reference:
https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/l1-regularization
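
As a rough sketch of the chosen option, using TensorFlow 1.x-style feature columns (the coordinate ranges, bucket boundaries, and optimizer settings are illustrative assumptions):

import numpy as np
import tensorflow as tf

# Raw numeric columns for the coordinates.
latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")

# Bucketize at roughly the minute level (1/60 of a degree per bucket).
lat_buckets = tf.feature_column.bucketized_column(
    latitude, boundaries=list(np.arange(32.0, 42.0, 1.0 / 60.0)))
lon_buckets = tf.feature_column.bucketized_column(
    longitude, boundaries=list(np.arange(-125.0, -114.0, 1.0 / 60.0)))

# Cross the two bucketized columns so the model can learn a price signal
# per small geographic grid cell.
lat_x_lon = tf.feature_column.crossed_column(
    [lat_buckets, lon_buckets], hash_bucket_size=10000)

# The FTRL optimizer's L1 term shrinks the weights of unused grid cells to 0.
model = tf.estimator.LinearRegressor(
    feature_columns=[lat_x_lon],
    optimizer=tf.compat.v1.train.FtrlOptimizer(
        learning_rate=0.05, l1_regularization_strength=0.1))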

Question 173

You work for a bank. You have a labelled dataset that contains information on already granted loan applications and whether these applications have defaulted. You have been asked to train a model to predict default rates for credit applicants.

What should you do?
Increase the size of the dataset by collecting additional data.
Train a linear regression to predict a credit default risk score.
Remove the bias from the data and collect applications that have been declined loans.
Match loan applicants with their social profiles to enable feature engineering.




Answer is Match loan applicants with their social profiles to enable feature engineering.

It isn't B because you don't have a default rate in hand, only whether each application defaulted or not, so this is a classification problem rather than regression. You have to do feature engineering first to prepare the data.

Question 174

You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data.

What should you do?
Add a SideInput that returns a Boolean if the element is corrupt.
Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.




Answer is Add a ParDo transform in Cloud Dataflow to discard corrupt elements.

ParDo applies a per-element transformation and can also emit side outputs, which makes it suitable for discarding corrupt elements.

Reference:
https://beam.apache.org/documentation/programming-guide/
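
A minimal Apache Beam (Python SDK) sketch of such a filtering ParDo; the corruption check, subscription, and table names are assumptions, and the destination table is assumed to already exist:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class DropCorrupt(beam.DoFn):
    """Parses each Pub/Sub message and discards elements that fail validation."""
    def process(self, element):
        try:
            record = json.loads(element)
        except (ValueError, TypeError):
            return  # not valid JSON -> drop the element
        if record.get("device_id") is None or record.get("reading") is None:
            return  # missing required fields -> drop the element
        yield record

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/iot-sub")
     | "FilterCorrupt" >> beam.ParDo(DropCorrupt())
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery("my-project:iot.readings"))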

Question 175

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis.
Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team.

Which two strategies should you adopt?
(Choose two.)
Denormalize the data as much as possible.
Preserve the structure of the data as much as possible.
Use BigQuery UPDATE to further reduce the size of the dataset.
Develop a data pipeline where status updates are appended to BigQuery instead of updated.
Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.




Answers are:
A. Denormalize the data as much as possible.
D. Develop a data pipeline where status updates are appended to BigQuery instead of updated.


Using BigQuery as an OLTP store is considered an anti-pattern. Because OLTP stores have a high volume of updates and deletes, they are a mismatch for the data warehouse use case. To decide which storage option best fits your use case, review the Cloud storage products table.
BigQuery is built for scale and can scale out as the size of the warehouse grows, so there is no need to delete older data. By keeping the entire history, you can deliver more insight on your business. If the storage cost is a concern, you can take advantage of BigQuery's long term storage pricing by archiving older data and using it for special analysis when the need arises. If you still have good reasons for dropping older data, you can use BigQuery's native support for date-partitioned tables and partition expiration. In other words, BigQuery can automatically delete older data.

Reference:
https://cloud.google.com/solutions/bigquery-data-warehouse#handling_change
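
For example, with status events appended rather than updated in place, the data science team can still read the current state of each transaction with a window function. A sketch, with assumed table and column names:

from google.cloud import bigquery

client = bigquery.Client()

# Each status change is appended as a new row; the latest row per
# transaction_id represents the current state.
sql = """
SELECT * EXCEPT(rn)
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY transaction_id
                            ORDER BY status_timestamp DESC) AS rn
  FROM `my-project.finance.transaction_events` AS t
)
WHERE rn = 1
"""

for row in client.query(sql).result():
    print(row.transaction_id, row.status)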

Question 176

You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You've collected a labeled dataset that has on average 1000 examples for each unique component. Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component.
You want to implement the first working version of this app (as Proof-Of-Concept) within a few working days.
What should you do?
Use Cloud Vision AutoML with the existing dataset.
Use Cloud Vision AutoML, but reduce your dataset twice.
Use Cloud Vision API by providing custom labels as recognition hints.
Train your own image recognition model leveraging transfer learning techniques.




Answer is Use Cloud Vision AutoML, but reduce your dataset twice.

A POC is a small-scale experiment.

Question 177

You are working on a niche product in the image recognition domain. Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud.
What should you do?
Use Cloud TPUs without any additional adjustment to your code.
Use Cloud TPUs after implementing GPU kernel support for your custom ops.
Use Cloud GPUs after implementing GPU kernel support for your custom ops.
Stay on CPUs, and increase the size of the cluster you're training your model on.




Answer is Use Cloud GPUs after implementing GPU kernel support for your custom ops.

TPUs are recommended for models with no custom TensorFlow operations inside the main training loop, so options A and B are eliminated because the question says "these ops are used inside your main training loop".
That leaves C and D. CPUs suit simple models that do not take long to train, but the question says training currently takes up to several days, so the existing (likely CPU-based) setup is too slow. GPUs are recommended for "models with a significant number of custom TensorFlow operations that must run at least partially on CPUs", and this model is dominated by custom TensorFlow ops, which points to option C.

Reference:
https://cloud.google.com/tpu/docs/tpus
https://www.tensorflow.org/guide/create_op#gpu_kernels
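
Once GPU kernels are registered for the custom ops (see the create_op guide above), a quick way to confirm that ops actually land on the GPU is device-placement logging; a small sketch using a standard op as a stand-in:

import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))  # confirm GPUs are visible
tf.debugging.set_log_device_placement(True)    # log the device used by each op

a = tf.random.uniform((2048, 2048))
b = tf.random.uniform((2048, 2048))
c = tf.matmul(a, b)  # should be logged as placed on /device:GPU:0 if available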

Question 178

You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set.

How should you improve the performance of your model?
Increase the share of the test sample in the train-test split.
Try to collect more data and increase the size of your dataset.
Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.
Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.




Answer is Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.

"Training error is small and test error is big" is an indication of overfitting.

Question 179

A data scientist has created a BigQuery ML model and asks you to create an ML pipeline to serve predictions. You have a REST API application with the requirement to serve predictions for an individual user ID with latency under 100 milliseconds. You use the following query to generate predictions: SELECT predicted_label, user_id FROM ML.PREDICT(MODEL 'dataset.model', TABLE user_features).
How should you create the ML pipeline?
Add a WHERE clause to the query, and grant the BigQuery Data Viewer role to the application service account.
Create an Authorized View with the provided query. Share the dataset that contains the view with the application service account.
Create a Cloud Dataflow pipeline using BigQueryIO to read results from the query. Grant the Dataflow Worker role to the application service account.
Create a Cloud Dataflow pipeline using BigQueryIO to read predictions for all users from the query. Write the results to Cloud Bigtable using BigtableIO. Grant the Bigtable Reader role to the application service account so that the application can read predictions for individual users from Cloud Bigtable.




Answer is Create a Cloud Dataflow pipeline using BigQueryIO to read predictions for all users from the query. Write the results to Cloud Bigtable using BigtableIO. Grant the Bigtable Reader role to the application service account so that the application can read predictions for individual users from Cloud Bigtable.

The key reason to pick D is the 100 ms latency requirement; Cloud Bigtable provides the lowest read latency for single-row lookups.
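
On the serving side, the application can then read a single user's precomputed prediction from Cloud Bigtable; a sketch with assumed instance, table, and column-family names:

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("predictions-instance").table("user_predictions")

# Row key = user_id; a single-row point read typically completes in
# single-digit milliseconds.
row = table.read_row(b"user_1234")
if row is not None:
    cell = row.cells["predictions"][b"predicted_label"][0]
    print(cell.value.decode("utf-8"))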

Question 180

You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to BigQuery.
You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud.

What should you do?
Use Cloud ML Engine for training existing Spark ML models
Rewrite your models on TensorFlow, and start using Cloud ML Engine
Use Cloud Dataproc for training existing Spark ML models, but start reading data directly from BigQuery
Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery




Answer is Use Cloud Dataproc for training existing Spark ML models, but start reading data directly from BigQuery

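A PySpark sketch of such a Dataproc training job that reads directly from BigQuery (requires the spark-bigquery connector on the cluster; table, feature, and output names are assumptions):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ctr-training").getOrCreate()

# Read the training data directly from BigQuery instead of exported files.
clicks = (spark.read.format("bigquery")
          .option("table", "my-project.ads.click_events")
          .load())

assembler = VectorAssembler(
    inputCols=["ad_position", "hour_of_day", "user_age"],  # assumed features
    outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="clicked")

model = Pipeline(stages=[assembler, lr]).fit(clicks)
model.write().overwrite().save("gs://my-bucket/models/ctr")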
