Which Google Cloud Platform service is an alternative to Hadoop with Hive?
Cloud Dataflow
Cloud Bigtable
BigQuery
Cloud Datastore
Answer is BigQuery
Apache Hive is data warehouse software built on top of Apache Hadoop that provides data summarization, query, and analysis. Google BigQuery is an enterprise data warehouse that fills the same role for SQL-based analytics, which makes it the GCP alternative to Hadoop with Hive.
You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)
There are very few occurrences of mutations relative to normal samples.
There are roughly equal occurrences of both normal and mutated samples in the database.
You expect future mutations to have different features from the mutated samples in the database.
You expect future mutations to have similar features to the mutated samples in the database.
You already have labels for which samples are mutated and which are normal in the database.
Answer is A and D
A: Any anomaly detection algorithm assumes a priori that you have far more "normal" samples than mutated ones, so that you can model the normal pattern and flag samples that deviate from it. This requires the number of normal samples to be much larger than the number of mutated samples.
D: "You expect future mutations to have similar features to the mutated samples in the database" - in other words, you expect future anomalies to resemble the anomalies already present in the database.
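For illustration, a minimal sketch of that setup, assuming scikit-learn's IsolationForest and synthetic feature vectors (the feature dimensions, sample counts, and contamination rate are hypothetical): the detector is fit on data that is overwhelmingly normal, and future samples whose features deviate toward the known anomalies are flagged.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Mostly "normal" tissue feature vectors, plus a small number of mutated
# ones whose features are shifted away from the normal cluster.
normal = rng.normal(loc=0.0, scale=1.0, size=(9900, 8))
mutated = rng.normal(loc=4.0, scale=1.0, size=(100, 8))
X_train = np.vstack([normal, mutated])

# Unsupervised anomaly detector; contamination encodes the assumption
# that mutations are rare relative to normal samples (characteristic A).
detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(X_train)

# Future samples are scored against the learned "normal" pattern; this only
# works if future mutations look like past ones (characteristic D).
future_samples = rng.normal(loc=4.0, scale=1.0, size=(5, 8))
print(detector.predict(future_samples))  # -1 = anomalous (mutated), 1 = normal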
You are working on a linear regression model on BigQuery ML to predict a customer's likelihood of purchasing your company's products. Your model uses a city name variable as a key predictive component. In order to train and serve the model, your data must be organized in columns. You want to prepare your data using the least amount of coding while maintaining the predictable variables.
What should you do?
Create a new view with BigQuery that does not include a column with city information.
Use SQL in BigQuery to transform the state column using a one-hot encoding method, and make each city a column with binary values.
Use TensorFlow to create a categorical variable with a vocabulary list. Create the vocabulary file and upload that as part of your model to BigQuery ML.
Use Cloud Data Fusion to assign each city to a region that is labeled as 1, 2, 3, 4, or 5, and then use that number to represent the city in the model.
Answer is Use SQL in BigQuery to transform the state column using a one-hot encoding method, and make each city a column with binary values.
One-hot encoding is a common technique used to handle categorical data in machine learning. This approach will transform the city name variable into a series of binary columns, one for each city. Each row will have a "1" in the column corresponding to the city it represents and "0" in all other city columns. This method is effective for linear regression models as it enables the model to use city data as a series of numeric, binary variables. BigQuery supports SQL operations that can easily implement one-hot encoding, thus minimizing the amount of coding required and efficiently preparing the data for the model.
A removes the city information completely, losing a key predictive component.
C requires additional coding and infrastructure with TensorFlow and vocabulary files outside of what BigQuery already provides.
D transforms the distinct city values into numeric regions, losing granularity of the city data.
By using SQL within BigQuery to one-hot encode cities into multiple yes/no columns, the city data is maintained and formatted appropriately for the BigQuery ML linear regression model with minimal additional coding. This aligns with the requirements stated in the question.
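As an illustration of the pattern (the project, dataset, table, and city values below are hypothetical placeholders), here is a minimal Python sketch that generates the one-hot CASE expressions and runs them with the BigQuery client:

from google.cloud import bigquery

# Hypothetical placeholders; substitute your own project, dataset, and table.
TABLE = "my_project.my_dataset.customer_features"
CITIES = ["London", "Paris", "Tokyo"]  # in practice, derive from SELECT DISTINCT city

# One binary column per city: 1 if the row's city matches, else 0.
one_hot_columns = ",\n  ".join(
    f"CASE WHEN city = '{c}' THEN 1 ELSE 0 END AS city_{c.lower()}"
    for c in CITIES
)
query = f"SELECT * EXCEPT(city),\n  {one_hot_columns}\nFROM `{TABLE}`"

client = bigquery.Client()  # uses application-default credentials
rows = client.query(query).result()  # resulting columns are numeric and model-ready

This keeps the transformation inside BigQuery SQL, so the encoded table can be fed straight to BigQuery ML with minimal additional code.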
You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company's mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer's stated intention for contacting customer service. About 70% of customer requests are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer, more complicated requests.
Which intents should you automate first?
Automate the 10 intents that cover 70% of the requests so that live agents can handle more complicated requests.
Automate the more complicated requests first because those require more of the agents' time.
Automate a blend of the shortest and longest intents to be representative of all intents.
Automate intents in places where common words such as 'payment' appear only once so the software isn't confused.
Answer is Automate the 10 intents that cover 70% of the requests so that live agents can handle more complicated requests.
If your agent will be large or complex, start by building a dialog that only addresses the top level requests. Once the basic structure is established, iterate on the conversation paths to ensure you're covering all of the possible routes an end-user may take.
Therefore, you should initially automate the simpler requests that make up 70% of the volume before automating the more complicated ones.
You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set.
How should you improve the performance of your model?
Increase the share of the test sample in the train-test split.
Try to collect more data and increase the size of your dataset.
Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.
Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.
Answer is Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.
"root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set." means the RMSE of training set is two time of RMSE of test set, which indicates the training is not as good as test, then underfitting.
You are developing a new deep learning model that predicts a customer's likelihood to buy on your ecommerce site. After running an evaluation of the model against both the original training data and new test data, you find that your model is overfitting the data. You want to improve the accuracy of the model when predicting new data.
What should you do?
Increase the size of the training dataset, and increase the number of input features.
Increase the size of the training dataset, and decrease the number of input features.
Reduce the size of the training dataset, and increase the number of input features.
Reduce the size of the training dataset, and decrease the number of input features.
Answer is Increase the size of the training dataset, and decrease the number of input features.
There are two parts, and they work together:
1. Overfitting is reduced by decreasing the number of input features (keeping only the essential ones; see the sketch below).
2. Accuracy on new data is improved by increasing the amount of training data.
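A minimal sketch of the feature-reduction half, assuming scikit-learn and a synthetic dataset (the feature count, label construction, and k are hypothetical): keeping only the most informative inputs leaves the model less room to memorize noise in the training set.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Hypothetical dataset: 1,000 customers, 50 input features, binary "did buy" label.
X = rng.normal(size=(1000, 50))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Keep only the 10 features most associated with the label.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (1000, 10)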
You are implementing a chatbot to help an online retailer streamline their customer service. The chatbot must be able to respond to both text and voice inquiries.
You are looking for a low-code or no-code option, and you want to be able to easily train the chatbot to provide answers to keywords.
What should you do?
Use the Cloud Speech-to-Text API to build a Python application in App Engine.
Use the Cloud Speech-to-Text API to build a Python application in a Compute Engine instance.
Use Dialogflow for simple queries and the Cloud Speech-to-Text API for complex queries.
Use Dialogflow to implement the chatbot, defining the intents based on the most common queries collected.
Answer is Use Dialogflow to implement the chatbot, defining the intents based on the most common queries collected.
Dialogflow is a conversational AI platform that allows for easy implementation of chatbots without needing to code. It has built-in integration for both text and voice input via APIs like Cloud Speech-to-Text. Defining intents and entity types allows you to map common queries and keywords to responses. This would provide a low/no-code way to quickly build and iteratively improve the chatbot capabilities.
Options A and B would require substantial coding to handle speech input and output. Option C still requires coding to handle the complex queries. Only option D leverages the full capabilities of Dialogflow to enable no-code chatbot development and ongoing improvement as more conversational data is collected. Hence, option D is the best approach given the requirements.
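For reference, a minimal sketch of sending a text query to a Dialogflow agent with the google-cloud-dialogflow Python client (the project ID, session ID, and query text are hypothetical; voice inquiries follow the same flow with an audio query input instead of text):

from google.cloud import dialogflow  # pip install google-cloud-dialogflow

def detect_intent_text(project_id: str, session_id: str, text: str) -> None:
    # Each session tracks one end-user's conversation with the agent.
    session_client = dialogflow.SessionsClient()
    session = session_client.session_path(project_id, session_id)

    # Wrap the end-user's message as a text query.
    text_input = dialogflow.TextInput(text=text, language_code="en-US")
    query_input = dialogflow.QueryInput(text=text_input)

    response = session_client.detect_intent(
        request={"session": session, "query_input": query_input}
    )
    # The matched intent and the agent's configured reply.
    print("Matched intent:", response.query_result.intent.display_name)
    print("Response:", response.query_result.fulfillment_text)

detect_intent_text("my-gcp-project", "session-123", "Where is my order?")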