Professional Data Engineer on Google Cloud Platform

Question 61

Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?
Dataproc Worker
Dataproc Viewer
Dataproc Runner
Dataproc Editor




Answer is Dataproc Worker

Service accounts used with Cloud Dataproc must have the Dataproc Worker role (roles/dataproc.worker), or be granted all of the permissions that the Dataproc Worker role contains.
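
For illustration only, here is a minimal sketch (not part of the question) using the google-cloud-dataproc Python client; the project, region, cluster name, and service account email are placeholders. It shows where the service account that needs the Dataproc Worker role is attached to the cluster's VMs.

```python
# Sketch: create a Dataproc cluster whose VMs run as a custom service account.
# That account must hold roles/dataproc.worker (or equivalent permissions).
from google.cloud import dataproc_v1

project_id = "my-project"        # hypothetical
region = "us-central1"           # hypothetical
service_account = "dataproc-sa@my-project.iam.gserviceaccount.com"  # hypothetical

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        # The cluster's VMs authenticate as this service account.
        "gce_cluster_config": {"service_account": service_account},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is created
```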

Reference:
https://cloud.google.com/dataproc/docs/concepts/service-accounts#important_notes
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts#service_account_requirements_and_limitations

Question 62

What are the minimum permissions needed for a service account used with Google Dataproc?
Execute to Google Cloud Storage; write to Google Cloud Logging
Write to Google Cloud Storage; read to Google Cloud Logging
Execute to Google Cloud Storage; execute to Google Cloud Logging
Read and write to Google Cloud Storage; write to Google Cloud Logging




Answer is Read and write to Google Cloud Storage; write to Google Cloud Logging

Service accounts authenticate applications running on your virtual machine instances to other Google Cloud Platform services. For example, if you write an application that reads and writes files on Google Cloud Storage, it must first authenticate to the Google Cloud Storage API. At a minimum, service accounts used with Cloud Dataproc need permissions to read and write to Google Cloud Storage, and to write to Google Cloud Logging.
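
As a sketch only (assuming the same google-cloud-dataproc client as in the Question 61 example), the snippet below shows the default minimum scopes, the same URLs listed under Reference, set explicitly on the cluster's GCE config.

```python
# Sketch: the minimum scopes granted to the Dataproc service account.
gce_cluster_config = {
    "service_account_scopes": [
        "https://www.googleapis.com/auth/cloud.useraccounts.readonly",
        "https://www.googleapis.com/auth/devstorage.read_write",  # read/write Cloud Storage
        "https://www.googleapis.com/auth/logging.write",          # write Cloud Logging
    ]
}
# This dict would be passed as config["gce_cluster_config"] when creating the
# cluster, as in the sketch under Question 61.
```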

Reference:
https://cloud.google.com/dataproc/docs/concepts/service-accounts#important_notes

Minimum scopes:
https://www.googleapis.com/auth/cloud.useraccounts.readonly
https://www.googleapis.com/auth/devstorage.read_write
https://www.googleapis.com/auth/logging.write

Question 63

Which of the following job types are supported by Cloud Dataproc (select 3 answers)?
Hive
Pig
YARN
Spark




Answers are: Hive, Pig, and Spark

Cloud Dataproc provides out-of-the box and end-to-end support for many of the most popular job types, including Spark, Spark SQL, PySpark, MapReduce, Hive, and Pig jobs.

YARN is a resource manager, not a job type.
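
A minimal sketch, assuming the google-cloud-dataproc Python client and placeholder project, cluster, and file names: it submits a PySpark job, and Hive, Pig, Spark, Spark SQL, and MapReduce jobs follow the same pattern with a different job field (e.g. "hive_job", "pig_job").

```python
# Sketch: submit a PySpark job to an existing Dataproc cluster.
from google.cloud import dataproc_v1

project_id, region, cluster_name = "my-project", "us-central1", "example-cluster"  # hypothetical

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/wordcount.py"},  # hypothetical path
}

result = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()  # waits for the job to finish
print(result.driver_output_resource_uri)
```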

Reference:
https://cloud.google.com/dataproc/docs/resources/faq#what_type_of_jobs_can_i_run

Question 64

By default, which of the following windowing behaviors does Dataflow apply to unbounded data sets?
Windows at every 100 MB of data
Single, Global Window
Windows at every 1 minute
Windows at every 10 minutes




Answer is Single, Global Window

Dataflow's default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections.
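
A minimal Apache Beam (Python) sketch with a hypothetical Pub/Sub topic: the unbounded PCollection stays in the single global window unless you explicitly override it, for example with fixed windows.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")  # hypothetical
     # Without this step, every element stays in the single global window that
     # Dataflow assigns by default, even though the source is unbounded.
     | "FixedWindows" >> beam.WindowInto(window.FixedWindows(60))
     | "KeyByMessage" >> beam.Map(lambda msg: (msg, 1))
     | "CountPerWindow" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```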

Reference:
https://cloud.google.com/dataflow/model/pcollection

Question 65

Which of the following is not true about Dataflow pipelines?
Pipelines are a set of operations
Pipelines represent a data processing job
Pipelines represent a directed graph of steps
Pipelines can share data between instances




Answer is Pipelines can share data between instances

In the Dataflow SDKs, a pipeline represents a data processing job. You build a pipeline by writing a program using a Dataflow SDK. A pipeline consists of a set of operations that can read a source of input data, transform that data, and write out the resulting output. The data and transforms in a pipeline are unique to, and owned by, that pipeline. While your program can create multiple pipelines, pipelines cannot share data or transforms.
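
A minimal Beam (Python) sketch with placeholder file paths: the pipeline reads a source, transforms the data, and writes the output; the PCollections it creates belong to this pipeline alone and cannot be shared with another Pipeline object.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")   # hypothetical path
     | "ToUpper" >> beam.Map(str.upper)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output"))     # hypothetical path
```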

Reference:
https://cloud.google.com/dataflow/model/pipelines

Question 66

Which of the following IAM roles does your Compute Engine service account require to be able to run pipeline jobs?
dataflow.worker
dataflow.compute
dataflow.developer
dataflow.viewer




Answer is dataflow.worker

The dataflow.worker role provides the permissions necessary for a Compute Engine service account to execute work units for a Dataflow pipeline.
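
A minimal sketch, assuming the Beam Python SDK and placeholder names: the service account set in the pipeline options is the account the Dataflow workers run as, and it is that account which must hold roles/dataflow.worker.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions)

options = PipelineOptions()
gcp = options.view_as(GoogleCloudOptions)
gcp.project = "my-project"                  # hypothetical
gcp.region = "us-central1"
gcp.temp_location = "gs://my-bucket/temp"   # hypothetical
# The workers run as this service account (by default, the Compute Engine
# default service account), so it must hold roles/dataflow.worker.
gcp.service_account_email = "dataflow-sa@my-project.iam.gserviceaccount.com"
options.view_as(StandardOptions).runner = "DataflowRunner"

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```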

Reference:
https://cloud.google.com/dataflow/access-control

Question 67

You are developing a software application using Google's Dataflow SDK, and want to use conditionals, for loops, and other complex programming structures to create a branching pipeline. Which component will be used for the data processing operation?
PCollection
Transform
Pipeline
Sink API




Answer is Transform

In Google Cloud, the Dataflow SDK provides a transform component, which is responsible for the data processing operations. You can use conditionals, for loops, and other complex programming structures to create a branching pipeline.
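
A minimal Beam (Python) sketch with hypothetical categories and filter logic: ordinary Python conditionals and for loops decide which transforms are applied, producing a branching pipeline.

```python
import apache_beam as beam

categories = ["error", "warning", "info"]   # hypothetical branches
include_info = False                        # hypothetical condition

with beam.Pipeline() as p:
    lines = p | "Create" >> beam.Create(["error: disk", "warning: cpu", "info: ok"])

    # Ordinary Python control flow decides which transform branches are built.
    for category in categories:
        if category == "info" and not include_info:
            continue  # conditional: skip this branch entirely
        (lines
         | f"Filter_{category}" >> beam.Filter(lambda line, c=category: line.startswith(c))
         | f"Count_{category}" >> beam.combiners.Count.Globally()
         | f"Print_{category}" >> beam.Map(lambda n, c=category: print(c, n)))
```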

Reference:
https://cloud.google.com/dataflow/model/programming-model
https://cloud.google.com/dataflow/docs/concepts/beam-programming-model#concepts

Question 68

Which of the following is NOT true about Dataflow pipelines?
Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner
Dataflow pipelines can consume data from other Google Cloud services
Dataflow pipelines can be programmed in Java
Dataflow pipelines use a unified programming model, so can work both with streaming and batch data sources




Answer is Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner

Dataflow pipelines are built with the Apache Beam SDKs, so the same pipeline code can also run on alternative runners such as Apache Spark and Apache Flink.
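
A minimal sketch with placeholder project, region, and bucket names: the same Beam (Python) pipeline code targets different runners just by changing the runner option.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Local execution with the direct runner:
local_opts = PipelineOptions(["--runner=DirectRunner"])

# The identical pipeline code could instead target Dataflow (or the Spark and
# Flink runners) by swapping the options, e.g.:
# PipelineOptions(["--runner=DataflowRunner", "--project=my-project",
#                  "--region=us-central1", "--temp_location=gs://my-bucket/temp"])

with beam.Pipeline(options=local_opts) as p:
    p | beam.Create(["portable", "pipeline"]) | beam.Map(print)
```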

Reference:
https://cloud.google.com/dataflow/

Question 69

You want to process payment transactions in a point-of-sale application that will run on Google Cloud Platform. Your user base could grow exponentially, but you do not want to manage infrastructure scaling.

Which Google database service should you use?
Cloud SQL
BigQuery
Cloud Bigtable
Cloud Datastore




Answer is Cloud Datastore

It is a fully managed, serverless solution that supports transactions and autoscales (both storage and compute) without any infrastructure for you to manage.
A is wrong: Cloud SQL is a fully managed transactional database, but only the storage grows automatically. As your user base increases you must scale up the instance's CPU and memory yourself, and the question specifically says you do not want to manage infrastructure scaling.
B is wrong: BigQuery is an OLAP service built for analytics. It is NoOps, fully managed, and autoscales, but it is not designed for transactional point-of-sale workloads.
C is wrong: Bigtable is a NoSQL database built for very high write throughput, and to scale it (storage and CPU) you must add nodes yourself, so it does not fit this use case.
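
A minimal sketch, assuming the google-cloud-datastore Python client and a hypothetical "Account" kind: a transaction updates two entities atomically, with no instances or nodes for you to scale.

```python
from google.cloud import datastore

client = datastore.Client()  # project is taken from the environment

def transfer(from_id: str, to_id: str, amount: int) -> None:
    # All reads and writes inside this block commit atomically.
    with client.transaction():
        src = client.get(client.key("Account", from_id))
        dst = client.get(client.key("Account", to_id))
        src["balance"] -= amount
        dst["balance"] += amount
        client.put_multi([src, dst])
```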

Reference:
https://cloud.google.com/datastore/docs/concepts/overview

Question 70

You work for a large bank that operates in locations throughout North America. You are setting up a data storage system that will handle bank account transactions. You require ACID compliance and the ability to access data with SQL.

Which solution is appropriate?
Store transaction data in Cloud Spanner. Enable stale reads to reduce latency.
Store transaction data in Cloud Spanner. Use locking read-write transactions.
Store transaction data in BigQuery. Disable the query cache to ensure consistency.
Store transaction data in Cloud SQL. Use federated queries from BigQuery for analysis.




Answer is Store transaction data in Cloud Spanner. Use locking read-write transactions.

Since the banking transaction system requires ACID compliance and SQL access to the data, Cloud Spanner is the most appropriate solution. Unlike Cloud SQL, Cloud Spanner natively provides ACID transactions and horizontal scalability.

Enabling stale reads in Spanner (option A) would reduce data consistency, violating the ACID compliance requirement of banking transactions.

BigQuery (option C) is built for analytics (OLAP), not for the high-volume, low-latency ACID transactions a banking system requires.

Cloud SQL (option D) provides ACID compliance but does not scale horizontally the way Cloud Spanner does to handle large transaction volumes.

By using Cloud Spanner and specifically locking read-write transactions, ACID compliance is ensured while providing fast, horizontally scalable SQL processing of banking transactions.
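
A minimal sketch, assuming the google-cloud-spanner Python client and a hypothetical Accounts table: run_in_transaction executes the function inside a locking read-write transaction, which is what provides the ACID guarantees.

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("bank-db")  # hypothetical names

def transfer(transaction, from_id, to_id, amount):
    # Both updates are part of one locking read-write transaction.
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance - @amt WHERE AccountId = @id",
        params={"amt": amount, "id": from_id},
        param_types={"amt": spanner.param_types.INT64, "id": spanner.param_types.STRING},
    )
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance + @amt WHERE AccountId = @id",
        params={"amt": amount, "id": to_id},
        param_types={"amt": spanner.param_types.INT64, "id": spanner.param_types.STRING},
    )

database.run_in_transaction(transfer, "acct-001", "acct-002", 100)
```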

Reference:
https://cloud.google.com/blog/topics/developers-practitioners/your-google-cloud-database-options-explained
