Professional Data Engineer on Google Cloud Platform


Question 121

You have several Spark jobs that run on a Cloud Dataproc cluster on a schedule. Some of the jobs run in sequence, and some of the jobs run concurrently. You need to automate this process. What should you do?
Create a Cloud Dataproc Workflow Template
Create an initialization action to execute the jobs
Create a Directed Acyclic Graph in Cloud Composer
Create a Bash script that uses the Cloud SDK to create a cluster, execute jobs, and then tear down the cluster




Answer is Create a Directed Acyclic Graph in Cloud Composer

A Dataproc Workflow Template defines the graph of jobs but has no built-in scheduling capability, so it would still need an external trigger. Cloud Composer (managed Apache Airflow) can both schedule the pipeline and express the sequential and concurrent dependencies between the Spark jobs, which is why a DAG in Cloud Composer is the answer.
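
For illustration, a minimal Cloud Composer DAG sketch, assuming the apache-airflow-providers-google package; the project, region, cluster, class, and jar names are placeholders. Two Spark jobs run concurrently and a third runs only after both succeed:

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PROJECT_ID = "my-project"        # placeholder
REGION = "us-central1"           # placeholder
CLUSTER_NAME = "spark-cluster"   # placeholder

def spark_job(main_class, jar_uri):
    # Build a Dataproc Spark job spec (names are illustrative only).
    return {
        "reference": {"project_id": PROJECT_ID},
        "placement": {"cluster_name": CLUSTER_NAME},
        "spark_job": {"main_class": main_class, "jar_file_uris": [jar_uri]},
    }

with DAG(
    dag_id="scheduled_spark_jobs",
    schedule_interval="0 2 * * *",   # daily at 02:00
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    job_a = DataprocSubmitJobOperator(
        task_id="job_a", project_id=PROJECT_ID, region=REGION,
        job=spark_job("com.example.JobA", "gs://my-bucket/job-a.jar"),
    )
    job_b = DataprocSubmitJobOperator(
        task_id="job_b", project_id=PROJECT_ID, region=REGION,
        job=spark_job("com.example.JobB", "gs://my-bucket/job-b.jar"),
    )
    job_c = DataprocSubmitJobOperator(
        task_id="job_c", project_id=PROJECT_ID, region=REGION,
        job=spark_job("com.example.JobC", "gs://my-bucket/job-c.jar"),
    )
    [job_a, job_b] >> job_c   # a and b run concurrently; c runs after both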

Question 122

Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to allow them to work with multiple GCP products in their projects. Your organization requires that all BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects.

What should you do?
Enable data access logs in each Data Analyst's project. Restrict access to Stackdriver Logging via Cloud IAM roles.
Export the data access logs via a project-level export sink to a Cloud Storage bucket in the Data Analysts' projects. Restrict access to the Cloud Storage bucket.
Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created projects for audit logs. Restrict access to the project with the exported logs.
Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.




Answer is Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.

An aggregated sink creates a single sink for all projects under an organization or folder; the destination can be a Cloud Storage bucket, a Pub/Sub topic, a BigQuery table, or a Cloud Logging bucket. Without an aggregated sink, a separate export would have to be configured in each project individually, which is cumbersome.

Reference:
https://cloud.google.com/logging/docs/export/aggregated_sinks
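
For illustration, a sketch of creating the organization-level aggregated sink with the google-cloud-logging Python client (the gcloud CLI can do the same). The organization ID, sink name, bucket, and log filter are placeholders; the filter shown is just one way to match BigQuery data access logs:

from google.cloud import logging_v2

client = logging_v2.ConfigServiceV2Client()

sink = logging_v2.types.LogSink(
    name="bq-data-access-to-gcs",                           # placeholder sink name
    destination="storage.googleapis.com/audit-logs-bucket",  # bucket in the audit project
    filter=(
        'logName:"cloudaudit.googleapis.com%2Fdata_access" '
        'AND protoPayload.serviceName="bigquery.googleapis.com"'
    ),
    include_children=True,  # aggregate logs from every project under the organization
)

client.create_sink(
    request={"parent": "organizations/123456789012", "sink": sink}  # placeholder org ID
)

The sink's writer identity must then be granted write access to the bucket, and the bucket's retention settings can enforce the 6-month retention requirement.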

Question 123

You are operating a streaming Cloud Dataflow pipeline. Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update. What should you do?
Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing job name
Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to a new unique job name
Stop the Cloud Dataflow pipeline with the Cancel option. Create a new Cloud Dataflow job with the updated code
Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code




Answer is Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code

A change to the windowing algorithm and triggering strategy is a major change to the pipeline, so updating it in flight is not appropriate. Draining the running job first lets all buffered, in-flight data be fully processed (so no data is lost), after which a new job can be started with the updated code.

Reference:
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline
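
As a sketch, the drain can also be requested programmatically through the Dataflow REST API (the same operation as the console's Drain button or gcloud dataflow jobs drain); the project, region, and job ID below are placeholders:

from googleapiclient.discovery import build

# Ask the running streaming job to drain: stop reading new input,
# finish processing buffered data, then terminate.
dataflow = build("dataflow", "v1b3")
dataflow.projects().locations().jobs().update(
    projectId="my-project",                    # placeholder
    location="us-central1",                    # placeholder
    jobId="2024-01-01_00_00_00-1234567890",    # placeholder job ID
    body={"requestedState": "JOB_STATE_DRAINED"},
).execute()

# Once the job reaches the Drained state, submit the updated pipeline as a new job.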

Question 124

You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times.

Which service should you use to manage the execution of these jobs?
Cloud Scheduler
Cloud Dataflow
Cloud Functions
Cloud Composer




Answer is Cloud Composer

Cloud Composer (managed Apache Airflow) is built for orchestrating interdependent, long-running steps in a specific order, with per-task retry policies, and its operators cover shell scripts, Hadoop jobs on Dataproc, and BigQuery queries.
Cloud Scheduler is a fully managed enterprise-grade cron job scheduler: it can trigger standalone jobs on a schedule, but it does not manage dependencies between steps.

Reference:
https://cloud.google.com/scheduler
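
A minimal Composer DAG sketch showing a fixed retry count and mixed step types; the schedule, script path, and query are placeholders, and a Hadoop step (e.g. DataprocSubmitJobOperator) would be wired in the same way:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

default_args = {
    "retries": 3,                          # retry each failed step a fixed number of times
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="nightly_batch",
    schedule_interval="0 1 * * *",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    prepare = BashOperator(
        task_id="prepare_inputs",
        bash_command="bash /home/airflow/gcs/data/prepare.sh",   # placeholder shell step
    )
    load_summary = BigQueryInsertJobOperator(
        task_id="load_summary",
        configuration={"query": {"query": "SELECT 1", "useLegacySql": False}},  # placeholder query
    )
    prepare >> load_summary   # steps execute in a specific order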

Question 125

You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence.
To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc, when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue.

What should you do?
Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory
Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS
Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up
Allocate additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage




Answer is Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS (a provisioning sketch follows the list below).

Local HDFS storage is a good option if:

Your jobs require a lot of metadata operations—for example, you have thousands of partitions and directories, and each file size is relatively small.
You modify the HDFS data frequently or you rename directories. (Cloud Storage objects are immutable, so renaming a directory is an expensive operation because it consists of copying all objects to a new key and deleting them afterwards.)
You heavily use the append operation on HDFS files.
You have workloads that involve heavy I/O. For example, you have a lot of partitioned writes, such as the following:

spark.read().write.partitionBy(...).parquet("gs://")

You have I/O workloads that are especially sensitive to latency. For example, you require single-digit millisecond latency per storage operation.
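
For reference, a sketch of provisioning such a cluster with the google-cloud-dataproc Python client, giving workers large persistent disks (and optionally local SSDs) so the job's intermediate data can live on local HDFS; machine types, sizes, and names are illustrative only:

from google.cloud import dataproc_v1

REGION = "us-central1"  # placeholder
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "cluster_name": "io-heavy-hadoop",   # placeholder
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n2-highmem-8",
            "disk_config": {"boot_disk_size_gb": 1000},
        },
        "worker_config": {
            "num_instances": 4,
            "machine_type_uri": "n2-highmem-8",
            # Large persistent disks (plus optional local SSDs) back the
            # native HDFS used for the job's intermediate data.
            "disk_config": {"boot_disk_size_gb": 2000, "num_local_ssds": 2},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": REGION, "cluster": cluster}
)
operation.result()  # wait for cluster creation

The disk I/O intensive job then writes its intermediate output to hdfs:// paths, while input and final output can stay on gs:// via the Cloud Storage connector.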

Question 126

You need to deploy additional dependencies to all nodes of a Cloud Dataproc cluster at startup using an existing initialization action. Company security policies require that Cloud Dataproc nodes do not have access to the Internet, so public initialization actions cannot fetch resources.

What should you do?
Deploy the Cloud SQL Proxy on the Cloud Dataproc master
Use an SSH tunnel to give the Cloud Dataproc cluster access to the Internet
Copy all dependencies to a Cloud Storage bucket within your VPC security perimeter
Use Resource Manager to add the service account used by the Cloud Dataproc cluster to the Network User role




Answer is Copy all dependencies to a Cloud Storage bucket within your VPC security perimeter

If you create a Dataproc cluster with internal IP addresses only, attempts to access the Internet from an initialization action will fail unless you have configured routes to direct the traffic through a NAT or VPN gateway. Without Internet access, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can then download those dependencies over internal IPs.

Reference:
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
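
As a sketch, staging the dependencies in a Cloud Storage bucket inside the VPC Service Controls perimeter with the google-cloud-storage Python client; the project, bucket, and file names are placeholders:

from google.cloud import storage

client = storage.Client(project="my-project")       # placeholder project
bucket = client.bucket("dataproc-deps-internal")     # bucket within the security perimeter

# Upload each dependency that the existing initialization action needs.
for local_path, object_name in [
    ("deps/connector.jar", "deps/connector.jar"),
    ("deps/install_libs.sh", "init/install_libs.sh"),
]:
    bucket.blob(object_name).upload_from_filename(local_path)

The initialization action can then copy these objects (for example with gsutil cp gs://dataproc-deps-internal/deps/* .), which works over Private Google Access without any public Internet egress.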

Question 127

You work for a mid-sized enterprise that needs to move its operational system transaction data from an on-premises database to GCP. The database is about 20 TB in size.

Which database should you choose?
Cloud SQL
Cloud Bigtable
Cloud Spanner
Cloud Datastore


Question 128

You have data pipelines running on BigQuery, Cloud Dataflow, and Cloud Dataproc. You need to perform health checks, monitor their behavior, and notify the team managing the pipelines if they fail. You also need to be able to work across multiple projects. Your preference is to use managed products or features of the platform.

What should you do?
Export the information to Cloud Stackdriver, and set up an Alerting policy
Run a Virtual Machine in Compute Engine with Airflow, and export the information to Stackdriver
Export the logs to BigQuery, and set up App Engine to read that information and send emails if you find a failure in the logs
Develop an App Engine application to consume logs using GCP API calls, and send emails if you find a failure in the logs




Answer is Export the information to Cloud Stackdriver, and set up an Alerting policy

Cloud Monitoring (formerly Stackdriver) not only gives you access to metrics for Dataflow, Dataproc, and BigQuery, but also lets you create alerting policies and dashboards, so you can chart time series of metrics and be notified when they reach specified values. A single Monitoring workspace can also monitor multiple projects.
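
For illustration, a sketch of creating an alerting policy with the google-cloud-monitoring Python client; the scoping project, metric filter, and threshold are placeholders (the Dataflow job/is_failed metric is just one example signal), and a notification channel would normally be attached so the team is emailed or paged:

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-monitoring-project"   # placeholder scoping project

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Dataflow job failed",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Illustrative filter; adjust to the pipeline metrics you care about.
        filter=(
            'metric.type = "dataflow.googleapis.com/job/is_failed" '
            'AND resource.type = "dataflow_job"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0,
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Pipeline failure alert",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    # notification_channels=["projects/.../notificationChannels/..."],
)

client.create_alert_policy(name=project_name, alert_policy=policy)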

Question 129

Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error.

SELECT person FROM `project1.example.table1` WHERE city = "London"
How would you correct the error?
Add ", UNNEST(person)" before the WHERE clause.
Change "city" to "person.city".
Change "person" to "city.person".
Add ", UNNEST(city)" before the WHERE clause.




Answer is Change "city" to "person.city".

The question is about a nested field, not a repeated one. Nested fields do not need UNNEST; repeated fields do.

WITH table1 AS (
  SELECT STRUCT('Elvis Presley' AS name, 'Buenos Aires' AS city, "town1" AS town) AS person
  UNION ALL
  SELECT STRUCT('Johnny Depp', 'London', 'Paris')
)
SELECT person
FROM table1
WHERE person.city = "London"

Reference:
https://cloud.google.com/bigquery/docs/nested-repeated

Question 130

Which of these statements about exporting data from BigQuery is false?
To export more than 1 GB of data, you need to put a wildcard in the destination filename.
The only supported export destination is Google Cloud Storage.
Data can only be exported in JSON or Avro format.
The only compression option available is GZIP.




Answer is Data can only be exported in JSON or Avro format.

When you export data from BigQuery, note the following:

You cannot export table data to a local file, to Google Sheets, or to Google Drive. The only supported export location is Cloud Storage. For information on saving query results, see Downloading and saving query results.
You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
You cannot export nested and repeated data in CSV format. Nested and repeated data is supported for Avro and JSON exports.
When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
You cannot export data from multiple tables in a single export job.
You cannot choose a compression type other than GZIP when you export data using the Cloud Console or the classic BigQuery web UI.

Reference:
https://cloud.google.com/bigquery/docs/exporting-data#export_limitations
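
As an illustration, an export (extract) job with the google-cloud-bigquery Python client that uses a wildcard destination for more than 1 GB of data and GZIP compression; the project, table, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")    # placeholder project

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,  # CSV and Avro also supported
    compression=bigquery.Compression.GZIP,
)

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",               # placeholder source table
    "gs://my-export-bucket/export/part-*.json.gz",  # wildcard => multiple files
    job_config=job_config,
    location="US",
)
extract_job.result()  # wait for the export to Cloud Storage to finish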
