Professional Data Engineer on Google Cloud Platform


Question 71

You work for a financial institution that lets customers register online. As new customers register, their user data is sent to Pub/Sub before being ingested into BigQuery. For security reasons, you decide to redact your customers' government-issued identification numbers while allowing customer service representatives to view the original values when necessary.

What should you do?
Use BigQuery's built-in AEAD encryption to encrypt the SSN column. Save the keys to a new table that is only viewable by permissioned users.
Use BigQuery column-level security. Set the table permissions so that only members of the Customer Service user group can see the SSN column.
Before loading the data into BigQuery, use Cloud Data Loss Prevention (DLP) to replace input values with a cryptographic hash.
Before loading the data into BigQuery, use Cloud Data Loss Prevention (DLP) to replace input values with a cryptographic format-preserving encryption token.




Answer is Before loading the data into BigQuery, use Cloud Data Loss Prevention (DLP) to replace input values with a cryptographic format-preserving encryption token.

The key reasons are:
- DLP allows redacting sensitive PII like SSNs before loading into BigQuery. This provides security by default for the raw SSN values.
- Using format-preserving encryption keeps the column format intact while still encrypting, allowing business logic relying on SSN format to continue functioning.
- The encrypted tokens can be reversed to view original SSNs when required, meeting the access requirement for customer service reps.

Option A does encrypt the SSN column, but it requires managing the encryption keys separately in another table.
Option B relies on IAM-based column-level security instead of encrypting by default, and it doesn't truly redact the data: the SSN values are still stored in BigQuery, even if hidden from unauthorized users, so a security breach could expose them.
Option C hashes the values irreversibly, preventing customer service reps from viewing the original SSNs when required.
Therefore, using DLP format-preserving encryption before BigQuery ingestion balances both security and analytics requirements for SSN data.
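
As an illustration of what that DLP call can look like, here is a minimal sketch using the google-cloud-dlp Python client; the project ID, the KMS-wrapped key, and the surrogate info type name are placeholders, and the request shape should be verified against the DLP API reference.

```python
from google.cloud import dlp_v2

# Placeholders for illustration only.
PROJECT = "my-project"
WRAPPED_KEY = "CiQA..."  # data key wrapped with Cloud KMS
KMS_KEY_NAME = "projects/my-project/locations/global/keyRings/dlp/cryptoKeys/ssn-key"

dlp = dlp_v2.DlpServiceClient()

deidentify_config = {
    "info_type_transformations": {
        "transformations": [{
            "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
            "primitive_transformation": {
                "crypto_replace_ffx_fpe_config": {
                    # Format-preserving: the token keeps the length and
                    # character set of the input and can be re-identified
                    # later with the same key.
                    "crypto_key": {
                        "kms_wrapped": {
                            "wrapped_key": WRAPPED_KEY,
                            "crypto_key_name": KMS_KEY_NAME,
                        }
                    },
                    "common_alphabet": "NUMERIC",
                    "surrogate_info_type": {"name": "SSN_TOKEN"},
                }
            },
        }]
    }
}

response = dlp.deidentify_content(
    request={
        "parent": f"projects/{PROJECT}",
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}]},
        "item": {"value": "Customer SSN: 123456789"},
    }
)
print(response.item.value)  # the SSN is replaced by a format-preserving token
```

In the pipeline, this de-identification step would run on the Pub/Sub messages before they are written to BigQuery, and only permissioned services holding the key could re-identify the tokens.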

Reference:
https://cloud.google.com/dlp/docs/classification-redaction
https://cloud.google.com/dlp/docs/transformations-reference

Question 72

You need to deploy additional dependencies to all nodes of a Cloud Dataproc cluster at startup using an existing initialization action.
Company security policies require that Cloud Dataproc nodes do not have access to the Internet, so public initialization actions cannot fetch resources.

What should you do?
Deploy the Cloud SQL Proxy on the Cloud Dataproc master
Use an SSH tunnel to give the Cloud Dataproc cluster access to the Internet
Copy all dependencies to a Cloud Storage bucket within your VPC security perimeter
Use Resource Manager to add the service account used by the Cloud Dataproc cluster to the Network User role




Answer is Copy all dependencies to a Cloud Storage bucket within your VPC security perimeter

If you create a Dataproc cluster with internal IP addresses only, attempts to access the Internet in an initialization action will fail unless you have configured routes to direct the traffic through a NAT or a VPN gateway. Without Internet access, you can instead enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can then download the dependencies from Cloud Storage over internal IPs.
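
For illustration, a cluster like this could be created with the google-cloud-dataproc Python client roughly as follows; the project, subnet, bucket, and script names are hypothetical, and the subnet is assumed to have Private Google Access enabled.

```python
from google.cloud import dataproc_v1

# Hypothetical names for illustration.
PROJECT = "my-project"
REGION = "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": "no-internet-cluster",
    "config": {
        "gce_cluster_config": {
            # Internal IPs only: nodes have no route to the public Internet.
            # The subnet is assumed to have Private Google Access enabled.
            "internal_ip_only": True,
            "subnetwork_uri": f"projects/{PROJECT}/regions/{REGION}/subnetworks/private-subnet",
        },
        # The init action and the dependencies it installs live in a bucket
        # inside the VPC Service Controls perimeter, reachable over private
        # Google APIs access rather than the Internet.
        "initialization_actions": [
            {"executable_file": "gs://my-private-bucket/init/install-deps.sh"}
        ],
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
)
operation.result()
```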

Reference:
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions

Question 73

You want to rebuild your batch pipeline for structured data on Google Cloud. You are using PySpark to conduct data transformations at scale, but your pipelines are taking over twelve hours to run. To expedite development and pipeline run time, you want to use a serverless tool and SQL syntax.
You have already moved your raw data into Cloud Storage.

How should you build the pipeline on Google Cloud while meeting speed and processing requirements?
Convert your PySpark commands into SparkSQL queries to transform the data, and then run your pipeline on Dataproc to write the data into BigQuery.
Ingest your data into Cloud SQL, convert your PySpark commands into SparkSQL queries to transform the data, and then use federated queries from BigQuery for machine learning.
Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.
Use Apache Beam Python SDK to build the transformation pipelines, and write the data into BigQuery.




Answer is Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.

BigQuery SQL provides a fast, scalable, and serverless method for transforming structured data, and is generally easier to develop with than PySpark.
Directly ingesting the raw Cloud Storage data into BigQuery avoids needing an intermediate processing cluster like Dataproc.
Transforming the data via BigQuery SQL queries will be faster than PySpark, especially since the data is already loaded into BigQuery.
Writing the transformed results to a new BigQuery table keeps the original raw data intact and provides a clean output.
So migrating to BigQuery SQL for transformations provides a fully managed serverless architecture that can significantly expedite development and reduce pipeline runtime versus PySpark. The ability to avoid clusters and conduct transformations completely within BigQuery is the most efficient approach here.
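
A minimal sketch of that flow with the google-cloud-bigquery Python client is shown below; the bucket, dataset, and table names are invented, and the SQL stands in for whatever the real PySpark transformations compute.

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1. Load the raw files from Cloud Storage into a staging table.
load_job = client.load_table_from_uri(
    "gs://my-raw-bucket/orders/*.csv",
    "my-project.analytics.orders_raw",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        skip_leading_rows=1,
    ),
)
load_job.result()

# 2. Express the former PySpark transformations as SQL and write the
#    result to a new table, leaving the raw table intact.
transform_job = client.query(
    """
    SELECT customer_id,
           DATE(order_ts) AS order_date,
           SUM(amount)    AS daily_total
    FROM `my-project.analytics.orders_raw`
    GROUP BY customer_id, order_date
    """,
    job_config=bigquery.QueryJobConfig(
        destination="my-project.analytics.orders_daily",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
transform_job.result()
```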

Reference:
https://cloud.google.com/dataproc-serverless/docs/overview

Question 74

You are building a real-time prediction engine that streams files, which may contain PII (personal identifiable information) data, into Cloud Storage and eventually into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys.

How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?
Create a pseudonym by replacing the PII data with cryptographic tokens, and store the non-tokenized data in a locked-down bucket.
Redact all PII data, and store a version of the unredacted data in a locked-down bucket.
Scan every table in BigQuery, and mask the data it finds that has PII.
Create a pseudonym by replacing PII data with a cryptographic format-preserving token.




Answer is Create a pseudonym by replacing PII data with a cryptographic format-preserving token.

Format preserving encryption: An input value is replaced with a value that has been encrypted using the FPE-FFX encryption algorithm with a cryptographic key, and then prepended with a surrogate annotation, if specified. By design, both the character set and the length of the input value are preserved in the output value. Encrypted values can be re-identified using the original cryptographic key and the entire output value, including surrogate annotation.
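
Because FPE-FFX is deterministic for a given key, the same name or email always produces the same token, which is what preserves referential integrity across join keys. A sketch of just the de-identify transformation config is below (key names are placeholders; the surrounding deidentify_content call would look like the one sketched for Question 71). Note that FPE requires every character of the input to come from the chosen alphabet, so free-form names and emails may need a custom alphabet, or DLP's deterministic-encryption transformation, in practice.

```python
# Replace EMAIL_ADDRESS and PERSON_NAME findings with deterministic,
# format-preserving tokens so joins on these columns still line up.
deidentify_config = {
    "info_type_transformations": {
        "transformations": [{
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PERSON_NAME"}],
            "primitive_transformation": {
                "crypto_replace_ffx_fpe_config": {
                    "crypto_key": {
                        "kms_wrapped": {
                            "wrapped_key": "CiQA...",  # placeholder wrapped key
                            "crypto_key_name": "projects/my-project/locations/global/keyRings/dlp/cryptoKeys/pii-key",
                        }
                    },
                    "common_alphabet": "ALPHA_NUMERIC",
                    # The surrogate annotation lets authorized jobs
                    # re-identify the values later with the same key.
                    "surrogate_info_type": {"name": "PII_TOKEN"},
                }
            },
        }]
    }
}
```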

Reference:
https://cloud.google.com/dlp/docs/pseudonymization#supported-methods

Question 75

You need to give new website users a globally unique identifier (GUID) using a service that takes in data points and returns a GUID. This data is sourced from both internal and external systems via HTTP calls that you will make via microservices within your pipeline.
There will be tens of thousands of messages per second, and the calls can be multi-threaded. You are worried about backpressure on the system.

How should you design your pipeline to minimize that backpressure?
Call out to the service via HTTP.
Create the pipeline statically in the class definition.
Create a new object in the startBundle method of DoFn.
Batch the job into ten-second increments.




Answer is Batch the job into ten-second increments.

Option D is the best approach to minimize backpressure in this scenario. By batching the jobs into 10-second increments, you can throttle the rate at which requests are made to the external GUID service. This prevents too many simultaneous requests from overloading the service.

Option A would not help with backpressure since it just makes synchronous HTTP requests as messages arrive. Similarly, options B and C don't provide any inherent batching or throttling mechanism.

Batching into time windows is a common strategy in stream processing to deal with high velocity data. The 10-second windows allow some buffering to happen, rather than making a call immediately for each message. This provides a natural throttling that can be tuned based on the external service's capacity.

To design a pipeline that minimizes backpressure, especially when dealing with tens of thousands of messages per second in a multi-threaded environment, it's important to consider how each option affects system performance and scalability. Let's examine each of your options:

A. Call out to the service via HTTP: Making HTTP calls to an external service for each message can introduce significant latency and backpressure, especially at high throughput. This is due to the overhead of establishing a connection, waiting for the response, and handling potential network delays or failures.

B. Create the pipeline statically in the class definition: While this approach can improve initialization time and reduce overhead during execution, it doesn't directly address the issue of backpressure caused by high message throughput.

C. Create a new object in the startBundle method of DoFn: This approach is typically used in Apache Beam to initialize resources before processing a bundle of elements. While it can optimize resource usage and performance within each bundle, it doesn't inherently solve the backpressure issue caused by high message rates.

D. Batch the job into ten-second increments: Batching messages can be an effective way to reduce backpressure. By grouping multiple messages into larger batches, you can reduce the frequency of external calls and distribute the processing load more evenly over time. This can lead to more efficient use of resources and potentially lower latency, as the system spends less time waiting on external services.

Given these considerations, option D (Batch the job into ten-second increments) seems to be the most effective strategy for minimizing backpressure in your scenario. By batching messages, you can reduce the strain on your pipeline and external services, making the system more resilient and scalable under high load. However, the exact batch size and interval should be fine-tuned based on the specific characteristics of your workload and the capabilities of the external systems you are interacting with.

Additionally, it's important to consider other strategies in conjunction with batching, such as implementing efficient error handling, load balancing, and potentially using asynchronous I/O for external HTTP calls to further optimize performance and minimize backpressure.
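
A rough Apache Beam (Python SDK) sketch of this batching idea is below; the Pub/Sub topics, the batch endpoint URL, and the batch sizes are all hypothetical, and the window and batch parameters would need tuning against the real GUID service's capacity.

```python
import json

import apache_beam as beam
import requests
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions


class AssignGuids(beam.DoFn):
    """Calls a (hypothetical) batch endpoint once per batch instead of once per element."""

    def process(self, batch):
        resp = requests.post(
            "https://guid-service.example.com/batch",  # placeholder URL
            json={"records": batch},
            timeout=30,
        )
        resp.raise_for_status()
        for record, guid in zip(batch, resp.json()["guids"]):
            record["guid"] = guid
            yield record


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/new-users")
        | "Parse" >> beam.Map(json.loads)
        | "Window10s" >> beam.WindowInto(window.FixedWindows(10))
        | "Batch" >> beam.BatchElements(min_batch_size=50, max_batch_size=500)
        | "AssignGuids" >> beam.ParDo(AssignGuids())
        | "Serialize" >> beam.Map(lambda r: json.dumps(r).encode("utf-8"))
        | "Write" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/users-with-guid")
    )
```

Grouping elements into fixed windows and batches means one HTTP call covers many messages, which is exactly the throttling effect the answer relies on.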

Question 76

You are migrating your data warehouse to Google Cloud and decommissioning your on-premises data center. Because this is a priority for your company, you know that bandwidth will be made available for the initial data load to the cloud. The files being transferred are not large in number, but each file is 90 GB.
Additionally, you want your transactional systems to continually update the warehouse on Google Cloud in real time.

What tools should you use to migrate the data and ensure that it continues to write to your warehouse?
Storage Transfer Service for the migration; Pub/Sub and Cloud Data Fusion for the real-time updates
BigQuery Data Transfer Service for the migration; Pub/Sub and Dataproc for the real-time updates
gsutil for the migration; Pub/Sub and Dataflow for the real-time updates
gsutil for both the migration and the real-time updates




Answer is gsutil for the migration; Pub/Sub and Dataflow for the real-time updates

Option C is the best choice given the large file sizes for the initial migration and the need for real-time updates after migration.

Specifically:
gsutil can transfer large files in parallel over multiple TCP connections to maximize bandwidth. This works well for the 90GB files during initial migration. Pub/Sub allows real-time messaging of updates that can then be streamed into Cloud Dataflow. Dataflow provides scalable stream processing to handle transforming and writing those updates into BigQuery or other sinks.

Option A is incorrect because Storage Transfer Service is better for scheduled batch transfers, not ad hoc large migrations.
Option B is incorrect because BigQuery Data Transfer Service is more focused on scheduled replication jobs, not ad hoc migrations.
Option D would not work well for real-time updates after migration is complete.
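
For the real-time half, a skeletal Beam (Python SDK) streaming pipeline along these lines could read the change events from Pub/Sub and stream them into BigQuery; the subscription, table, and schema below are invented for illustration.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadChanges" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/warehouse-updates")
        | "Parse" >> beam.Map(json.loads)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:warehouse.transactions",
            schema="txn_id:STRING,customer_id:STRING,amount:NUMERIC,updated_at:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```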

Reference:
https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#gsutil_for_smaller_transfers_of_on-premises_data

Question 77

You work for a shipping company that uses handheld scanners to read shipping labels. Your company has strict data privacy standards that require scanners to only transmit tracking numbers when events are sent to Kafka topics.
A recent software update caused the scanners to accidentally transmit recipients' personally identifiable information (PII) to analytics systems, which violates user privacy rules. You want to quickly build a scalable solution using cloud-native managed services to prevent exposure of PII to the analytics systems.

What should you do?
Create an authorized view in BigQuery to restrict access to tables with sensitive data.
Install a third-party data validation tool on Compute Engine virtual machines to check the incoming data for sensitive information.
Use Cloud Logging to analyze the data passed through the total pipeline to identify transactions that may contain sensitive information.
Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention (Cloud DLP) API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.




Answer is Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention (Cloud DLP) API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.

Cloud DLP is the cloud-native managed service for detecting and classifying PII, so it is the right fit here. A Cloud Function that reads the topics and calls the DLP API can inspect each message; the infoTypes it tags and their confidence (likelihood) levels determine whether a message is passed through to analytics or quarantined in a Cloud Storage bucket for review.
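
A rough sketch of the inspection step such a function might perform is shown below; the project ID, info types, likelihood threshold, and quarantine bucket are placeholders, and the Kafka-consumption and pass-through plumbing is omitted.

```python
from google.cloud import dlp_v2, storage

dlp = dlp_v2.DlpServiceClient()
gcs = storage.Client()

PROJECT = "my-project"                # placeholder
QUARANTINE_BUCKET = "pii-quarantine"  # placeholder
PII_INFO_TYPES = [{"name": "PERSON_NAME"}, {"name": "EMAIL_ADDRESS"},
                  {"name": "STREET_ADDRESS"}, {"name": "PHONE_NUMBER"}]


def route_message(message_id: str, payload: str) -> bool:
    """Returns True if the message is clean; otherwise quarantines it for review."""
    response = dlp.inspect_content(
        request={
            "parent": f"projects/{PROJECT}",
            "inspect_config": {
                "info_types": PII_INFO_TYPES,
                # Only act on reasonably confident findings.
                "min_likelihood": dlp_v2.Likelihood.LIKELY,
            },
            "item": {"value": payload},
        }
    )
    if response.result.findings:
        blob = gcs.bucket(QUARANTINE_BUCKET).blob(f"quarantine/{message_id}.json")
        blob.upload_from_string(payload)
        return False
    return True
```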

Question 78

You want to migrate an on-premises Hadoop system to Cloud Dataproc. Hive is the primary tool in use, and the data format is Optimized Row Columnar (ORC).
All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster's local Hadoop Distributed File System (HDFS) to maximize performance.

What are two ways to start using Hive in Cloud Dataproc? (Choose two.)
Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally.
Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.
Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.
Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate external Hive tables to the native ones.
Load the ORC files into BigQuery. Leverage BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate external Hive tables to the native ones.




Answers are:
C. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.
D. Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate external Hive tables to the native ones.


Even though the question says to transfer all the files, C is the best choice among the options provided.
Explanation:
A and B do not work as written: gsutil can only copy the data to a node (such as the master); a separate Hadoop step is still needed to move it into HDFS.
C works.
D works and is the approach recommended by Google: use the Cloud Storage connector for Hadoop to mount the ORC files stored in Cloud Storage as external Hive tables, which lets you query the data without copying it to HDFS, then replicate the external Hive tables to native tables in Cloud Dataproc if needed (see the sketch below).
E would work, but the question asks to maximize performance and this path does not: the BigQuery connector for Hadoop stages BigQuery data in Cloud Storage as a temporary copy before processing it into HDFS. Since the data is already in Cloud Storage, there is no reason to load it into BigQuery only for the connector to unload it back to Cloud Storage.
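
As a sketch of option D, the two Hive statements can be submitted to the cluster with the google-cloud-dataproc Python client; the cluster, bucket, and column names are hypothetical.

```python
from google.cloud import dataproc_v1

# Placeholders throughout.
PROJECT = "my-project"
REGION = "us-central1"
CLUSTER = "hive-migration-cluster"

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

hive_queries = [
    # External table over the ORC files already sitting in Cloud Storage.
    """
    CREATE EXTERNAL TABLE IF NOT EXISTS orders_ext (order_id STRING, status STRING)
    STORED AS ORC
    LOCATION 'gs://my-orc-bucket/orders/'
    """,
    # Native table: copies the data into the cluster's HDFS warehouse.
    "CREATE TABLE IF NOT EXISTS orders STORED AS ORC AS SELECT * FROM orders_ext",
]

job = {
    "placement": {"cluster_name": CLUSTER},
    "hive_job": {"query_list": {"queries": hive_queries}},
}

result = client.submit_job(
    request={"project_id": PROJECT, "region": REGION, "job": job}
)
print(f"Submitted Hive job {result.reference.job_id}")
```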

Reference:
https://cloud.google.com/blog/products/data-analytics/new-release-of-cloud-storage-connector-for-hadoop-improving-performance-throughput-and-more

Question 79

You are building a report-only data warehouse where the data is streamed into BigQuery via the streaming API. Following Google's best practices, you have both a staging and a production table for the data.

How should you design your data loading to ensure that there is only one master dataset without affecting performance on either the ingestion or reporting pieces?
Have a staging table that is an append-only model, and then update the production table every three hours with the changes written to staging.
Have a staging table that is an append-only model, and then update the production table every ninety minutes with the changes written to staging.
Have a staging table that moves the staged data over to the production table and deletes the contents of the staging table every three hours.
Have a staging table that moves the staged data over to the production table and deletes the contents of the staging table every thirty minutes.




Answer is Have a staging table that moves the staged data over to the production table and deletes the contents of the staging table every three hours.

Following common extract, transform, load (ETL) best practices, we used a staging table and a separate production table so that we could load data into the staging table without impacting users of the data. The design we created based on ETL best practices called for first deleting all the records from the staging table, loading the staging table, and then replacing the production table with the contents.

When using the streaming API, the BigQuery streaming buffer remains active for roughly 30 to 60 minutes or more after rows are inserted, which means you can't delete or change that data during this window. Since we used the streaming API, we scheduled the load every three hours to balance getting data into BigQuery quickly and being able to subsequently delete the data from the staging table during the load process.
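
A minimal sketch of the three-hour move using the google-cloud-bigquery client is below; the dataset and table names are hypothetical, and a production job would bound both statements by an ingestion timestamp so rows arriving mid-run are not lost.

```python
from google.cloud import bigquery

client = bigquery.Client()

STAGING = "my-project.warehouse.events_staging"  # placeholder names
PRODUCTION = "my-project.warehouse.events"

# Append everything currently in staging to the production table.
client.query(f"INSERT INTO `{PRODUCTION}` SELECT * FROM `{STAGING}`").result()

# Then clear staging. Rows still in the streaming buffer cannot be deleted,
# which is why the job runs every three hours instead of continuously.
client.query(f"DELETE FROM `{STAGING}` WHERE TRUE").result()
```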

Reference:
https://cloud.google.com/blog/products/data-analytics/moving-a-publishing-workflow-to-bigquery-for-new-data-insights

Question 80

Your new customer has requested daily reports that show their net consumption of Google Cloud compute resources and who used the resources. You need to quickly and efficiently generate these daily reports.

What should you do?
Do daily exports of Cloud Logging data to BigQuery. Create views filtering by project, log type, resource, and user.
Filter data in Cloud Logging by project, resource, and user; then export the data in CSV format.
Filter data in Cloud Logging by project, log type, resource, and user, then import the data into BigQuery.
Export Cloud Logging data to Cloud Storage in CSV format. Cleanse the data using Dataprep, filtering by project, resource, and user.




Answer is Do daily exports of Cloud Logging data to BigQuery. Create views filtering by project, log type, resource, and user.

You cannot export custom or filtered criteria into BigQuery. The exported tables have a fixed schema (there are three types of Cloud Billing data tables, and log exports likewise land with a fixed schema), so the breakdown by project, log type, resource, and user must be drilled down further via BigQuery views.
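
Assuming an aggregated sink that exports audit logs to a BigQuery dataset (the dataset and table names below are placeholders, and the column paths follow the BigQuery audit-log export schema), the daily view could be created along these lines.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Dataset/table names are placeholders; the source table is created by the
# Cloud Logging sink that exports audit logs to BigQuery.
client.query(
    """
    CREATE OR REPLACE VIEW `my-project.log_reports.daily_compute_usage` AS
    SELECT
      resource.labels.project_id                              AS project_id,
      logName                                                 AS log_type,
      resource.type                                           AS resource_type,
      protopayload_auditlog.authenticationInfo.principalEmail AS user_email,
      COUNT(*)                                                AS operations
    FROM `my-project.logs_export.cloudaudit_googleapis_com_activity`
    WHERE DATE(timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
      AND resource.type = 'gce_instance'
    GROUP BY project_id, log_type, resource_type, user_email
    """
).result()
```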

Reference:
https://cloud.google.com/billing/docs/how-to/export-data-bigquery#setup
