Professional Data Engineer on Google Cloud Platform
278 questions in total
Question 211
You are implementing security best practices on your data pipeline. Currently, you are manually executing jobs as the Project Owner. You want to automate these jobs by taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud Dataproc cluster, and depositing the results into Google BigQuery.
How should you securely run this workload?
Restrict the Google Cloud Storage bucket so only you can see the files
Grant the Project Owner role to a service account, and run the job with it
Use a service account with the ability to read the batch files and to write to BigQuery
Use a user account with the Project Viewer role on the Cloud Dataproc cluster to read the batch files and write to BigQuery
Answer is Use a service account with the ability to read the batch files and to write to BigQuery
A service account should be used to read the batch files from Cloud Storage and to write the transformed results to BigQuery. The Dataproc cluster runs the Spark job as that service account, so the job inherits exactly the permissions it needs (least privilege) instead of the overly broad Project Owner role.
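To make the least-privilege idea concrete, here is a minimal Python sketch (bucket, dataset, and key-file names are hypothetical; in practice the Dataproc cluster itself would run as this service account rather than a standalone script). The account only needs read access on the bucket and load/write access on BigQuery:

# Illustrative sketch: a job authenticated as a dedicated service account that can
# only read the batch files and write the results to BigQuery.
from google.oauth2 import service_account
from google.cloud import storage, bigquery

# Hypothetical key file for a service account holding roles/storage.objectViewer on
# the bucket plus roles/bigquery.dataEditor and roles/bigquery.jobUser for loading.
creds = service_account.Credentials.from_service_account_file("etl-sa-key.json")

gcs = storage.Client(credentials=creds, project="my-project")
for blob in gcs.list_blobs("nightly-batches", prefix="2024-01-01/"):
    print("will process", blob.name)           # read access to the batch files

bq = bigquery.Client(credentials=creds, project="my-project")
bq.load_table_from_uri(                         # write access to the results table
    "gs://nightly-batches/processed/2024-01-01/*.avro",
    "my-project.analytics.daily_results",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
).result()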
Question 212
Your company receives both batch- and stream-based event data. You want to process the data using Google Cloud Dataflow over a predictable time period.
However, you realize that in some instances data can arrive late or out of order.
How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?
Set a single global window to capture all the data.
Set sliding windows to capture all the lagged data.
Use watermarks and timestamps to capture the lagged data.
Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.
Answer is Use watermarks and timestamps to capture the lagged data.
A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. If new data arrives with a timestamp that's in the window but older than the watermark, the data is considered late data.
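A minimal Apache Beam (Python) sketch of this idea, assuming the events carry an event-time field; the topic, field names, window size, and lateness bound are illustrative only:

# Illustrative sketch: event-time windows with a watermark-driven trigger and
# allowed lateness so late or out-of-order records are still captured.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows, TimestampedValue
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

def stamp(event):
    # Attach the event's own timestamp so windowing uses event time, not arrival time.
    return TimestampedValue(event, event["event_ts"])

opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Stamp" >> beam.Map(stamp)
     | "Window" >> beam.WindowInto(
           FixedWindows(60),                              # 1-minute event-time windows
           trigger=AfterWatermark(late=AfterCount(1)),    # re-emit a pane when late data arrives
           allowed_lateness=600,                          # keep windows open 10 min past the watermark
           accumulation_mode=AccumulationMode.ACCUMULATING)
     | "KeyByType" >> beam.Map(lambda e: (e["type"], 1))
     | "CountPerType" >> beam.CombinePerKey(sum))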
Question 213
You set up a streaming data insert into a Redis cluster via a Kafka cluster. Both clusters are running on Compute Engine instances. You need to encrypt data at rest with encryption keys that you can create, rotate, and destroy as needed.
What should you do?
Create a dedicated service account, and use encryption at rest to reference your data stored in your Compute Engine cluster instances as part of your API service calls.
Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
Create encryption keys locally. Upload your encryption keys to Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
Create encryption keys in Cloud Key Management Service. Reference those keys in your API service calls when accessing the data in your Compute Engine cluster instances.
Answer is Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
Cloud KMS lets you create, rotate, and destroy encryption keys on your own schedule, and those keys can then be used to encrypt the data at rest on the Compute Engine instances.
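A minimal sketch of that key lifecycle with the google-cloud-kms Python client (project, location, key ring, and key names are placeholders, and the key ring is assumed to already exist); the application then uses the key to encrypt data before writing it to the instances' disks:

# Illustrative sketch: create a key in Cloud KMS, encrypt data with it, and rotate
# key versions as needed. Resource names are placeholders.
from google.cloud import kms

client = kms.KeyManagementServiceClient()
key_ring = client.key_ring_path("my-project", "us-central1", "streaming-keys")

# One-time setup: create a symmetric encryption key in the existing key ring.
key = client.create_crypto_key(
    request={
        "parent": key_ring,
        "crypto_key_id": "redis-at-rest",
        "crypto_key": {"purpose": kms.CryptoKey.CryptoKeyPurpose.ENCRYPT_DECRYPT},
    }
)

# Encrypt data before it is written to the Compute Engine instances' disks.
ciphertext = client.encrypt(
    request={"name": key.name, "plaintext": b"sensitive payload"}
).ciphertext

# Rotate: create a new key version and promote it to primary (or configure an
# automatic rotation schedule); old versions can be destroyed when no longer needed.
new_version = client.create_crypto_key_version(
    request={"parent": key.name, "crypto_key_version": {}}
)
client.update_crypto_key_primary_version(
    request={"name": key.name, "crypto_key_version_id": new_version.name.split("/")[-1]}
)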
Question 214
An organization maintains a Google BigQuery dataset that contains tables with user-level data. They want to expose aggregates of this data to other Google Cloud projects, while still controlling access to the user-level data. Additionally, they need to minimize their overall storage cost and ensure the analysis cost for other projects is assigned to those projects.
What should they do?
Create and share an authorized view that provides the aggregate results.
Create and share a new dataset and view that provides the aggregate results.
Create and share a new dataset and table that contains the aggregate results.
Create dataViewer Identity and Access Management (IAM) roles on the dataset to enable sharing.
Answer is Create and share a new dataset and view that provides the aggregate results.
To control access with a view, the view must live in a separate dataset from the source data and be authorized against the source dataset. Sharing only that dataset exposes the aggregate results without exposing the user-level tables, stores no extra copy of the data, and bills query costs to the projects that run the queries.
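A sketch of the sharing pattern with the BigQuery Python client (project, dataset, and column names are hypothetical): the aggregate view lives in its own shareable dataset and is authorized against the private source dataset:

# Illustrative sketch: put an aggregate view in a separate, shareable dataset and
# authorize it against the private dataset holding the user-level tables.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# View in the shared dataset, exposing only aggregates (names are hypothetical).
view = bigquery.Table("my-project.shared_aggregates.daily_totals")
view.view_query = """
    SELECT country, DATE(event_ts) AS day, COUNT(*) AS events
    FROM `my-project.private_userdata.events`
    GROUP BY country, day
"""
view = client.create_table(view)

# Authorize the view to read the private dataset without granting users access to it.
source = client.get_dataset("my-project.private_userdata")
entries = list(source.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source.access_entries = entries
client.update_dataset(source, ["access_entries"])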
Question 215
Your neural network model is taking days to train. You want to increase the training speed.
What can you do?
Subsample your test dataset.
Subsample your training dataset.
Increase the number of input features to your model.
Increase the number of layers in your neural network.
Answer is Subsample your training dataset.
Subsampling the training dataset reduces the amount of data processed in each epoch, which shortens training time. Subsampling the test set has no effect on training speed, while adding input features or layers increases the amount of computation.
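For illustration, a simple way to subsample a training set (the fraction and array names are arbitrary):

# Illustrative sketch: take a random 20% subsample of the training data so each
# epoch processes far fewer examples, at the cost of some model accuracy.
import numpy as np

rng = np.random.default_rng(seed=42)

def subsample(X_train, y_train, fraction=0.2):
    n = X_train.shape[0]
    idx = rng.choice(n, size=int(n * fraction), replace=False)
    return X_train[idx], y_train[idx]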
Question 216
Your company maintains a hybrid deployment with GCP, where analytics are performed on your anonymized customer data. The data are imported to Cloud Storage from your data center through parallel uploads to a data transfer server running on GCP. Management informs you that the daily transfers take too long and have asked you to fix the problem. You want to maximize transfer speeds. Which action should you take?
Increase the CPU size on your server.
Increase the size of the Google Persistent Disk on your server.
Increase your network bandwidth from your datacenter to GCP.
Increase your network bandwidth from Compute Engine to Cloud Storage.
Answer is Increase your network bandwidth from your datacenter to GCP.
Transfer speed is limited by the network bandwidth between the data center and GCP, which is the link the parallel uploads must cross.
Few things in computing highlight the hardware limitations of networks as much as transferring large amounts of data. Typically you can transfer 1 GB in eight seconds over a 1 Gbps network. If you scale that up to a huge dataset (for example, 100 TB), the transfer time is around 12 days. Transferring huge datasets can test the limits of your infrastructure and potentially cause problems for your business.
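The arithmetic behind those figures, as a quick back-of-the-envelope calculation (ideal throughput only; real transfers are slower because of protocol overhead and link sharing, which is where the ~12-day figure comes from):

# Back-of-the-envelope transfer time: ideal (no overhead) time to move a dataset
# over a given link. 1 GB over 1 Gbps ~ 8 s; 100 TB over 1 Gbps ~ 9-12 days in practice.
def transfer_time_seconds(size_bytes, bandwidth_bits_per_sec):
    return size_bytes * 8 / bandwidth_bits_per_sec

one_gbps = 1e9
print(transfer_time_seconds(1e9, one_gbps))              # 1 GB   -> 8 seconds
print(transfer_time_seconds(100e12, one_gbps) / 86400)   # 100 TB -> ~9.3 days (ideal)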
Question 217
You've migrated a Hadoop job from an on-premises cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the input data are Parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. You need to keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with only 2 non-preemptible workers) for this workload.
What should you do?
Increase the size of your parquet files to ensure them to be 1 GB minimum.
Switch to TFRecords formats (appr. 200MB per file) instead of parquet files.
Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job and copy results back to GCS.
Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.
Answer is Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.
From GCP Documentation:
1) "As a default, preemptible VMs are created with a smaller boot disk size, and you might want to override this configuration if you are running shuffle-heavy workloads"
2) If you perform many shuffling operations or partitioned writes, switch to SSDs to boost performance.
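A hedged sketch of what that cluster configuration might look like with the google-cloud-dataproc Python client (project, region, machine types, and disk sizes are placeholders; the field names follow the Dataproc REST API):

# Illustrative sketch: preemptible secondary workers with SSD boot disks large
# enough for shuffle data, overriding the small default boot disk size.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "cluster_name": "shuffle-heavy-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-8"},
        "worker_config": {
            "num_instances": 2,  # the 2 non-preemptible workers
            "machine_type_uri": "n1-standard-8",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 1000},
        },
        "secondary_worker_config": {
            "num_instances": 10,
            "preemptibility": "PREEMPTIBLE",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 1000},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()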
Question 218
You have a data pipeline that writes data to Cloud Bigtable using well-designed row keys. You want to monitor your pipeline to determine when to increase the size of your Cloud Bigtable cluster.
Which two actions can you take to accomplish this? (Choose two.)
Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Read pressure index is above 100.
Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Write pressure index is above 100.
Monitor the latency of write operations. Increase the size of the Cloud Bigtable cluster when there is a sustained increase in write latency.
Monitor storage utilization. Increase the size of the Cloud Bigtable cluster when utilization increases above 70% of max capacity.
Monitor latency of read operations. Increase the size of the Cloud Bigtable cluster if read operations take longer than 100 ms.
Answers are:
C. Monitor the latency of write operations. Increase the size of the Cloud Bigtable cluster when there is a sustained increase in write latency.
D. Monitor storage utilization. Increase the size of the Cloud Bigtable cluster when utilization increases above 70% of max capacity.
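A sketch of how those two signals could be pulled from Cloud Monitoring with the Python client (the project ID and time range are placeholders; the metric types used here are the standard Bigtable server-latency and cluster storage-utilization metrics):

# Illustrative sketch: read Bigtable write latency and storage utilization from
# Cloud Monitoring to decide when to add nodes to the cluster.
import time
from google.cloud import monitoring_v3

project_name = "projects/my-project"
client = monitoring_v3.MetricServiceClient()

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now)}, "start_time": {"seconds": int(now - 3600)}}
)

filters = [
    # Sustained increase in write latency -> scale up.
    'metric.type = "bigtable.googleapis.com/server/latencies" AND metric.labels.method : "MutateRow"',
    # Storage utilization approaching ~70% of capacity -> scale up.
    'metric.type = "bigtable.googleapis.com/cluster/storage_utilization"',
]

for filt in filters:
    series = client.list_time_series(
        request={
            "name": project_name,
            "filter": filt,
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for ts in series:
        print(ts.metric.type, ts.points[0].value)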
Question 219
You have a query that filters a BigQuery table using a WHERE clause on timestamp and ID columns. By using bq query --dry_run you learn that the query triggers a full scan of the table, even though the filter on timestamp and ID selects a tiny fraction of the overall data. You want to reduce the amount of data scanned by BigQuery with minimal changes to existing SQL queries.
What should you do?
Create a separate table for each ID.
Use the LIMIT keyword to reduce the number of rows returned.
Recreate the table with a partitioning column and clustering column.
Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.
Answer is Recreate the table with a partitioning column and clustering column.
The LIMIT keyword is applied only after the query has run, so it limits the rows returned but not the data scanned; a full table scan would still occur. Recreating the table partitioned on the timestamp column and clustered on the ID column lets BigQuery prune the data read by the existing WHERE clause, with no changes to the SQL itself.
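A sketch of recreating the table with the BigQuery Python client (table and column names are hypothetical); the existing filters on the timestamp and ID columns then prune partitions and clustered blocks automatically:

# Illustrative sketch: recreate the table partitioned on the timestamp column and
# clustered on the ID column; existing WHERE clauses need no changes.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
CREATE TABLE `my-project.analytics.events_partitioned`
PARTITION BY DATE(event_ts)
CLUSTER BY id
AS
SELECT * FROM `my-project.analytics.events`
"""
client.query(ddl).result()

# The same filter now scans only the matching partitions/blocks:
sql = """
SELECT *
FROM `my-project.analytics.events_partitioned`
WHERE event_ts BETWEEN '2024-01-01' AND '2024-01-02'
  AND id = 'abc123'
"""
job = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
print(job.total_bytes_processed)  # far smaller than a full-table scan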
Question 220
You have a requirement to insert minute-resolution data from 50,000 sensors into a BigQuery table. You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends.
What should you do?
Use bq load to load a batch of sensor data every 60 seconds.
Use a Cloud Dataflow pipeline to stream data into the BigQuery table.
Use the INSERT statement to insert a batch of data every 60 seconds.
Use the MERGE statement to apply updates in batch every 60 seconds.
Answer is Use a Cloud Dataflow pipeline to stream data into the BigQuery table.
Streaming inserts through a Dataflow pipeline make the data queryable within seconds of ingestion, well inside the 1-minute requirement, and the pipeline scales with the expected growth in data volume because ingestion is divided into chunks (PCollections) that are processed in parallel by many workers.
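A minimal Apache Beam (Python) sketch of such a streaming pipeline (the topic, table, and schema are hypothetical):

# Illustrative sketch: stream sensor readings from Pub/Sub into BigQuery so rows
# are queryable within seconds, comfortably inside the 1-minute requirement.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions(streaming=True)  # plus runner/project/region options in practice

with beam.Pipeline(options=opts) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/sensor-data")
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:sensors.readings",
           schema="sensor_id:STRING,reading:FLOAT,event_ts:TIMESTAMP",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))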