Professional Data Engineer on Google Cloud Platform


Question 151

You are updating the code for a subscriber to a Pub/Sub feed. You are concerned that upon deployment the subscriber may erroneously acknowledge messages, leading to message loss. Your subscriber is not set up to retain acknowledged messages.

What should you do to ensure that you can recover from errors after deployment?
Set up the Pub/Sub emulator on your local machine. Validate the behavior of your new subscriber logic before deploying it to production.
Create a Pub/Sub snapshot before deploying new subscriber code. Use a Seek operation to re-deliver messages that became available after the snapshot was created.
Use Cloud Build for your deployment. If an error occurs after deployment, use a Seek operation to locate a timestamp logged by Cloud Build at the start of the deployment.
Enable dead-lettering on the Pub/Sub topic to capture messages that aren't successfully acknowledged. If an error occurs after deployment, re-deliver any messages captured by the dead-letter queue.




Answer is Create a Pub/Sub snapshot before deploying new subscriber code. Use a Seek operation to re-deliver messages that became available after the snapshot was created.

According to the second reference in the list below, a concern with deploying new subscriber code is that the new executable may erroneously acknowledge messages, leading to message loss. Incorporating snapshots into your deployment process gives you a way to recover from bugs in new subscriber code.

The Cloud Build option is incorrect because seeking to a timestamp requires the subscription to be configured to retain acknowledged messages (retain-acked-messages), and this subscriber is not set up to retain them. When retain-acked-messages is set, Pub/Sub retains acknowledged messages for 7 days.
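
For illustration, a minimal sketch of the snapshot-and-seek workflow with the google-cloud-pubsub Python client (project, subscription, and snapshot names are placeholders):

    # Sketch: capture the subscription's state before deploying, then seek back to
    # the snapshot if the new subscriber code misbehaves. Project, subscription,
    # and snapshot names are placeholders.
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "my-subscription")
    snapshot_path = subscriber.snapshot_path("my-project", "pre-deploy-snapshot")

    # Before deployment: snapshot the subscription's unacknowledged message state.
    subscriber.create_snapshot(
        request={"name": snapshot_path, "subscription": subscription_path}
    )

    # After a bad deployment: seek the subscription back to the snapshot so that
    # messages unacknowledged at snapshot time (plus messages published since)
    # are redelivered to the fixed subscriber.
    subscriber.seek(request={"subscription": subscription_path, "snapshot": snapshot_path})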

References:
https://cloud.google.com/pubsub/docs/replay-message
https://cloud.google.com/pubsub/docs/replay-overview#seek_use_cases

Question 152

You work for a large real estate firm and are preparing 6 TB of home sales data to be used for machine learning. You will use SQL to transform the data and use BigQuery ML to create a machine learning model. You plan to use the model for predictions against a raw dataset that has not been transformed.

How should you set up your workflow in order to prevent skew at prediction time?
When creating your model, use BigQuery's TRANSFORM clause to define preprocessing steps. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any transformations on the raw input data.
When creating your model, use BigQuery's TRANSFORM clause to define preprocessing steps. Before requesting predictions, use a saved query to transform your raw input data, and then use ML.EVALUATE.
Use a BigQuery view to define your preprocessing logic. When creating your model, use the view as your model training data. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any transformations on the raw input data.
Preprocess all data using Dataflow. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any further transformations on the input data.




Answer is When creating your model, use BigQuery's TRANSFORM clause to define preprocessing steps. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any transformations on the raw input data.

Using the TRANSFORM clause, you can specify all preprocessing during model creation. The preprocessing is automatically applied during the prediction and evaluation phases of machine learning.
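
For illustration, a minimal sketch of a CREATE MODEL statement with a TRANSFORM clause, submitted through the BigQuery Python client (the dataset, table, column names, and chosen transformations are hypothetical). At prediction time the raw columns are passed straight to ML.PREDICT and the TRANSFORM preprocessing is applied automatically:

    # Sketch: define preprocessing once in the TRANSFORM clause so BigQuery ML
    # reapplies it automatically at prediction/evaluation time.
    # Dataset, table, and column names below are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    create_model_sql = """
    CREATE OR REPLACE MODEL `mydataset.home_price_model`
    TRANSFORM(
      ML.STANDARD_SCALER(square_feet) OVER () AS square_feet_scaled,
      ML.QUANTILE_BUCKETIZE(year_built, 4) OVER () AS year_built_bucket,
      label
    )
    OPTIONS(model_type = 'linear_reg', input_label_cols = ['label']) AS
    SELECT square_feet, year_built, sale_price AS label
    FROM `mydataset.home_sales`
    """
    client.query(create_model_sql).result()

    # At prediction time, raw (untransformed) rows are passed directly;
    # the TRANSFORM preprocessing is applied automatically.
    predict_sql = """
    SELECT * FROM ML.PREDICT(MODEL `mydataset.home_price_model`,
      (SELECT square_feet, year_built FROM `mydataset.new_listings`))
    """
    rows = client.query(predict_sql).result()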

Reference:
https://cloud.google.com/bigquery-ml/docs/bigqueryml-transform

Question 153

You are analyzing the price of a company's stock. Every 5 seconds, you need to compute a moving average of the past 30 seconds' worth of data. You are reading data from Pub/Sub and using Dataflow to conduct the analysis.

How should you set up your windowed pipeline?
Use a fixed window with a duration of 5 seconds. Emit results by setting the following trigger: AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))
Use a fixed window with a duration of 30 seconds. Emit results by setting the following trigger: AfterWatermark.pastEndOfWindow().plusDelayOf(Duration.standardSeconds(5))
Use a sliding window with a duration of 5 seconds. Emit results by setting the following trigger: AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))
Use a sliding window with a duration of 30 seconds and a period of 5 seconds. Emit results by setting the following trigger: AfterWatermark.pastEndOfWindow()




Answer is Use a sliding window with a duration of 30 seconds and a period of 5 seconds. Emit results by setting the following trigger: AfterWatermark.pastEndOfWindow()

Sliding window: Since you need to compute a moving average of the past 30 seconds' worth of data every 5 seconds, a sliding window is appropriate. A sliding window allows overlapping intervals and is well-suited for computing rolling aggregates.
Window duration: The duration should be 30 seconds to cover the 30 seconds' worth of data required for the moving average calculation.
Window period: The period (sliding interval) should be 5 seconds so the window advances every 5 seconds and the moving average is recalculated with the latest data.
Trigger: AfterWatermark.pastEndOfWindow() emits the computed moving average when the watermark advances past the end of the window, ensuring that all data within the window is considered before the result is emitted.
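
The trigger in the option is written in the Java SDK; below is a rough equivalent in the Apache Beam Python SDK (a minimal sketch: the topic name and parsing logic are placeholders, and the default trigger already fires after the watermark passes the end of the window):

    # Sketch: 30-second sliding windows advancing every 5 seconds, emitting a mean
    # price per window under the default after-watermark trigger.
    # Topic name and parsing are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadPrices" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/stock-prices")
            | "ParsePrice" >> beam.Map(lambda msg: float(msg.decode("utf-8")))
            | "SlidingWindow" >> beam.WindowInto(window.SlidingWindows(size=30, period=5))
            | "MovingAverage" >> beam.CombineGlobally(beam.combiners.MeanCombineFn()).without_defaults()
            | "Print" >> beam.Map(print)
        )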

Reference:
https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#hopping-windows

Question 154

You are designing a pipeline that publishes application events to a Pub/Sub topic. Although message ordering is not important, you need to be able to aggregate events across disjoint hourly intervals before loading the results to BigQuery for analysis.

What technology should you use to process and load this data to BigQuery while ensuring that it will scale with large volumes of events?
Create a Cloud Function to perform the necessary data processing that executes using the Pub/Sub trigger every time a new message is published to the topic.
Schedule a Cloud Function to run hourly, pulling all available messages from the Pub/Sub topic and performing the necessary aggregations.
Schedule a batch Dataflow job to run hourly, pulling all available messages from the Pub/Sub topic and performing the necessary aggregations.
Create a streaming Dataflow job that reads continually from the Pub/Sub topic and performs the necessary aggregations using tumbling windows.




Answer is Create a streaming Dataflow job that reads continually from the Pub/Sub topic and performs the necessary aggregations using tumbling windows.

TUMBLE => fixed windows
HOP => sliding windows
SESSION => session windows

A streaming Dataflow job is the best way to process this data from Pub/Sub and load it into BigQuery: it reads continuously, scales to handle large volumes of events, and can aggregate over disjoint hourly intervals using tumbling (fixed) windows.
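
As a rough sketch, such a pipeline in the Apache Beam Python SDK might look like the following (the topic, table, schema, and per-event-type count are placeholder choices, not part of the question):

    # Sketch: streaming pipeline with hourly tumbling (fixed) windows feeding BigQuery.
    # Topic, table, schema, and the aggregation itself are placeholders.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/app-events")
            | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "HourlyWindow" >> beam.WindowInto(window.FixedWindows(60 * 60))  # tumbling, 1 hour
            | "KeyByEventType" >> beam.Map(lambda event: (event["type"], 1))
            | "CountPerType" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(lambda kv: {"event_type": kv[0], "event_count": kv[1]})
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:analytics.hourly_event_counts",
                schema="event_type:STRING,event_count:INTEGER",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )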

Reference:
https://cloud.google.com/dataflow/docs/concepts/streaming-pipelines#tumbling-windows

Question 155

You are migrating an application that tracks library books and information about each book, such as author or year published, from an on-premises data warehouse to BigQuery.
In your current relational database, the author information is kept in a separate table and joined to the book information on a common key.

Based on Google's recommended practice for schema design, how would you structure the data to ensure optimal speed of queries about the author of each book that has been borrowed?
Keep the schema the same, maintain the different tables for the book and each of the attributes, and query as you are doing today.
Create a table that is wide and includes a column for each attribute, including the author's first name, last name, date of birth, etc.
Create a table that includes information about the books and authors, but nest the author fields inside the author column.
Keep the schema the same, create a view that joins all of the tables, and always query the view.




Answer is Create a table that includes information about the books and authors, but nest the author fields inside the author column.

Use nested and repeated fields to denormalize data storage, which increases query performance. BigQuery doesn't require completely flat denormalization; nested and repeated fields let you maintain relationships (such as book-to-author) without a join.
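
For illustration only, a hypothetical books table with the author nested as a STRUCT, created and queried through the BigQuery Python client (dataset, table, and field names are made up):

    # Sketch: denormalize author attributes into a nested (STRUCT) column instead of
    # a separate joined table. Dataset, table, and field names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    ddl = """
    CREATE TABLE `library.books` (
      book_id STRING,
      title STRING,
      year_published INT64,
      author STRUCT<
        first_name STRING,
        last_name STRING,
        date_of_birth DATE
      >
    )
    """
    client.query(ddl).result()

    # Queries about a borrowed book's author then need no join:
    query = """
    SELECT title, author.first_name, author.last_name
    FROM `library.books`
    WHERE book_id = 'b123'
    """
    client.query(query).result()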

Reference:
https://cloud.google.com/bigquery/docs/best-practices-performance-nested

Question 156

You are using Bigtable to persist and serve stock market data for each of the major indices. To serve the trading application, you need to access only the most recent stock prices that are streaming in.

How should you design your row key and tables to ensure that you can access the data with the simplest query?
Create one unique table for all of the indices, and then use the index and timestamp as the row key design.
Create one unique table for all of the indices, and then use a reverse timestamp as the row key design.
For each index, have a separate table and use a timestamp as the row key design.
For each index, have a separate table and use a reverse timestamp as the row key design.




Answer is Create one unique table for all of the indices, and then use a reverse timestamp as the row key design.

If you usually retrieve the most recent records first, you can use a reversed timestamp in the row key by subtracting the timestamp from your programming language's maximum value for long integers (in Java, java.lang.Long.MAX_VALUE). With a reversed timestamp, the records will be ordered from most recent to least recent.
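
A rough sketch of that row key construction with the Bigtable Python client (instance, table, and column family names are placeholders; the reversal uses the same maximum long value the explanation mentions):

    # Sketch: build an "index#reversed_timestamp" row key so the newest price for an
    # index sorts first. Instance, table, and column family names are placeholders.
    import time
    from google.cloud import bigtable

    JAVA_LONG_MAX = 2**63 - 1  # same value as java.lang.Long.MAX_VALUE

    def make_row_key(index_name: str, timestamp_millis: int) -> bytes:
        # Zero-pad so lexicographic order matches numeric order.
        reversed_ts = JAVA_LONG_MAX - timestamp_millis
        return f"{index_name}#{reversed_ts:020d}".encode("utf-8")

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("market-data")

    # Write the latest price for an index.
    row = table.direct_row(make_row_key("SP500", int(time.time() * 1000)))
    row.set_cell("prices", b"close", b"5021.84")
    row.commit()

    # Read the most recent row for that index: the first row with the "SP500#" prefix.
    rows = table.read_rows(start_key=b"SP500#", end_key=b"SP500$", limit=1)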

Reference:
https://cloud.google.com/bigtable/docs/schema-design#time-based

Question 157

Your company is in the process of migrating its on-premises data warehousing solutions to BigQuery.
The existing data warehouse uses trigger-based change data capture (CDC) to apply updates from multiple transactional database sources on a daily basis.
With BigQuery, your company hopes to improve its handling of CDC so that changes to the source systems are available to query in BigQuery in near-real time using log-based CDC streams, while also optimizing for the performance of applying changes to the data warehouse.

Which two steps should they take to ensure that changes are available in the BigQuery reporting table with minimal latency while reducing compute overhead? (Choose two.)
Perform a DML INSERT, UPDATE, or DELETE to replicate each individual CDC record in real time directly on the reporting table.
Insert each new CDC record and corresponding operation type to a staging table in real time.
Periodically DELETE outdated records from the reporting table.
Periodically use a DML MERGE to perform several DML INSERT, UPDATE, and DELETE operations at the same time on the reporting table.
Insert each new CDC record and corresponding operation type in real time to the reporting table, and use a materialized view to expose only the newest version of each unique record.




Answers are:
B. Insert each new CDC record and corresponding operation type to a staging table in real time.
D. Periodically use a DML MERGE to perform several DML INSERT, UPDATE, and DELETE operations at the same time on the reporting table.

Delta tables contain all change events for a particular table since the initial load. Having all change events available can be valuable for identifying trends, the state of the entities that a table represents at a particular moment, or change frequency.

The best way to merge data frequently and consistently is to use a MERGE statement, which lets you combine multiple INSERT, UPDATE, and DELETE statements into a single atomic operation.
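
As an illustration of the periodic merge step, a minimal sketch of a BigQuery MERGE run through the Python client (table and column names are hypothetical, and the staging table is assumed to record the CDC operation type and change timestamp):

    # Sketch: periodically fold staged CDC records into the reporting table with a
    # single atomic MERGE. Table and column names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    merge_sql = """
    MERGE `warehouse.customers` AS target
    USING (
      -- Keep only the latest change per key from the staging (delta) table.
      SELECT * EXCEPT (rn) FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY change_ts DESC) AS rn
        FROM `warehouse.customers_staging`
      )
      WHERE rn = 1
    ) AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED AND source.operation = 'DELETE' THEN
      DELETE
    WHEN MATCHED AND source.operation != 'DELETE' THEN
      UPDATE SET name = source.name, email = source.email, updated_at = source.change_ts
    WHEN NOT MATCHED AND source.operation != 'DELETE' THEN
      INSERT (customer_id, name, email, updated_at)
      VALUES (source.customer_id, source.name, source.email, source.change_ts)
    """
    client.query(merge_sql).result()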

Reference:
https://cloud.google.com/architecture/database-replication-to-bigquery-using-change-data-capture#overview_of_cdc_data_replication

Question 158

You are loading CSV files from Cloud Storage to BigQuery. The files have known data quality issues, including mismatched data types, such as STRINGs and INT64s in the same column, and inconsistent formatting of values such as phone numbers or addresses. You need to create the data pipeline to maintain data quality and perform the required cleansing and transformation. What should you do?
Use Data Fusion to transform the data before loading it into BigQuery.
Use Data Fusion to convert the CSV files to a self-describing data format, such as AVRO, before loading the data to BigQuery.
Load the CSV files into a staging table with the desired schema, perform the transformations with SQL, and then write the results to the final destination table.
Create a table with the desired schema, load the CSV files into the table, and perform the transformations in place using SQL.




Answer is Use Data Fusion to transform the data before loading it into BigQuery.

Data Fusion's advantages:

Visual interface: Offers a user-friendly interface for designing data pipelines without extensive coding, making it accessible to a wider range of users.
Built-in transformations: Includes a wide range of pre-built transformations to handle common data quality issues, such as:
Data type conversions
Data cleansing (e.g., removing invalid characters, correcting formatting)
Data validation (e.g., checking for missing values, enforcing constraints)
Data enrichment (e.g., adding derived fields, joining with other datasets)
Custom transformations: Allows for custom transformations using SQL or Java code for more complex cleaning tasks.
Scalability: Can handle large datasets efficiently, making it suitable for processing CSV files with potential data quality issues.
Integration with BigQuery: Integrates seamlessly with BigQuery, allowing for direct loading of transformed data.


Why other options are less suitable:

B. Converting to AVRO: While AVRO is a self-describing format, it doesn't inherently address data quality issues. Transformations would still be needed, and Data Fusion provides a more comprehensive environment for this.
C. Staging table: Requires manual SQL transformations, which can be time-consuming and error-prone for large datasets with complex data quality issues.
D. Transformations in place: Modifying data directly in the destination table can lead to data loss or corruption if errors occur. It's generally safer to keep raw data intact and perform transformations separately.
By using Data Fusion, you can create a robust and efficient data pipeline that addresses data quality issues upfront, ensuring that only clean and consistent data is loaded into BigQuery for accurate analysis and insights.

Reference:
https://cloud.google.com/data-fusion/docs/concepts/overview

Question 159

You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?
Continuously retrain the model on just the new data.
Continuously retrain the model on a combination of existing data and the new data.
Train on the existing data while using the new data as your test set.
Train on the new data while using the existing data as your test set.




Answer is Continuously retrain the model on a combination of existing data and the new data.

In a recommendation system, a user/item matrix stored in a database holds each user's ratings and other signals, such as how much time the user spent on a page, and recommendations are generated from that matrix. In production the matrix is updated (or the model is retrained, typically once a day or once a week) because new ratings, that is, new data, keep arriving.

So the model is retrained on the full dataset (existing data plus new data) and generates a new user/item matrix.
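
As a rough sketch of that retraining loop (file paths, feature columns, and the model are placeholder choices, not prescribed by the question):

    # Sketch: periodic full retraining on historical plus newly streamed data.
    # File paths, features, and the model choice are placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    existing = pd.read_parquet("gs://my-bucket/training/existing.parquet")
    new = pd.read_parquet("gs://my-bucket/training/new_events.parquet")

    # Retrain on the combined dataset so long-term preferences and recent trends
    # are both represented.
    combined = pd.concat([existing, new], ignore_index=True)
    X, y = combined.drop(columns=["purchased"]), combined["purchased"]

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)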

Reference:
https://docs.aws.amazon.com/machine-learning/latest/dg/retraining-models-on-new-data.html

Question 160

You are creating a model to predict housing prices. Due to budget constraints, you must run it on a single resource-constrained virtual machine. Which learning algorithm should you use?
Linear regression
Logistic classification
Recurrent neural network
Feedforward neural network




Answer is Linear regression

A tip for deciding between linear and logistic regression: if you are forecasting, that is, the column you are predicting is numeric (such as a price), use linear regression; if you are classifying, such as buy or no buy, yes or no, use logistic regression. Linear regression is also computationally lightweight, so it fits comfortably on a single resource-constrained virtual machine, unlike recurrent or feedforward neural networks.
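
A minimal sketch of such a model with scikit-learn (the file name and feature columns are hypothetical):

    # Sketch: a linear regression for house price prediction; it trains quickly even
    # on a small, resource-constrained VM. Data source and features are placeholders.
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("home_sales.csv")  # hypothetical file
    X = data[["square_feet", "bedrooms", "year_built"]]
    y = data["sale_price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))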

Reference:
https://towardsdatascience.com/predicting-house-prices-with-linear-regression-machine-learning-from-scratch-part-ii-47a0238aeac1
