Professional Data Engineer on Google Cloud Platform


Question 91

You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update.

What should you do?
Update the current pipeline and use the drain flag.
Update the current pipeline and provide the transform mapping JSON object.
Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old pipeline.
Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.




Answer is Update the current pipeline and use the drain flag.

The question states that no data may be lost, and all of the other options can lead to data loss. To bring down a streaming pipeline without losing data, stop it with the Drain option: Dataflow stops accepting new input but allows all processing already in flight to complete. Once the drain finishes, the updated (incompatible) pipeline can be started against the same Cloud Pub/Sub subscription.
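
As a rough sketch, assuming the gcloud CLI is available and using placeholder values for the job ID and region, the drain can be requested from Python by shelling out to gcloud:

    import subprocess

    # Placeholder job ID and region; substitute the values of the running pipeline.
    JOB_ID = "2019-01-01_00_00_00-1234567890123456789"
    REGION = "us-central1"

    # Request a drain: Dataflow stops reading new Pub/Sub messages but finishes
    # processing everything already in flight, so no data is lost.
    subprocess.run(
        ["gcloud", "dataflow", "jobs", "drain", JOB_ID, "--region", REGION],
        check=True,
    )

After the job reaches the drained state, the updated pipeline can be launched against the same subscription.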

Reference:
https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#drain

Question 92

Your software uses a simple JSON format for all messages. These messages are published to Google Cloud Pub/Sub, then processed with Google Cloud Dataflow to create a real-time dashboard for the CFO. During testing, you notice that some messages are missing in the dashboard.
You check the logs, and all messages are being published to Cloud Pub/Sub successfully.

What should you do next?
Check the dashboard application to see if it is not displaying correctly.
Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.
Use Google Stackdriver Monitoring on Cloud Pub/Sub to find the missing messages.
Switch Cloud Dataflow to pull messages from Cloud Pub/Sub instead of Cloud Pub/Sub pushing messages to Cloud Dataflow.




Answer is Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.

Stackdriver Monitoring on Cloud Pub/Sub can surface performance metrics, but it cannot tell you which individual messages went missing. Since the logs already confirm that every message reaches Cloud Pub/Sub, the next step is to run a fixed, known dataset through the Cloud Dataflow pipeline and analyze the output to see where messages are being dropped.
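
A minimal sketch of that approach using Beam's testing utilities, where parse_message and the sample payloads are hypothetical stand-ins for the real pipeline's transforms and data:

    import json

    import apache_beam as beam
    from apache_beam.testing.test_pipeline import TestPipeline
    from apache_beam.testing.util import assert_that, equal_to

    # Hypothetical transform standing in for the pipeline's real parsing step.
    def parse_message(raw):
        return json.loads(raw)

    fixed_messages = [
        '{"order_id": 1, "amount": 10.0}',
        '{"order_id": 2, "amount": 25.5}',
    ]

    with TestPipeline() as p:
        parsed = (
            p
            | beam.Create(fixed_messages)  # fixed dataset instead of Pub/Sub
            | beam.Map(parse_message)
        )
        # If the pipeline drops messages, this assertion fails and points at
        # the transform responsible.
        assert_that(parsed, equal_to([
            {"order_id": 1, "amount": 10.0},
            {"order_id": 2, "amount": 25.5},
        ]))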

Reference:
https://cloud.google.com/pubsub/docs/monitoring

Question 93

You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee.

How can you make that data available while minimizing cost?
Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.
Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.
Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.
Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.




Answer is Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.

The BigQuery documentation describes two ways to manually rename a column:

Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.

Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.

Similarly, when you need to add a derived column, an UPDATE query over the whole table is not cost-effective; writing the data into a new table is the cost-effective approach.
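
A minimal sketch of the chosen approach, assuming hypothetical project, dataset, and table names (the pipeline also needs --project and --temp_location supplied at launch):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def add_full_name(row):
        # Rows read from BigQuery arrive as dicts keyed by column name.
        row["FullName"] = row["FirstName"] + " " + row["LastName"]
        return row

    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | beam.io.ReadFromBigQuery(table="my-project:hr.Users")
            | beam.Map(add_full_name)
            | beam.io.WriteToBigQuery(
                "my-project:hr.Users_with_fullname",
                schema="FirstName:STRING,LastName:STRING,FullName:STRING",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )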

Reference:
https://cloud.google.com/bigquery/docs/manually-changing-schemas?hl=en#changing_a_columns_name

Question 94

You have enabled the free integration between Firebase Analytics and Google BigQuery. Firebase now automatically creates a new table daily in BigQuery in the format app_events_YYYYMMDD. You want to query all of the tables for the past 30 days in legacy SQL.

What should you do?
Use the TABLE_DATE_RANGE function
Use the WHERE_PARTITIONTIME pseudo column
Use WHERE date BETWEEN YYYY-MM-DD AND YYYY-MM-DD
Use SELECT IF(date >= YYYY-MM-DD AND date <= YYYY-MM-DD)




Answer is Use the TABLE_DATE_RANGE function

Legacy SQL uses the TABLE_DATE_RANGE function for this, whereas standard SQL uses a table wildcard together with the _TABLE_SUFFIX pseudo column.
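
A minimal sketch of such a legacy SQL query run through the BigQuery Python client; the dataset name is a placeholder, and TABLE_DATE_RANGE only works when legacy SQL is enabled:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Legacy SQL: TABLE_DATE_RANGE expands to every daily table whose
    # date suffix falls inside the given timestamp range.
    query = """
    SELECT COUNT(*) AS event_count
    FROM (TABLE_DATE_RANGE([firebase_analytics.app_events_],
                           DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'),
                           CURRENT_TIMESTAMP()))
    """

    job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
    for row in client.query(query, job_config=job_config).result():
        print(row.event_count)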

Reference:
https://cloud.google.com/bigquery/docs/reference/legacy-sql#table-date-range
https://cloud.google.com/blog/products/gcp/using-bigquery-and-firebase-analytics-to-understand-your-mobile-app?hl=am

Question 95

Your company is currently setting up data pipelines for their campaign. For all the Google Cloud Pub/Sub streaming data, one of the important business requirements is to be able to periodically identify the inputs and their timings during their campaign.
Engineers have decided to use windowing and transformation in Google Cloud Dataflow for this purpose. However, when testing this feature, they find that the Cloud Dataflow job fails for all streaming inserts.

What is the most likely cause of this problem?
They have not assigned the timestamp, which causes the job to fail
They have not set the triggers to accommodate the data coming in late, which causes the job to fail
They have not applied a global windowing function, which causes the job to fail when the pipeline is created
They have not applied a non-global windowing function, which causes the job to fail when the pipeline is created




Answer is They have not applied a non-global windowing function, which causes the job to fail when the pipeline is created

Caution: Beam's default windowing behavior is to assign all elements of a PCollection to a single, global window and discard late data, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must do at least one of the following:

Set a non-global windowing function. See Setting your PCollection's windowing function.

Set a non-default trigger. This allows the global window to emit results under other conditions, since the default windowing behavior (waiting for all data to arrive) will never occur.

If you don't set a non-global windowing function or a non-default trigger for your unbounded PCollection and subsequently use a grouping transform such as GroupByKey or Combine, your pipeline will generate an error upon construction and your job will fail.
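
A minimal sketch of the fix, with a placeholder subscription path and key-extraction step: apply a non-global window to the unbounded Pub/Sub input before the grouping transform.

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (
            p
            | beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/campaign-sub")
            | beam.Map(lambda msg: (msg[:1], 1))  # placeholder key extraction
            # Non-global window: without this, GroupByKey on an unbounded
            # PCollection raises an error when the pipeline is constructed.
            | beam.WindowInto(window.FixedWindows(60))
            | beam.GroupByKey()
            | beam.Map(print)
        )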

Reference:
https://beam.apache.org/documentation/programming-guide/#windowing

Question 96

You architect a system to analyze seismic data. Your extract, transform, and load (ETL) process runs as a series of MapReduce jobs on an Apache Hadoop cluster. The ETL process takes days to process a data set because some steps are computationally expensive.
Then you discover that a sensor calibration step has been omitted.

How should you change your ETL process to carry out sensor calibration systematically in the future?
Modify the transform MapReduce jobs to apply sensor calibration before they do anything else.
Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this.
Add sensor calibration data to the output of the ETL process, and document that all users need to apply sensor calibration themselves.
Develop an algorithm through simulation to predict variance of data output from the last MapReduce job based on calibration factors, and apply the correction to all data.




Answer is Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this.

A single upstream job that calibrates the raw data before anything else consumes it is the cleanest approach. Performing calibration in later stages would be more complex, and maintaining those downstream jobs would become increasingly challenging.

Question 97

An online retailer has built their current application on Google App Engine. A new initiative at the company mandates that they extend their application to allow their customers to transact directly via the application. They need to manage their shopping transactions and analyze combined data from multiple datasets using a business intelligence (BI) tool. They want to use only a single database for this purpose.

Which Google Cloud database should they choose?
BigQuery
Cloud SQL
Cloud BigTable
Cloud Datastore




Answer is Cloud SQL

Cloud SQL supports transactions and can also be analyzed through a BI tool. Firestore/Datastore does not support the SQL syntax that BI tools typically need, BigQuery is not suitable for a transactional use case, and Bigtable does not support SQL.

Question 98

You launched a new gaming app almost three years ago. You have been uploading log files from the previous day to a separate Google BigQuery table with the table name format LOGS_yyyymmdd. You have been using table wildcard functions to generate daily and monthly reports for all time ranges. Recently, you discovered that some queries that cover long date ranges are exceeding the limit of 1,000 tables and failing.

How can you resolve this issue?
Convert all daily log tables into date-partitioned tables
Convert the sharded tables into a single partitioned table
Enable query caching so you can cache data from previous months
Create separate views to cover each month, and query from these views




Answer is Convert the sharded tables into a single partitioned table

Google's documentation recommends converting date-sharded tables into a single partitioned table: a partitioned table performs better, avoids the per-query table limit, and is more time- and cost-efficient than querying many sharded tables through wildcards.
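
One way to sketch the conversion with the BigQuery Python client, using placeholder project, dataset, and table names: query the shards through a wildcard, derive the date from _TABLE_SUFFIX, and write the result into a single date-partitioned table (for very large numbers of shards this may need to be run in batches filtered on _TABLE_SUFFIX).

    from google.cloud import bigquery

    client = bigquery.Client()

    # Read every LOGS_yyyymmdd shard and recover its date from the table suffix.
    query = """
    SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS log_date, *
    FROM `my-project.game_logs.LOGS_*`
    """

    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string(
            "my-project.game_logs.logs_partitioned"),
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        time_partitioning=bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY,
            field="log_date",  # partition on the derived date column
        ),
    )

    client.query(query, job_config=job_config).result()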

Reference:
https://cloud.google.com/bigquery/docs/partitioned-tables#dt_partition_shard

Question 99

You are integrating one of your internal IT applications and Google BigQuery, so users can query BigQuery from the application's interface. You do not want individual users to authenticate to BigQuery and you do not want to give them access to the dataset. You need to securely access BigQuery from your IT application.

What should you do?
Create groups for your users and give those groups access to the dataset
Integrate with a single sign-on (SSO) platform, and pass each user's credentials along with the query request
Create a service account and grant dataset access to that account. Use the service account's private key to access the dataset
Create a dummy user and grant dataset access to that user. Store the username and password for that user in a file on the files system, and use those credentials to access the BigQuery dataset




Answer is Create a service account and grant dataset access to that account. Use the service account's private key to access the dataset

A service account is the best option when granting an application or tool access to BigQuery: the application authenticates with the service account's key, and individual users never need credentials or direct dataset access.
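
A minimal sketch of the service-account flow; the key file path and the query are placeholders, and the account is assumed to have been granted dataset access plus permission to run jobs:

    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Key file for a service account that was granted access to the dataset.
    credentials = service_account.Credentials.from_service_account_file(
        "/secure/path/app-bigquery-sa.json",
        scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )

    # The application authenticates as the service account; individual users
    # never receive BigQuery credentials or dataset access of their own.
    client = bigquery.Client(
        credentials=credentials, project=credentials.project_id)

    for row in client.query("SELECT 1 AS ok").result():
        print(row.ok)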

Question 100

You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention.

What should you do?
Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.




Answer is Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.

Cloud Dataflow is the cost-effective service for transforming streaming data, and autoscaling lets it absorb varying input volume without manual intervention. Monitoring the job's system lag in Stackdriver covers the monitoring requirement, and with autoscaling enabled the Cloud Dataflow service automatically chooses the appropriate number of worker instances required to run the job.
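
As a rough sketch, these are the relevant settings when launching a streaming Python pipeline on Dataflow; the project, region, and bucket values are placeholders, and THROUGHPUT_BASED autoscaling is spelled out here even though it corresponds to the default behavior the answer relies on:

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=us-central1",
        "--temp_location=gs://my-bucket/tmp",
        "--streaming",
        # Default autoscaling: Dataflow chooses the worker count from throughput.
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        "--max_num_workers=20",
    ])

The pipeline is then constructed with beam.Pipeline(options=options), and system lag can be watched in the Dataflow monitoring UI or Stackdriver while the service scales workers up and down.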
