Professional Data Engineer on Google Cloud Platform
278 questions in total
Question 201
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources.
How should you adjust the database design?
Add capacity (memory and disk space) to the database server by the order of 200.
Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.
Answer is Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
C is correct because normalizing into separate patient and visits tables removes the expensive self-joins, which improves performance, and it imposes less inconvenience on report users than prespecified date ranges or a separate table per clinic.
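As a rough sketch of that normalization (table and column names are illustrative, not taken from the question):

CREATE TABLE patient (
  patient_id INT PRIMARY KEY,
  name       VARCHAR(100),
  clinic_id  INT
);

CREATE TABLE visit (
  visit_id   INT PRIMARY KEY,
  patient_id INT REFERENCES patient(patient_id),
  visit_date DATE,
  notes      TEXT
);

-- Reports now join two narrow tables instead of self-joining one wide table:
SELECT p.patient_id, p.name, v.visit_date
FROM patient p
JOIN visit v ON v.patient_id = p.patient_id;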
Question 202
Your company is using WILDCARD tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:
# Syntax error : Expected end of statement but got "-" at [4:11]
SELECT age
FROM bigquery-public-data.noaa_gsod.gsod
WHERE
age != 99
AND _TABLE_SUFFIX = '1929'
ORDER BY age DESC
Which table name will make the SQL statement work correctly?
"˜bigquery-public-data.noaa_gsod.gsod"˜
bigquery-public-data.noaa_gsod.gsod*
"˜bigquery-public-data.noaa_gsod.gsod'*
`bigquery-public-data.noaa_gsod.gsod*`
Answer is `bigquery-public-data.noaa_gsod.gsod*` because it follows the correct wildcard syntax: the table name is enclosed in backticks and ends with the * wildcard character.
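For reference, the corrected statement (assuming the intent of the original query) would be:

#standardSQL
SELECT age
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE age != 99
  AND _TABLE_SUFFIX = '1929'
ORDER BY age DESC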
Question 203
Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take? (Choose three.)
Disable writes to certain tables.
Restrict access to tables by role.
Ensure that the data is encrypted at all times.
Restrict BigQuery API access to approved users.
Segregate data across multiple tables or databases.
Use Google Stackdriver Audit Logging to determine policy violations.
Answer is
B. Restrict access to tables by role.
D. Restrict BigQuery API access to approved users.
E. Segregate data across multiple tables or databases.
B. Restrict access to tables by role: This approach can be used to control access to tables based on user roles. Access controls can be set at the project, dataset, and table level, and roles can be customized to provide granular access controls to different groups of users.
D. Restrict BigQuery API access to approved users: This approach involves using IAM (Identity and Access Management) to control access to the BigQuery API. Access can be granted or revoked at the project or dataset level, and policies can be customized to control access based on user roles, IP addresses, and other factors.
E. Segregate data across multiple tables or databases: This approach involves breaking down large datasets into smaller, more manageable tables or databases. This helps to ensure that individual users have access only to the minimum amount of information required to do their jobs, and reduces the risk of data breaches or policy violations.
Option A is incorrect because disabling writes only restricts updates; it does nothing to limit which information users can read, which is what the requirement is about.
Option C is incorrect because, while encryption is important for security, it does not by itself control which users can see which information.
Option F is incorrect because, while Google Stackdriver Audit Logging can reveal policy violations after the fact, it does not enforce access controls or data segregation.
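As an illustrative sketch of restricting a dataset's tables by role with the BigQuery Python client (the project, dataset, and user names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset containing only the tables an analyst group needs.
dataset = client.get_dataset("my-project.analyst_reporting")

# Grant read-only access to one approved user; users without an entry see nothing.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",              # dataset-level read-only role
        entity_type="userByEmail",
        entity_id="analyst@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])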
Question 204
Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other's data. You want to ensure appropriate access to the data.
Which three steps should you take? (Choose three.)
Load data into different partitions.
Load data into a different dataset for each client.
Put each client's BigQuery dataset into a different table.
Restrict a client's dataset to approved users.
Only allow a service account to access the datasets.
Use the appropriate identity and access management (IAM) roles for each client's users.
Answer is B, D, and F: load data into a different dataset for each client, restrict each client's dataset to approved users, and use the appropriate IAM roles for each client's users.
Question 205
You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables.
What should you do?
Make a call to the Stackdriver API to list all logs, and apply an advanced filter.
In the Stackdriver logging admin interface, enable a log sink export to BigQuery.
In the Stackdriver logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.
Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.
Answer is Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.
Using a Logging sink, you can build an event-driven system to detect and respond to log events in real time. Cloud Logging can help you to build this event-driven architecture through its integration with Cloud Pub/Sub and a serverless computing service such as Cloud Functions or Cloud Run.
However, because we need notifications only for inserts into one specific table, the sink needs an advanced log filter so that only the matching audit log entries are exported to Pub/Sub.
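A hedged sketch of creating such a sink with the Cloud Logging Python client (the project, table, topic, and sink names are hypothetical, and the exact audit-log field paths depend on which BigQuery audit log format your project emits):

import google.cloud.logging

# Match completed BigQuery load/insert jobs that write to one specific table.
LOG_FILTER = (
    'resource.type="bigquery_resource" '
    'protoPayload.methodName="jobservice.jobcompleted" '
    'protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.load.'
    'destinationTable.tableId="orders"'
)

client = google.cloud.logging.Client()
sink = client.sink(
    "orders-insert-jobs",
    filter_=LOG_FILTER,
    destination="pubsub.googleapis.com/projects/my-project/topics/orders-inserts",
)
sink.create()  # the monitoring tool subscribes to the Pub/Sub topic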
Question 206
Your company's customer and order databases are often under heavy load. This makes performing analytics against them difficult without harming operations.
The databases are in a MySQL cluster, with nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations.
What should you do?
Add a node to the MySQL cluster and build an OLAP cube there.
Use an ETL tool to load the data from MySQL into Google BigQuery.
Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.
Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.
Answer is Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.
A: Building an OLAP cube on another MySQL node performs poorly and keeps the analytical load on the cluster.
B: An ETL job must read all of the data out of MySQL, consuming resources on a cluster that is already under heavy load.
C: Similar to B; the Hadoop ETL still reads from the live MySQL cluster.
D: By mounting the nightly backups we avoid reading from the live cluster entirely, and data freshness is not stated as a requirement in the question.
Question 207
Your company is running its first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable.
The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost.
What should they do?
Redefine the schema by evenly distributing reads and writes across the row space of the table.
The performance issue should be resolved over time as the size of the Bigtable cluster is increased.
Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.
Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.
Answer is Redefine the schema by evenly distributing reads and writes across the row space of the table.
If you find that you're reading and writing only a small number of rows, you might need to redesign your schema so that reads and writes are more evenly distributed.
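As a hedged sketch of that idea with the Bigtable Python client (the instance, table, column family, and key scheme are hypothetical): lead the row key with a well-distributed field such as a user or offer ID rather than a timestamp or sequential counter, so writes spread across tablets.

import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("offers-instance").table("offer_signals")

def make_row_key(user_id: str, ts: datetime.datetime) -> bytes:
    # Leading with user_id distributes writes across the row space;
    # a timestamp-first or sequential key would concentrate all new
    # writes on a single node.
    return f"{user_id}#{ts.strftime('%Y%m%d%H%M%S')}".encode()

row = table.direct_row(make_row_key("user-4821", datetime.datetime.utcnow()))
row.set_cell("signals", "offer_viewed", b"offer-123")
row.commit()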
Question 208
Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than before. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You were asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs.
What should you recommend they do?
Rewrite the job in Pig.
Rewrite the job in Apache Spark.
Increase the size of the Hadoop cluster.
Decrease the size of the Hadoop cluster but also rewrite the job in Hive.
Answer is Rewrite the job in Apache Spark.
The objective is to avoid increasing cost while still delivering the required analytics. MapReduce jobs are not as efficient or as fast as Spark, which keeps intermediate data in memory instead of writing it to disk between stages, so rewriting the job in Spark improves responsiveness on the same cluster and keeps the jobs from falling behind.
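As an illustrative sketch only (the job logic and paths are hypothetical), the same kind of daily batch aggregation expressed in PySpark runs in memory on the existing cluster:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-analytics").getOrCreate()

# Hypothetical input: the same HDFS data the MapReduce job reads today.
events = spark.read.json("hdfs:///data/events/2023-10-01/")

# Equivalent of a map + reduce aggregation, executed in memory across the cluster.
daily_counts = (
    events
    .groupBy("event_type")
    .agg(F.count("*").alias("event_count"))
)

daily_counts.write.mode("overwrite").parquet("hdfs:///reports/daily_counts/")
spark.stop()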
Question 209
You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible.
What should you do?
Change the processing job to use Google Cloud Dataproc instead.
Manually start the Cloud Dataflow job each morning when you get into the office.
Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.
Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.
Answer is Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.
App Engine Cron Service schedules the job with no manual effort; for comparison, Cloud Scheduler (the managed scheduler for ad hoc jobs) offers 3 jobs free per month and charges $0.10 per job per month beyond that.
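A minimal cron.yaml sketch for App Engine Cron Service (the handler URL is hypothetical; the App Engine endpoint it calls would launch the Cloud Dataflow job, for example from a Dataflow template):

cron:
- description: "launch daily Dataflow log-processing job"
  url: /launch-dataflow-job
  schedule: every day 02:10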
Question 210
Your company is loading comma-separated values (CSV) files into Google BigQuery. The data imports successfully; however, the imported data does not match the source file byte-for-byte.
What is the most likely cause of this problem?
The CSV data loaded in BigQuery is not flagged as CSV.
The CSV data has invalid rows that were skipped on import.
The CSV data loaded in BigQuery is not using BigQuery's default encoding.
The CSV data has not gone through an ETL phase before loading into BigQuery.
Answer is The CSV data loaded in BigQuery is not using BigQuery's default encoding.
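BigQuery expects CSV data to be UTF-8 by default and converts other encodings to UTF-8 as it loads, so the stored bytes can differ from the source file. A hedged sketch of specifying the source encoding explicitly with the Python client (the bucket, file, table, and encoding are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical source file and destination table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    encoding="ISO-8859-1",   # declare the real source encoding
    skip_leading_rows=1,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/orders.csv",
    "my-project.sales.orders",
    job_config=job_config,
)
load_job.result()  # BigQuery converts the data to UTF-8 as it loads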