AWS Certified Data Analytics - Specialty


Question 1

An Internet of Things company is developing a new device that gathers data on sleep patterns from users sleeping on a smart mattress. Sensors send the data to an Amazon S3 bucket, and each bed generates about 2 MB of data per night. Each user's data must be processed and summarized, and the results must be made available as soon as possible. The processing includes time windowing and other operations. Based on testing with a Python script, each run needs about 1 GB of memory and takes a few minutes to complete.

Which option is the MOST cost-effective approach to execute the script?
AWS Lambda with a Python script
AWS Glue with a Scala job
Amazon EMR with an Apache Spark script
AWS Glue with a PySpark job




Answer is AWS Glue with a PySpark job

AWS Glue supports the required time-windowing operations (the linked AWS Glue DataBrew page lists the available window functions), and the existing Python logic can be reused in a serverless PySpark job without provisioning or managing an EMR cluster.

Reference:
https://docs.aws.amazon.com/databrew/latest/dg/recipe-actions.functions.window.html
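
As a rough illustration of the chosen approach, the sketch below shows a Glue PySpark job that summarizes each night's readings with a time window. The bucket paths and column names (bed_id, event_time, heart_rate) are assumptions for illustration, not part of the question:

# Minimal sketch of a Glue PySpark job that summarizes per-bed sleep data
# with a 10-minute time window. Paths and column names are assumptions.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the night's sensor readings that were landed in S3.
readings = spark.read.json("s3://example-sleep-sensor-bucket/raw/")

# Aggregate readings into 10-minute windows per bed.
summary = (
    readings
    .groupBy("bed_id", F.window(F.col("event_time").cast("timestamp"), "10 minutes"))
    .agg(F.avg("heart_rate").alias("avg_heart_rate"))
)

summary.write.mode("overwrite").parquet("s3://example-sleep-sensor-bucket/summary/")
job.commit()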

Question 2

An online retailer is redesigning its inventory management and inventory reordering systems to automate product reordering with Amazon Kinesis Data Streams. The inventory management system uses the Kinesis Producer Library (KPL) to publish data to a stream, and the inventory reordering system uses the Kinesis Client Library (KCL) to consume data from the stream. The stream is configured to scale up or down as needed. Just before production deployment, the retailer notices that the inventory reordering system is receiving duplicate data.

Which factors could be responsible for the duplicate data? (Select two.)
The producer has a network-related timeout.
The stream's value for the IteratorAgeMilliseconds metric is too high.
There was a change in the number of shards, record processors, or both.
The AggregationEnabled configuration property was set to true.
The max_records configuration property was set to a number that is too high.




Answers are:
A. The producer has a network-related timeout.
C. There was a change in the number of shards, record processors, or both.

Producer retries after a network-related timeout can write the same record to the stream more than once, and a change in the number of shards or record processors (resharding) causes records to be reprocessed from the last checkpoint. The general guidance is to make downstream processing idempotent.


Reference:
https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html
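
A minimal consumer-side deduplication sketch along those lines, assuming the producer embeds a unique order_id field in each JSON payload (a hypothetical field name, not from the question):

import json

seen_ids = set()  # in-memory only; a real system would persist processed IDs

def process_record(data: bytes) -> None:
    payload = json.loads(data)
    order_id = payload["order_id"]  # hypothetical unique key set by the producer
    if order_id in seen_ids:
        return  # duplicate delivered by a producer retry or after a reshard
    seen_ids.add(order_id)
    reorder_product(payload)

def reorder_product(payload: dict) -> None:
    # Placeholder for the real reordering logic.
    print("Reordering:", payload)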

Question 3

A real estate company uses Apache HBase on Amazon EMR to power a mission-critical application. The Amazon EMR cluster is configured with a single master node, and more than 5 TB of the company's data is stored on the Hadoop Distributed File System (HDFS). The company is looking for a cost-effective way to increase the availability of its HBase data.

Which architectural design best fulfills the needs of the business?
Use Spot Instances for core and task nodes and a Reserved Instance for the EMR master node. Configure the EMR cluster with multiple master nodes. Schedule automated snapshots using Amazon EventBridge.
Store the data on an EMR File System (EMRFS) instead of HDFS. Enable EMRFS consistent view. Create an EMR HBase cluster with multiple master nodes. Point the HBase root directory to an Amazon S3 bucket.
Store the data on an EMR File System (EMRFS) instead of HDFS and enable EMRFS consistent view. Run two separate EMR clusters in two different Availability Zones. Point both clusters to the same HBase root directory in the same Amazon S3 bucket.
Store the data on an EMR File System (EMRFS) instead of HDFS and enable EMRFS consistent view. Create a primary EMR HBase cluster with multiple master nodes. Create a secondary EMR HBase read-replica cluster in a separate Availability Zone. Point both clusters to the same HBase root directory in the same Amazon S3 bucket.




Answer is Store the data on an EMR File System (EMRFS) instead of HDFS and enable EMRFS consistent view. Create a primary EMR HBase cluster with multiple master nodes. Create a secondary EMR HBase read-replica cluster in a separate Availability Zone. Point both clusters to the same HBase root directory in the same Amazon S3 bucket.

Use cases for HBase on S3 read replica clusters
Using HBase on S3 stores the data safely and durably. It persists the data off-cluster, which eliminates the risk of losing persisted writes when the cluster is terminated. However, there can be situations where you want your HBase data to be highly available even in the rare event of a cluster or Availability Zone failure, or where multiple clusters need to access the same root directory in S3. If a primary cluster comes under heavy load during bulk loads, writes, and compactions, read-replica clusters let you off-load and separate the read load from the write load, so you can meet your read SLAs while optimizing for cost and performance.
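
A hedged boto3 sketch of the secondary (read-replica) cluster is shown below. The hbase.rootdir, hbase.emr.storageMode, and hbase.emr.readreplica.enabled properties are the relevant settings; the cluster name, bucket, release label, and instance sizing are illustrative assumptions:

# Sketch of launching the secondary HBase read-replica cluster that points at
# the same HBase root directory on S3. Names and sizing are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="hbase-read-replica",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "HBase"}],
    Configurations=[
        {
            "Classification": "hbase-site",
            "Properties": {"hbase.rootdir": "s3://example-hbase-bucket/hbase-root/"},
        },
        {
            "Classification": "hbase",
            "Properties": {
                "hbase.emr.storageMode": "s3",
                # Only on the secondary cluster: serve reads from the shared root dir.
                "hbase.emr.readreplica.enabled": "true",
            },
        },
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 3},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 3},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])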

Question 4

Once a month, a corporation receives a 100 MB .csv file compressed with gzip. The file is stored in Amazon S3 Glacier and contains 50,000 property listing records.
The company's data analyst needs to query a subset of the data for a specific vendor.

Which approach is the most cost-effective?
Load the data into Amazon S3 and query it with Amazon S3 Select.
Query the data from Amazon S3 Glacier directly with Amazon Glacier Select.
Load the data to Amazon S3 and query it with Amazon Athena.
Load the data to Amazon S3 and query it with Amazon Redshift Spectrum.




Answer is Load the data into Amazon S3 and query it with Amazon S3 Select.

Because the file is gzip-compressed, Amazon S3 Glacier Select cannot be used: Glacier Select queries only uncompressed CSV data. Athena is best suited when there are many objects to query; here the data is a single compressed file. The most cost-effective approach is therefore to restore the file to Amazon S3 and query it with S3 Select, which supports GZIP-compressed CSV.
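
As an illustration, a minimal boto3 sketch of the S3 Select query is shown below; the bucket, key, and vendor column name are assumptions:

# Query the gzip-compressed CSV with S3 Select after it has been restored
# and copied to Amazon S3. Bucket, key, and column names are assumptions.
import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="example-listings-bucket",
    Key="listings/2023-10.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.vendor_id = 'VENDOR123'",
    InputSerialization={
        "CSV": {"FileHeaderInfo": "USE"},
        "CompressionType": "GZIP",  # S3 Select can read GZIP-compressed CSV
    },
    OutputSerialization={"CSV": {}},
)

for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")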

Question 5

A human resources organization runs analytics queries on the company's data using a 10-node Amazon Redshift cluster. The cluster contains two tables, one for products and one for transactions, and both have a product_sku column. The tables span more than 100 GB, and most queries use both tables.

Which distribution pattern should the organization adopt to optimize query speed for the two tables?
An EVEN distribution style for both tables
A KEY distribution style for both tables
An ALL distribution style for the product table and an EVEN distribution style for the transactions table
An EVEN distribution style for the product table and a KEY distribution style for the transactions table




Answer is A KEY distribution style for both tables

Both tables share the product_sku column and are joined in most queries, so using it as the distribution key for both tables collocates matching rows on the same node and avoids redistributing data at query time.

Reference:
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html
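
For illustration only, the sketch below recreates both tables with a KEY distribution style on the shared product_sku column using the Redshift Data API; the cluster identifier, database, secret ARN, and table names are assumptions:

# Recreate the tables with DISTSTYLE KEY on the shared join column so joins
# on product_sku are collocated. Identifiers below are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

ddl_statements = [
    """
    CREATE TABLE products_distkey
    DISTSTYLE KEY
    DISTKEY (product_sku)
    AS SELECT * FROM products;
    """,
    """
    CREATE TABLE transactions_distkey
    DISTSTYLE KEY
    DISTKEY (product_sku)
    AS SELECT * FROM transactions;
    """,
]

for sql in ddl_statements:
    redshift_data.execute_statement(
        ClusterIdentifier="example-cluster",      # placeholder
        Database="analytics",                      # placeholder
        SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:example",  # placeholder
        Sql=sql,
    )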

Question 6

A marketing organization uses Amazon S3 to store campaign response data. Each campaign's data is compiled from a consistent set of sources and uploaded to Amazon S3 as .csv files. A business analyst will examine the data from each campaign using Amazon Athena. The organization wants to reduce the ongoing cost of analyzing this data with Athena.

Which steps should a data analytics professional perform in combination to achieve these requirements? (Select two.)
Convert the .csv files to Apache Parquet.
Convert the .csv files to Apache Avro.
Partition the data by campaign.
Partition the data by source.
Compress the .csv files.




Answers are:
A. Convert the .csv files to Apache Parquet.
C. Partition the data by campaign.


Apache Parquet is a columnar format that is compressed by default, so Athena scans far less data than with row-based .csv files. Partitioning the data by campaign further limits each analysis to the relevant partition, since the analyst examines one campaign at a time.

Reference:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
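
A minimal PySpark sketch of the conversion (for example, run as an AWS Glue job) is shown below; the S3 paths and the campaign column name are assumptions:

# Convert the raw .csv files to Parquet, partitioned by campaign, so Athena
# scans less data per query. Paths and column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("campaign-csv-to-parquet").getOrCreate()

responses = spark.read.option("header", "true").csv("s3://example-campaign-bucket/raw/")

(
    responses
    .write
    .mode("overwrite")
    .partitionBy("campaign")  # lets Athena prune partitions per campaign
    .parquet("s3://example-campaign-bucket/curated/")  # columnar, compressed by default
)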

Question 7

A business has an application that uses the Amazon Kinesis Client Library (KCL) to read records from a Kinesis data stream. After a successful marketing campaign, the application saw a considerable rise in usage. As a result, a data analyst had to split some of the data shards. After the shards were split, the application began intermittently throwing ExpiredIteratorException errors.

What should the data analyst do to resolve this?
Increase the number of threads that process the stream records.
Increase the provisioned read capacity units assigned to the stream's Amazon DynamoDB table.
Increase the provisioned write capacity units assigned to the stream's Amazon DynamoDB table.
Decrease the provisioned write capacity units assigned to the stream's Amazon DynamoDB table.




Answer is Increase the provisioned write capacity units assigned to the stream's Amazon DynamoDB table.

If the shard iterator expires immediately, before you can use it, this might indicate that the DynamoDB table used by Kinesis does not have enough capacity to store the lease data. This situation is more likely to happen if you have a large number of shards. To solve this problem, increase the write capacity assigned to the shard table.

Reference:
https://docs.aws.amazon.com/streams/latest/dev/troubleshooting-consumers.html#shard-iterator-expires-unexpectedly
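
A hedged boto3 sketch of the fix is shown below. The KCL names its lease table after the consumer application; the table name and capacity values here are assumptions:

# Raise the provisioned write capacity of the KCL lease table so lease updates
# keep up after the reshard. Table name and numbers are assumptions.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_table(
    TableName="inventory-consumer-app",  # KCL application name = lease table name
    ProvisionedThroughput={
        "ReadCapacityUnits": 10,
        "WriteCapacityUnits": 50,  # more shards -> more lease writes
    },
)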

Question 8

A company ingests a large set of clickstream data in nested JSON format from different sources and stores it in Amazon S3. Data analysts need to analyze this data in combination with data stored in an Amazon Redshift cluster. Data analysts want to build a cost-effective and automated solution for this need.

Which solution meets these requirements?
Use Apache Spark SQL on Amazon EMR to convert the clickstream data to a tabular format. Use the Amazon Redshift COPY command to load the data into the Amazon Redshift cluster.
Use AWS Lambda to convert the data to a tabular format and write it to Amazon S3. Use the Amazon Redshift COPY command to load the data into the Amazon Redshift cluster.
Use the Relationalize class in an AWS Glue ETL job to transform the data and write the data back to Amazon S3. Use Amazon Redshift Spectrum to create external tables and join with the internal tables.
Use the Amazon Redshift COPY command to move the clickstream data directly into new tables in the Amazon Redshift cluster.




Answer is Use the Relationalize class in an AWS Glue ETL job to transform the data and write the data back to Amazon S3. Use Amazon Redshift Spectrum to create external tables and join with the internal tables.

The Relationalize PySpark transform can be used to flatten the nested data into a structured format. Amazon Redshift Spectrum can join the external tables and query the transformed clickstream data in place rather than needing to scale the cluster to accommodate the large dataset.
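
A minimal sketch of an AWS Glue ETL job using the Relationalize transform is shown below; the Data Catalog database, table, and S3 paths are assumptions:

# Flatten the nested clickstream JSON with Relationalize and write the result
# back to S3 for Redshift Spectrum. Names and paths are assumptions.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Relationalize
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

clickstream = glue_context.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="raw_clickstream"
)

# Relationalize returns a collection of flattened DynamicFrames:
# "root" plus one frame per nested array.
flattened = Relationalize.apply(
    frame=clickstream,
    staging_path="s3://example-clickstream-bucket/tmp/",
    name="root",
    transformation_ctx="flattened",
)

glue_context.write_dynamic_frame.from_options(
    frame=flattened.select("root"),
    connection_type="s3",
    connection_options={"path": "s3://example-clickstream-bucket/flattened/"},
    format="parquet",
)
job.commit()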

Question 9

A business uses Amazon Athena to run ad hoc queries on data stored in Amazon S3. To comply with internal security policies, the organization wants to add controls that isolate query execution and query history among the individuals, teams, and applications operating in the same AWS account.

Which solution satisfies these criteria?
Create an S3 bucket for each given use case, create an S3 bucket policy that grants permissions to the appropriate individual IAM users, and apply the S3 bucket policy to the S3 bucket.
Create an Athena workgroup for each given use case, apply tags to the workgroup, and create an IAM policy using the tags to apply appropriate permissions to the workgroup.
Create an IAM role for each given use case, assign appropriate permissions to the role for the given use case, and add the role to associate the role with Athena.
Create an AWS Glue Data Catalog resource policy for each given use case that grants permissions to appropriate individual IAM users, and apply the resource policy to the specific tables used by Athena.




Answer is Create an Athena workgroup for each given use case, apply tags to the workgroup, and create an IAM policy using the tags to apply appropriate permissions to the workgroup.

Amazon Athena workgroups are a resource type designed to separate query execution and query history between users, teams, or applications running under the same AWS account. Tagging each workgroup and writing IAM policies conditioned on those tags enforces the required isolation.

Reference:
https://aws.amazon.com/about-aws/whats-new/2019/02/athena_workgroups/
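
A hedged sketch of creating a tagged workgroup with boto3 is shown below; the workgroup name, tag values, and results location are assumptions, and the accompanying IAM policy would condition access on the tag:

# Create a tagged Athena workgroup per team; IAM policies can then allow
# workgroup actions only where the tag matches. Names are assumptions.
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="marketing-team",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://example-athena-results/marketing-team/"
        }
    },
    Tags=[{"Key": "team", "Value": "marketing"}],
)

# Example IAM policy condition matching the tag (illustrative):
#   "Condition": {"StringEquals": {"aws:ResourceTag/team": "marketing"}}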
