AWS Certified Data Analytics Specialty Practice Test 8
Question 1 of 65
An oil and gas company plans to monitor the pressure in its pipelines using gas pressure sensors. To monitor the data in real time, sensors are placed along the pipelines. The system will send an alert to open a valve when an anomaly is detected. A Data Analyst has been tasked to set up an Amazon Kinesis data stream to collect the sensor data and an AWS Lambda function to open the valve.
Which of the following options would be the most cost-effective solution for anomaly detection?
Amazon Kinesis Data Analytics is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real-time. Amazon Kinesis Data Analytics reduces the complexity of building, managing, and integrating streaming applications with other AWS services. You can quickly build SQL queries and sophisticated Apache Flink applications in a supported language such as Java or Scala using built-in templates and operators for common processing functions to organize, transform, aggregate, and analyze data at any scale.
To detect the anomalies in your data stream, you can use the RANDOM_CUT_FOREST function. RCF is an unsupervised algorithm for detecting anomalous data points within a data set. With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered normal, while high values indicate the presence of an anomaly in the data.
Hence, the correct answer is: Use the RANDOM_CUT_FOREST function in the Kinesis Data Analytics application to detect the anomaly and send an alert to open the valve.
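To make the alerting step concrete, here is a minimal Python sketch of the Lambda side, assuming the Kinesis Data Analytics application delivers its RANDOM_CUT_FOREST output to the function as base64-encoded JSON records and that the alert goes out through an SNS topic; the column names, topic, and threshold are illustrative assumptions, not values from the scenario.

```python
import base64
import json
import os

import boto3

sns = boto3.client("sns")
ALERT_TOPIC_ARN = os.environ["ALERT_TOPIC_ARN"]  # hypothetical topic set in the function config
ANOMALY_THRESHOLD = 3.0  # illustrative cut-off; tune against observed ANOMALY_SCORE values


def lambda_handler(event, context):
    """Lambda destination for the Kinesis Data Analytics output stream."""
    statuses = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # ANOMALY_SCORE is the column produced by RANDOM_CUT_FOREST in the SQL application
        if float(payload.get("ANOMALY_SCORE", 0.0)) > ANOMALY_THRESHOLD:
            sns.publish(
                TopicArn=ALERT_TOPIC_ARN,
                Subject="Pipeline pressure anomaly detected",
                Message=json.dumps(payload),
            )
        statuses.append({"recordId": record["recordId"], "result": "Ok"})
    # Kinesis Data Analytics expects a delivery status for every record it sent
    return {"records": statuses}
```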
The option that says: Use Amazon Kinesis Data Firehose and choose Amazon S3 as the storage of the sensor data. Create a Lambda function to schedule the query of Amazon Athena in the S3 bucket. The Lambda function will send an alert to open the valve if an anomaly is discovered is incorrect. This solution won't be able to monitor the data and detect anomalies in real time. You also have to invest time and effort to implement a custom algorithm in Lambda to detect anomalies from the data stored in Amazon S3, which can entail additional cost.
The option that says: Launch a Spark Streaming application in an Amazon EMR cluster to connect to Amazon Kinesis streams and Spark machine learning to identify the anomaly. The Spark application will send an alert to open the valve if an anomaly is discovered is incorrect. An Apache Spark Streaming application doesn't have a built-in capability to detect anomalies, unlike the RCF machine learning query that is readily available in Amazon Kinesis Data Analytics.
The option that says: Provision an Amazon EC2 Fleet with KCL application to consume the stream and aggregate the data collected by the sensors to detect the anomaly. The application will send an alert to open the valve if an anomaly is discovered is incorrect. An EC2 Fleet is a group of On-Demand and Spot EC2 instances. Note that it is stated in the scenario that you should create a cost-effective solution. It also takes a lot of time to develop an algorithm that can detect anomalies.
References:
https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sqlrf-random-cut-forest.html
https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html
Question 2 of 65
An energy company is constructing two weather stations to collect and record temperature data from a solar farm.
Weather station A has 20 sensors, each with a unique ID.
Weather station B has 10 sensors, each with a unique ID.
Onsite engineers strategically determined the placement and orientation of the weather stations. The company plans to use Amazon Kinesis Data Streams to collect data from each sensor.
A single Kinesis data stream with two shards was created based on the total data throughput gathered from initial testing. The partition keys were created based on the two station names. A bottleneck on data coming from Station A was spotted during the dry run; however, there were no problems with Station B. Upon checking, it was observed that the currently allocated throughput of the Kinesis data stream is still greater than the total stream throughput of the sensor data.
Which solution will resolve the bottleneck issue without increasing the overall cost?
A partition key is used to segregate and route records to different shards of a data stream. A partition key is specified by your data producer while adding data to an Amazon Kinesis data stream. For example, assume you have a data stream with two shards (shard 1 and shard 2). You can configure your data producer to use two partition keys (key A and key B) so that all records with key A are added to shard 1 and all records with key B are added to shard 2.
If your use cases do not require data stored in a shard to have high affinity, you can achieve high overall throughput by using a random partition key to distribute data. Random partition keys help distribute the incoming data records evenly across all the shards in the stream and reduce the likelihood of one or more shards getting hit with a disproportionate number of records. You can use a universally unique identifier (UUID) as a partition key to achieve this uniform distribution of records across shards. This strategy can increase the latency of record processing if the consumer application has to aggregate data from multiple shards.
The most common reasons for write throughput being slower than expected are as follows:
Service Limits Exceeded
Producer Optimization
If these calls aren't the issue, make sure you've selected a partition key that allows you to distribute put operations evenly across all shards, and that you don't have a particular partition key that's bumping into the service limits when the rest are not. This requires that you measure peak throughput and take into account the number of shards in your stream.
Hence, the correct answer is: Assign the sensor ID as the partition key instead of the station name.
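For illustration, a producer-side sketch in Python (boto3) of what that change amounts to is shown below; the stream name, sensor ID format, and payload fields are assumptions, not values from the scenario.

```python
import json

import boto3

kinesis = boto3.client("kinesis")


def send_reading(stream_name: str, sensor_id: str, temperature_c: float) -> None:
    """Publish one temperature reading, keyed by sensor ID rather than station name."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps({"sensor_id": sensor_id, "temperature_c": temperature_c}).encode("utf-8"),
        # 30 distinct sensor IDs give the hash function enough key space to spread
        # records across both shards, instead of pinning all of Station A to one shard.
        PartitionKey=sensor_id,
    )


# e.g. Station A sensors might be identified as "A-01" .. "A-20":
# send_reading("weather-stream", "A-07", 41.3)
```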
The option that says: Provision a different Kinesis data stream with two shards to stream sensor data coming from Station A is incorrect because provisioning a different Kinesis data stream for Station A is not needed since a single data stream can collect data from multiple resources.
The option that says: Decrease the number of sensors in Station A from 20 to 10 sensors is incorrect because decreasing the number of sensors might adversely affect the weather data collection. Moreover, this course of action is unwarranted since it is clearly mentioned in the scenario that the allocated Data Streams throughput is still more than the total stream throughput. Therefore, the number of sensors is not the culprit of this bottleneck issue.
The option that says: Increase the level of parallelism for greater throughput by increasing the number of shards in Amazon Kinesis Data Streams is incorrect because increasing the number of shards will increase the overall cost of the solution. This is also unnecessary since the scenario mentioned that the allocated Data Streams throughput is still more than the total stream throughput.
References:
https://docs.aws.amazon.com/streams/latest/dev/troubleshooting-producers.html
https://aws.amazon.com/blogs/big-data/under-the-hood-scaling-your-kinesis-data-streams/
https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html
Question 3 of 65
A digital marketing company uses Amazon DynamoDB and highly available Amazon EC2 instances for one of its solutions. Its application logs are pushed to Amazon CloudWatch Logs. The team of data analysts wants to enrich these logs with data from DynamoDB in near-real-time and use the output for further study.
Which among these steps will enable collection and enrichment based on the requirements stated above?
Amazon Kinesis Data Firehose captures, transforms, and loads streaming data from sources such as a Kinesis data stream, the Kinesis Agent, or Amazon CloudWatch Logs into downstream services such as Kinesis Data Analytics or Amazon S3. You can write Lambda functions to request additional, customized processing of the data before it is sent downstream. AWS Lambda can perform data enrichment, such as looking up data from a DynamoDB table, and then produce the enriched data onto another stream. Lambda is commonly used for preprocessing data ahead of the analytics application to handle more complicated data formats.
There are blueprints that you can use to create a Lambda function for data transformation. It includes a blueprint that reads CloudWatch Logs. Data sent from CloudWatch Logs to Amazon Kinesis Data Firehose is already compressed with gzip level 6 compression, so you do not need to use compression within your Kinesis Data Firehose delivery stream.
Thus, the correct answer is: Write an AWS Lambda function that will enrich the logs with the DynamoDB data. Create an Amazon Kinesis Data Firehose delivery stream, configure it to subscribe to Amazon CloudWatch Logs, and set an Amazon S3 bucket as its destination. Create a CloudWatch Logs subscription that sends log events to your delivery stream.
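As a rough sketch of the transformation step, the Lambda function below decompresses the CloudWatch Logs payload that Firehose hands it, enriches each log event with a DynamoDB lookup, and returns the records in the format Firehose expects. The table name, lookup key, and enrichment attribute are hypothetical, and the sketch assumes the application writes JSON log messages.

```python
import base64
import gzip
import json
import os

import boto3

# Hypothetical DynamoDB table holding the data used to enrich each log event
table = boto3.resource("dynamodb").Table(os.environ.get("ENRICHMENT_TABLE", "app-metadata"))


def lambda_handler(event, context):
    """Kinesis Data Firehose transformation: unzip CloudWatch Logs records and enrich them."""
    output = []
    for record in event["records"]:
        # CloudWatch Logs delivers its payload gzip-compressed and base64-encoded
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))

        if payload.get("messageType") != "DATA_MESSAGE":
            # Subscription control messages carry no log events
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue

        enriched_events = []
        for log_event in payload["logEvents"]:
            message = json.loads(log_event["message"])  # assumes JSON-formatted app logs
            item = table.get_item(Key={"customer_id": message["customer_id"]}).get("Item", {})
            message["customer_segment"] = item.get("segment")  # hypothetical attribute
            enriched_events.append(message)

        data = "\n".join(json.dumps(e) for e in enriched_events) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```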
The option that says: Export the EC2 application logs to Amazon S3 on an hourly basis using AWS CLI. Use AWS Glue crawlers to catalog the logs. Configure an AWS Glue connection to the DynamoDB table and an AWS Glue ETL job to enrich the data. Store the enriched data in an Amazon S3 bucket is incorrect. It does not fulfill the near real-time analysis requirement since the data is only exported on an hourly basis.
The option that says: Write an AWS Lambda function that will export the EC2 application logs to Amazon S3 on an hourly basis. Use Apache Spark SQL on Amazon EMR to read the logs from Amazon S3 and enrich the records with the data from DynamoDB. Store the enriched data in an Amazon S3 bucket is incorrect. This does not fulfill the near real-time analysis requirement. From a cost perspective, it is also better to avoid Amazon EMR as it entails additional costs to run its underlying EC2 instances.
The option that says: Install Amazon Kinesis Agent on the EC2 instance. Configure the application to write the logs in a local filesystem and configure Amazon Kinesis Agent to send the data to Amazon Kinesis Data Streams. Configure a Kinesis Data Analytics SQL application with the Kinesis data stream as the source and enrich it with data from the DynamoDB table. Store the enriched output stream in an Amazon S3 bucket using Amazon Kinesis Data Firehose is incorrect. Installing the Kinesis Agent on the EC2 instances is unwarranted since there is already a CloudWatch Logs integration that can deliver the logs. Creating a Kinesis Data Analytics SQL application is also unnecessary and quite costly.
References:
https://docs.aws.amazon.com/firehose/latest/dev/writing-with-cloudwatch-logs.html
https://aws.amazon.com/blogs/big-data/joining-and-enriching-streaming-data-on-amazon-kinesis/
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html
Question 4 of 65
A company has multiple data analytics teams that run their own Amazon EMR clusters. The teams have their own metadata for running different SQL queries using Hive. A centralized metadata layer must be created that exposes S3 objects as tables that can be used by all teams.
What should be done to fulfill this requirement?
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Amazon EMR is ideal for problems that necessitate the fast and efficient processing of large amounts of data.
By default, Hive records metastore information in a MySQL database on the master node's file system. The metastore contains a description of the table and the underlying data on which it is built, including the partition names, data types, and so on. When a cluster terminates, all cluster nodes shut down, including the master node. When this happens, local data is lost because node file systems use ephemeral storage. If you need the metastore to persist, you must create an external metastore that exists outside the cluster. You have two options for an external metastore: the AWS Glue Data Catalog (Amazon EMR version 5.8.0 or later only), or Amazon RDS / Amazon Aurora.
In this scenario, each team's metadata can be externalized in a central metastore through either the AWS Glue Data Catalog or Amazon RDS/Amazon Aurora. When using Amazon RDS, take note that if you share metastore information between two clusters, you must ensure that you do not write to the same metastore table concurrently, unless you are writing to different partitions of the same metastore table.
Hence, the correct answer is: Configure an external metastore for Hive.
The option that says: Alter table recover partitions is incorrect because this option only allows you to import tables concurrently into many clusters. It won't provide read-after-write consistency tracking for objects in Amazon S3, nor will it create a centralized metadata layer.
The option that says: Enable EMRFS consistent view is incorrect because this feature just tracks the consistency of S3 objects. EMRFS consistent view works by using a DynamoDB table to track objects in Amazon S3 that have been synced with or created by EMRFS. The metadata is just used to track all operations (read, write, update, and copy), and no actual content is stored in it.
The option that says: Use Amazon EMR Notebooks is incorrect because this is simply a serverless notebook that you can use to run queries and code.
References:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-metastore-external-hive.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-external.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
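A hedged example of the Glue Data Catalog route, sketched with boto3: the hive-site classification is the part that points Hive on each cluster at the shared catalog, while the cluster name, release label, roles, and instance sizing below are illustrative assumptions.

```python
import boto3

emr = boto3.client("emr")

# Hypothetical cluster settings; the hive-site classification makes Hive use the
# AWS Glue Data Catalog as its metastore instead of a local MySQL database.
response = emr.run_job_flow(
    Name="analytics-team-a",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": (
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory"
                )
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```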
Incorrect
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Amazon EMR is ideal for problems that necessitate the fast and efficient processing of large amounts of data. By default, Hive records metastore information in a MySQL database on the master nodes file system. The metastore contains a description of the table and the underlying data on which it is built, including the partition names, data types, and so on. When a cluster terminates, all cluster nodes shut down, including the master node. When this happens, local data is lost because node file systems use ephemeral storage. If you need the metastore to persist, you must create an external metastore that exists outside the cluster. You have two options for an external metastore: AWS Glue Data Catalog (Amazon EMR version 5.8.0 or later only). Amazon RDS or Amazon Aurora. In this scenario, each of the teams metadata can be externalized in a central metastore through either AWS Glue Data Catalog or Amazon RDS/Amazon Aurora. When using Amazon RDS, take note that if you share metastore information between two clusters, you must ensure that you do not write to the same metastore table concurrently, unless you are writing to different partitions of the same metastore table. Hence, the correct answer is: Configure an external metastore for Hive. The option that says: Alter table recover partitions is incorrect because this option only allows you to import tables concurrently into many clusters. It wont help you track read-after-write consistency for objects in Amazon S3 and create a centralized metadata layer. The option that says: Enable EMRFS consistent view is incorrect because this feature just tracks the consistency of S3 objects. EMRFS consistent view works by using a DynamoDB table to track objects in Amazon S3 that has been synced with or created by EMRFS. The metadata is just used to track all operations (read, write, update, and copy), and no actual content is stored in it. The option that says: Use Amazon EMR Notebooks is incorrect because this is simply a serverless notebook that you can use to run queries and code. References: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-metastore-external-hive.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-external.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Unattempted
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Amazon EMR is ideal for problems that necessitate the fast and efficient processing of large amounts of data. By default, Hive records metastore information in a MySQL database on the master nodes file system. The metastore contains a description of the table and the underlying data on which it is built, including the partition names, data types, and so on. When a cluster terminates, all cluster nodes shut down, including the master node. When this happens, local data is lost because node file systems use ephemeral storage. If you need the metastore to persist, you must create an external metastore that exists outside the cluster. You have two options for an external metastore: AWS Glue Data Catalog (Amazon EMR version 5.8.0 or later only). Amazon RDS or Amazon Aurora. In this scenario, each of the teams metadata can be externalized in a central metastore through either AWS Glue Data Catalog or Amazon RDS/Amazon Aurora. When using Amazon RDS, take note that if you share metastore information between two clusters, you must ensure that you do not write to the same metastore table concurrently, unless you are writing to different partitions of the same metastore table. Hence, the correct answer is: Configure an external metastore for Hive. The option that says: Alter table recover partitions is incorrect because this option only allows you to import tables concurrently into many clusters. It wont help you track read-after-write consistency for objects in Amazon S3 and create a centralized metadata layer. The option that says: Enable EMRFS consistent view is incorrect because this feature just tracks the consistency of S3 objects. EMRFS consistent view works by using a DynamoDB table to track objects in Amazon S3 that has been synced with or created by EMRFS. The metadata is just used to track all operations (read, write, update, and copy), and no actual content is stored in it. The option that says: Use Amazon EMR Notebooks is incorrect because this is simply a serverless notebook that you can use to run queries and code. References: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-metastore-external-hive.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-external.html https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Question 5 of 65
A company wants to capture users' viewing patterns on a website by sending clickstream data to an Amazon Kinesis data stream. The clickstream data is further analyzed in Amazon Kinesis Data Analytics using windowed queries.
Given that data arrives at inconsistent intervals, a Data Analyst must aggregate the data so that related records fall into the same time-restricted window.
Which type of windowed query should be used?
Amazon Kinesis Data Analytics is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real-time. Amazon Kinesis Data Analytics reduces the complexity of building, managing, and integrating streaming applications with other AWS services. You can quickly build SQL queries and sophisticated Apache Flink applications in a supported language such as Java or Scala using built-in templates and operators for common processing functions to organize, transform, aggregate, and analyze data at any scale.
SQL queries in your application code execute continuously over in-application streams. An in-application stream represents unbounded data that flows continuously through your application. Therefore, to get result sets from this continuously updating input, you often bound queries using a window defined in terms of time or rows. These are also called windowed SQL.
You can specify a query to process records in a tumbling window, sliding window, or stagger window manner, depending on your application needs. Kinesis Data Analytics supports the following window types:
Stagger Windows: A query that aggregates data using keyed time-based windows that open as data arrives. The keys allow for multiple overlapping windows. It is suited for analyzing groups of data that arrive at inconsistent times. This is the recommended way to aggregate data using time-based windows, because Stagger Windows reduce late or out-of-order data compared to Tumbling windows.
Tumbling Windows: A query that aggregates data using distinct time-based windows that open and close at regular intervals.
Sliding Windows: A query that aggregates data continuously, using a fixed time or rowcount interval.
Hence, the correct answer is Stagger Windows.
Tumbling Windows is incorrect because this is more suitable for data that arrives at regular intervals. If tumbling windows are used to analyze groups of time-related data, the individual records might fall into separate windows, and the partial results from each window must then be combined later to yield complete results for each group of records.
Stream Joins is incorrect because it is not a type of windowed query. This is a streaming data operation used for correlating data arriving on multiple in-application streams.
Sliding Windows is incorrect. One of the requirements is that related data belonging to the same time-restricted window must be aggregated. Sliding windowed query is not suitable for the scenario since windows can overlap in this type of processing, and a record can be part of multiple windows and be processed with each window.
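To illustrate the difference outside of SQL, here is a small, self-contained Python sketch; it is a conceptual toy, not Kinesis Data Analytics code, and the timestamps, key, and 60-second window length are assumptions. A group of related records that straddles a minute boundary lands in two tumbling windows but in a single stagger-style window that opens when the group's first record arrives.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

# Related events for one partition key arriving around a minute boundary.
events = [
    {"key": "sensor-1", "ts": 55},   # seconds since some epoch
    {"key": "sensor-1", "ts": 61},
    {"key": "sensor-1", "ts": 70},
]

# Tumbling windows: buckets are fixed to the clock, so this related group is split
# across the [0, 60) and [60, 120) windows and would need to be re-combined later.
tumbling = defaultdict(list)
for e in events:
    tumbling[(e["key"], e["ts"] // WINDOW_SECONDS)].append(e)
print("tumbling windows used:", len(tumbling))   # 2

# Stagger-style windows: a keyed window opens when the first matching event arrives
# and stays open for WINDOW_SECONDS, so the whole group lands in one result row.
stagger = {}
for e in events:
    opened = stagger.setdefault(e["key"], {"start": e["ts"], "events": []})
    if e["ts"] - opened["start"] < WINDOW_SECONDS:
        opened["events"].append(e)
print("stagger windows used:", len(stagger))     # 1
```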
References:
https://aws.amazon.com/kinesis/data-analytics/faqs/
https://docs.aws.amazon.com/kinesisanalytics/latest/dev/stagger-window-concepts.html
Question 6 of 65
A company has a cross-platform application running across a fleet of Amazon EC2 instances. The company wants to offer better customer service by building a centralized logging system that will collect application logs into a service that provides a near-real-time search engine. The system is designed to quickly detect application issues for a faster mean time to recovery. The company does not want to worry about management and maintenance operations, such as hardware provisioning and software patching.
How can the company efficiently set up a data logging system within AWS?
You can use subscriptions to get access to a real-time feed of log events from CloudWatch Logs and have it delivered to other services such as an Amazon Kinesis stream, an Amazon Kinesis Data Firehose stream, or AWS Lambda for custom processing, analysis, or loading to other systems. When log events are sent to the receiving service, they are Base64 encoded and compressed with the gzip format.
To begin subscribing to log events, create the receiving resource, such as a Kinesis stream, where the events will be delivered. A subscription filter defines the filter pattern to use for filtering which log events get delivered to your AWS resource, as well as information about where to send matching log events to.
Hence, the correct answer is: Stream the data logs to Amazon CloudWatch Logs using the CloudWatch agent. Use CloudWatch Logs Subscription Filters to direct the logs to a Kinesis Data Firehose delivery stream and send the output to Amazon Elasticsearch Service.
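A minimal sketch of the wiring step with boto3 is shown below. It assumes the Firehose delivery stream already exists with the Amazon Elasticsearch Service domain as its destination, and that an IAM role allows CloudWatch Logs to write to the delivery stream; all names and ARNs are placeholders.

```python
import boto3

logs = boto3.client("logs")

# Placeholder names/ARNs: the delivery stream is assumed to already point at the
# Amazon Elasticsearch Service (OpenSearch) domain as its destination.
logs.put_subscription_filter(
    logGroupName="/app/centralized-logs",
    filterName="ship-to-firehose",
    filterPattern="",  # an empty pattern forwards every log event
    destinationArn="arn:aws:firehose:us-east-1:111122223333:deliverystream/app-logs",
    roleArn="arn:aws:iam::111122223333:role/CWLtoFirehoseRole",
)
```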
The option that says: Stream the data logs to Amazon CloudWatch Logs using the CloudWatch agent. Use CloudWatch Logs Subscription Filters to direct the logs to a Kinesis data stream and send the output to Amazon Elasticsearch Service is incorrect because you cannot directly stream data from an Amazon Kinesis data stream to Amazon Elasticsearch Service. You have to use an Amazon Kinesis Data Firehose delivery stream.
The option that says: Stream the data logs to Amazon CloudWatch Logs using the CloudWatch agent. Use CloudWatch Logs Subscription Filters to direct the logs to a Kinesis Data Firehose delivery stream and send the output to Splunk is incorrect because Splunk is a proprietary logging system that you have to set up and manage. It doesn't comply with the requirement that the company should not manage the maintenance and operation of the logging system.
The option that says: Stream the data logs to Amazon CloudWatch Logs using the CloudWatch agent. Use CloudWatch Logs Subscription Filters to direct the logs to a Kinesis Data Firehose delivery stream and send the output to Amazon DynamoDB is incorrect because Kinesis Data Firehose cannot directly set DynamoDB as the end destination. Moreover, Amazon DynamoDB can't be used to create a near-real-time search engine.
References:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html
https://aws.amazon.com/blogs/database/send-apache-web-logs-to-amazon-elasticsearch-service-with-kinesis-firehose/
Question 7 of 65
A media company uses Amazon Kinesis Data Streams to ingest massive volumes of real-time data every day. Lately, the application manager noticed that this process has slowed down significantly. Upon investigation, a Data Analyst discovered that Kinesis is throttling the write requests, and the write performance is significantly reduced. The application manager wants a quick fix without performing significant changes to the architecture.
Which actions should the Data Analyst do to resolve this issue quickly? (Select TWO.)
Amazon Kinesis offers a managed service that lets you focus on building your applications, rather than managing infrastructure. Scalability is provided out-of-the-box, allowing you to ingest and process gigabytes of streaming data per second. It uses a provisioned capacity model. Each data stream is composed of one or more shards that act as units of capacity. Shards make it easy for you to design and scale a streaming pipeline by providing a predefined write and read capacity. As workloads grow, an application may read or write to a shard at a rate that exceeds its capacity, which leads to throttling errors in the stream and overworked shards, also known as hot shards, and requires you to add capacity quickly.
It is important to spend some time considering the expected data flow rate for your stream to find the appropriate capacity. Choosing a good partition key strategy helps you take full advantage of the provisioned capacity and avoid hot shards. Monitoring your stream metrics and setting alarm thresholds helps you gain the visibility you need to make better scaling decisions. To update the shard count, Kinesis Data Streams performs splits or merges on individual shards. It can cause short-lived shards to be created, in addition to the final shards.
The UpdateShardCount action is useful when you need to scale your stream, up or down, to a specific number of shards, and you provide that number as a parameter in the API call. The command executes a series of SplitShard and MergeShards actions as needed to reach the explicit number of shards that you specified.
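For illustration, here is a minimal sketch of both fixes, assuming a hypothetical stream name and target shard count: scaling out with UpdateShardCount, and writing each record with a random partition key so the hash key space is spread evenly across shards:

import json
import uuid
import boto3

kinesis = boto3.client("kinesis")

# Scale the stream to an explicit number of shards (hypothetical value).
kinesis.update_shard_count(
    StreamName="ingest-stream",
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",
)

# Use a random partition key so writes are distributed evenly across shards.
record = {"event": "page_view", "user_id": 42}
kinesis.put_record(
    StreamName="ingest-stream",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=str(uuid.uuid4()),
)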
Hence, the correct answers are:
Use random partition keys and adjust accordingly to distribute the hash key space evenly across shards.
Use the UpdateShardCount API in Amazon Kinesis to increase the number of shards in the data stream.
The option that says: Reduce throttling by increasing the retention period of the data stream is incorrect because keeping data inside the stream for a longer period does not resolve the write performance issue. A longer retention period helps when consumers have trouble reading records before they age out of the stream; it may even indirectly lead to worse write performance.
The option that says: Disable Enhanced Kinesis stream monitoring is incorrect. Enabling monitoring does not slow down write performance. It just helps in identifying hot shards in your data stream.
The option that says: Use an error retry and exponential backoff mechanism in the consumer logic is incorrect. This only helps resolve ReadProvisionedThroughputExceeded errors, which relate to read performance issues, not write performance problems.
References: https://aws.amazon.com/premiumsupport/knowledge-center/kinesis-data-stream-throttling/ https://aws.amazon.com/blogs/big-data/under-the-hood-scaling-your-kinesis-data-streams/ https://docs.aws.amazon.com/kinesis/latest/APIReference/API_UpdateShardCount.html
Question 8 of 65
8. Question
A digital banking startup plans to upgrade its online client banking system's recommendations feature and requires real-time data collection and analysis. It is estimated that each data record will be approximately 25 KB in size. Worried about the system's performance, the product owner wants a solution that can achieve optimal throughput from each user device and enable consumers to receive these records from the stream with dedicated throughput.
How can the startup meet these requirements?
Correct
Amazon Kinesis Data Streams can continuously capture gigabytes of data per second from hundreds of thousands of sources. The data collected is available in milliseconds, enabling real-time analytics. To provide this massively scalable throughput, Kinesis Data Streams relies on shards, which are units of throughput and represent parallelism. One shard provides an ingest throughput of 1 MB/second or 1,000 records/second. A shard also has an outbound throughput of 2 MB/second. As you ingest more data, Kinesis Data Streams can add more shards. Customers often use thousands of shards in a single stream.
Enhanced fan-out is an Amazon Kinesis Data Streams feature that enables consumers to receive records from a data stream with a dedicated throughput of up to 2 MB of data per second per shard. A consumer that uses enhanced fan-out doesn't have to contend with other consumers receiving data from the stream. If you use version 2.0 or later of the Amazon Kinesis Client Library (KCL) to build a consumer, the KCL sets up the consumer to use enhanced fan-out to receive data from all of the stream's shards. If you use the API to build a consumer that uses enhanced fan-out, you can subscribe to individual shards.
Before the adoption of enhanced fan-out technology, users consumed data from a Kinesis data stream with multiple AWS Lambda functions sharing the same 2 MB/second outbound throughput. Due to shared bandwidth constraints, no more than two or three functions could efficiently connect to the data stream at a time.
Due to the practical limitation of two to three applications per stream, you must have at least two streams to support five individual applications. You could attach three applications to the first stream and two applications to the second. However, diverting data into two separate streams adds complexity.
There are two different operations in the Kinesis Data Streams API that add data to a stream, PutRecords and PutRecord. The PutRecords operation sends multiple records to your stream per HTTP request, and the singular PutRecord operation sends records to your stream one at a time (a separate HTTP request is required for each record). You should prefer using PutRecords for most applications because it will achieve a higher throughput per data producer.
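As a rough sketch of both sides of this design (the stream name, ARN, and consumer name are hypothetical), the producer batches writes with PutRecords, and a consumer is registered for enhanced fan-out so it receives its own dedicated 2 MB/second per shard:

import json
import uuid
import boto3

kinesis = boto3.client("kinesis")

# Producer side: batch multiple records into a single PutRecords call.
records = [
    {
        "Data": json.dumps({"user_id": i, "event": "click"}).encode("utf-8"),
        "PartitionKey": str(uuid.uuid4()),
    }
    for i in range(100)
]
kinesis.put_records(StreamName="banking-activity", Records=records)

# Consumer side: register an enhanced fan-out consumer (dedicated throughput).
kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:111122223333:stream/banking-activity",
    ConsumerName="recommendations-service",
)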
Hence, the correct answer is: Configure Amazon Kinesis Data Streams and the banking system to use the PutRecords API to send data to the stream. Register consumers with the enhanced fan-out feature.
The option that says: Configure Amazon Kinesis Data Streams and the banking system to use the PutRecords API to send data to the stream. Develop the stream processing application in an Auto Scaling group of Amazon EC2 instances is incorrect. An Auto Scaling group of EC2 instances will not provide dedicated throughput for the consumers. You have to enable the enhanced fan-out feature in Amazon Kinesis Data Streams.
The option that says: Configure Amazon Kinesis Data Firehose and the banking system to use the PutRecordBatch API to send data to the stream. Register consumers with the enhanced fan-out feature is incorrect. The enhanced fan-out feature is only available with Amazon Kinesis Data Streams. The PutRecordBatch API writes multiple data records into a delivery stream in a single call, which can achieve higher throughput per producer than when writing single records.
The option that says: Configure Amazon Kinesis Data Streams and the banking system to use the PutRecordBatch API to send data to the stream. Raise a support case in AWS to allow consumers with dedicated throughput is incorrect. You don't have to submit a support case to enable dedicated throughput for Kinesis Data Streams. You just have to use the enhanced fan-out feature in Amazon Kinesis Data Streams. In addition, the PutRecordBatch API is only available in Kinesis Data Firehose and not in Kinesis Data Streams.
References: https://docs.aws.amazon.com/streams/latest/dev/enhanced-consumers.html https://docs.aws.amazon.com/streams/latest/dev/shared-throughput-kcl-consumers.html https://aws.amazon.com/blogs/compute/increasing-real-time-stream-processing-performance-with-amazon-kinesis-data-streams-enhanced-fan-out-and-aws-lambda/
Question 9 of 65
9. Question
A team of data scientists is using Amazon Athena to run ad-hoc queries against data stored on Amazon S3. The data is stored in different S3 buckets in the ap-southeast-1 and ap-northeast-2 regions. The team wants a solution that will allow them to use Athena in the ap-southeast-1 region to query the data from both regions.
How can the team solve the problem while keeping the cost as low as possible?
Correct
AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue crawlers automatically infer database and table schema from your dataset, storing the associated metadata in the AWS Glue Data Catalog.
Athena natively supports querying datasets and data sources that are registered with the AWS Glue Data Catalog. When you run Data Manipulation Language (DML) queries in Athena with the Data Catalog as your source, you are using the Data Catalog schema to derive insight from the underlying dataset. When you run Data Definition Language (DDL) queries, the schemas you define are created in the AWS Glue Data Catalog. From within Athena, you can also run an AWS Glue crawler on a data source to create schemas in the AWS Glue Data Catalog.
AWS Glue can crawl data in different AWS Regions. When you define an Amazon S3 data store to crawl, you can choose whether to crawl a path in your account or another account.
The output of the crawler is one or more metadata tables defined in the AWS Glue Data Catalog. A table is created for one or more files found in your data store. If all the Amazon S3 files in a folder have the same schema, the crawler creates one table. Also, if the Amazon S3 object is partitioned, only one metadata table is created.
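A minimal sketch of that setup, assuming hypothetical crawler, role, database, and bucket names, run from ap-southeast-1 and pointed at buckets that live in both Regions:

import boto3

glue = boto3.client("glue", region_name="ap-southeast-1")

# One crawler can catalog S3 paths from buckets located in different Regions,
# since bucket names are global and are referenced by path.
glue.create_crawler(
    Name="research-data-crawler",
    Role="AWSGlueServiceRole-research",
    DatabaseName="research_db",
    Targets={
        "S3Targets": [
            {"Path": "s3://datasets-ap-southeast-1/raw/"},
            {"Path": "s3://datasets-ap-northeast-2/raw/"},
        ]
    },
)
glue.start_crawler(Name="research-data-crawler")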
Hence, the correct answer is: Run the AWS Glue crawler in ap-southeast-1 to catalog datasets in all Regions. Execute the Athena queries in ap-southeast-1 once the data has been crawled.
The option that says: Use the AWS Database Migration Service to migrate the AWS Glue Data Catalog from ap-southeast-1 to ap-northeast-2. Execute the Athena queries in ap-southeast-1 is incorrect because you can't use AWS DMS with the AWS Glue Data Catalog.
The option that says: Enable the Cross-Region Replication (CRR) feature for the S3 buckets in ap-northeast-2 to replicate data in ap-northeast-2. Execute the AWS Glue crawler there to update the AWS Glue Data Catalog in ap-southeast-1 then run Athena queries is incorrect because replicating the data in S3 means that your storage costs will also double.
The option that says: Modify the AWS Glue resource policies to grant the ap-southeast-1 AWS Glue Data Catalog access to ap-northeast-2. Verify that the catalog in ap-southeast-1 has access to the catalog in ap-northeast-2 then execute the Athena queries in ap-southeast-1 is incorrect because a resource-based policy is primarily used to provide IAM users and roles granular access to metadata definitions of databases, tables, connections, and user-defined functions, and not the actual S3 data.
References: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html https://aws.amazon.com/blogs/big-data/create-cross-account-and-cross-region-aws-glue-connections/
Question 10 of 65
10. Question
A sports technology company plans to build the latest version of its kneepads, which can collect data from the athletes wearing them. The product owner is looking to develop them with wearable medical sensors to ingest near-real-time data securely at scale and store it in durable storage. Furthermore, the solution should only collect non-confidential information from the streaming data and exclude data classified as sensitive.
Which solution achieves these requirements with the least operational overhead?
Correct
Amazon Kinesis Data Streams enables you to build custom applications that process or analyze streaming data for specialized needs. It is useful for rapidly moving data off data producers and then continuously processing the data, whether used to transform the data before emitting to a data store, run real-time metrics and analytics, or derive more complex data streams for further processing.
Within seconds, the data will be available for your Amazon Kinesis applications to read and process from the stream. The throughput of an Amazon Kinesis data stream is determined by the number of shards within the data stream. You can use the UpdateShardCount API or the AWS Management Console to scale the number of shards in a data stream and thereby adjust its throughput.
Meanwhile, Amazon Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk, enabling near-real-time analytics with the existing business intelligence tools and dashboards you're already using today. It is a fully managed service that automatically scales to match your data's throughput and requires no ongoing administration. It can also batch, compress, and encrypt the data before loading it, minimizing the storage used at the destination and increasing security. With the Firehose data transformation feature, you can specify a Lambda function that performs transformations directly on the stream when you create a delivery stream.
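A minimal sketch of such a transformation function follows; the field treated as sensitive is hypothetical. Firehose invokes the Lambda with a batch of base64-encoded records and expects each record back with its recordId, a result status, and the transformed data:

import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Drop fields classified as sensitive (hypothetical field name).
        payload.pop("patient_name", None)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                json.dumps(payload).encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}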
Among the choices, the solution that meets the requirements with the least operational overhead is the option that says: Using Amazon Kinesis Data Firehose, ingest the streaming data, and use Amazon S3 for durable storage. Write an AWS Lambda function that removes sensitive data. During the creation of the Kinesis Data Firehose delivery stream, enable record transformation and use the Lambda function.
The option that says: Using Amazon Kinesis Data Firehose, ingest the streaming data, and use Amazon S3 for durable storage. Write an AWS Lambda function that removes sensitive data. Schedule a separate job that invokes the Lambda function once the data is stored in Amazon S3 is incorrect. You do not need to schedule a different job for the AWS Lambda function since you can directly enable and set up a data transformation process in the Kinesis Data Firehose stream. With the Firehose data transformation feature, a Lambda function can be specified to perform transformations directly on the data stream.
The option that says: Using Amazon Kinesis Data Streams, ingest the streaming data, and use an Amazon EC2 instance for durable storage. Write an Amazon Kinesis Data Analytics application that removes sensitive data is incorrect. Writing a custom Kinesis Data Analytics application entails additional effort. In addition, Amazon EC2 does not provide durable storage.
The option that says: Using Amazon Kinesis Data Streams, ingest the streaming data, and use Amazon S3 for durable storage. Write an AWS Lambda function that removes sensitive data. Schedule a separate job that invokes the Lambda function once the data is stored in Amazon S3 is incorrect. Amazon Kinesis Data Streams does not support direct transfer to Amazon S3 without using another service. The data transformation is also not done in near real-time. A better solution is to use Amazon Kinesis Firehose with its data transformation feature enabled.
References: https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html https://aws.amazon.com/blogs/big-data/persist-streaming-data-to-amazon-s3-using-amazon-kinesis-firehose-and-aws-lambda/ https://aws.amazon.com/blogs/compute/amazon-kinesis-firehose-data-transformation-with-aws-lambda/
Question 11 of 65
11. Question
A scientist in a research institution runs an Apache Hive script to batch process agricultural data stored on an Amazon S3 bucket. The script needs to run at 4:00 PM every day after all new data is saved to the S3 bucket. The output of the script is saved in another Amazon S3 bucket. Running the script on a three-node cluster in the building's data center takes about 1-2 hours to complete. The scientist wants to move this batch process to the cloud and run it on a regular basis efficiently.
Which of the following is the most cost-effective solution for scheduling and executing the batch process?
Correct
You can create an AWS Lambda function to provision an Amazon EMR cluster and invoke it with Amazon CloudWatch. You can create a scheduled CloudWatch Event to invoke the Lambda function at 4:00 PM every day. The Lambda function will then spin up an Amazon EMR cluster using the RunJobFlow API that will execute the Hive script for the batch process.
RunJobFlow creates and starts running a new EMR cluster (job flow). The cluster runs the steps specified. After the steps complete, the cluster stops and the HDFS partition is lost. To prevent loss of data, configure the last step of the job flow to store results in Amazon S3. If the JobFlowInstancesConfig KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the cluster transitions to the WAITING state rather than shutting down after the steps have completed.
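A minimal sketch of the Lambda body, assuming hypothetical bucket names, instance types, and the default EMR roles: it launches a transient cluster with a single Hive step, and because KeepJobFlowAliveWhenNoSteps is FALSE and termination protection is off, the cluster terminates once the step finishes:

import boto3

def lambda_handler(event, context):
    emr = boto3.client("emr")
    emr.run_job_flow(
        Name="daily-hive-batch",
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Hive"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # shut down after the steps
            "TerminationProtected": False,
        },
        Steps=[{
            "Name": "run-hive-script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", "s3://analytics-scripts/batch.q"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )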
Therefore the correct answer is: Schedule an Amazon CloudWatch Events rule to invoke a Lambda function to run at 4:00 PM daily. Configure your AWS Lambda function to provision an Amazon EMR cluster with a Hive execution step. On the RunJobFlow API, set the KeepJobFlowAliveWhenNoSteps to FALSE and disable the termination protection flag.
The option that says: Schedule an Amazon CloudWatch Events rule to invoke a Lambda function to run at 4:00 PM daily. Configure your AWS Lambda function to provision an Amazon EMR cluster with Hue (Hadoop User Experience), Apache Hive, and Apache Oozie. Set the termination protection flag to FALSE and use Spot Instances for the core nodes of the cluster. Configure an Oozie workflow in the cluster to invoke the Hive script at bootup is incorrect because using Spot Instances for the core nodes is not recommended as it will affect the job availability when the instances suddenly terminate.
The option that says: Configure an AWS Glue job to run at 4:00 PM using a time-based schedule. Configure the job to include the Hive script to perform the batch operation at the specified time is incorrect because an AWS Glue job only executes Apache Spark, Spark Streaming, or Python shell scripts. AWS Glue doesn't directly support Apache Hive.
The option that says: Create an AWS Step Function workflow to schedule running a Lambda function daily at 4:00 PM. Have the Lambda function load the Hive runtime and copy the Hive script. Add a step to call the RunJobFlow API to provision an EMR cluster and bootstrap the Hive script is incorrect because Lambda does not have an Apache Hive runtime. You can set up an Amazon EMR cluster with a Hive execution step to fulfill this requirement.
References: https://aws.amazon.com/blogs/big-data/data-lake-ingestion-automatically-partition-hive-external-tables-with-aws/ https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html https://aws.amazon.com/blogs/aws/new-using-step-functions-to-orchestrate-amazon-emr-workloads/
Question 12 of 65
12. Question
A company has recently hired a Data Analyst to uncover any untapped value from the records they collected over the past years. The Data Analyst has been instructed to use Amazon Redshift to analyze the historical records. However, the analyst is unsure of which distribution style should be used.
Which of the following is NOT a best practice when choosing the best distribution style?
Correct
When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is run.
Some suggestions for best approach follow:
Distribute the fact table and one dimension table on their common columns.
Your fact table can have only one distribution key. Any tables that join on another key aren't collocated with the fact table. Choose one dimension to collocate based on how frequently it is joined and the size of the joining rows. Designate both the dimension table's primary key and the fact table's corresponding foreign key as the DISTKEY.
Choose the largest dimension based on the size of the filtered dataset.
Only the rows that are used in the join need to be distributed, so consider the size of the dataset after filtering, not the size of the table.
Choose a column with high cardinality in the filtered result set.
If you distribute a sales table on a date column, for example, you should probably get fairly even data distribution, unless most of your sales are seasonal. However, if you commonly use a range-restricted predicate to filter for a narrow date period, most of the filtered rows occur on a limited set of slices and the query workload is skewed.
Change some dimension tables to use ALL distribution.
If a dimension table cannot be collocated with the fact table or other important joining tables, you can improve query performance significantly by distributing the entire table to all of the nodes. Using ALL distribution multiplies storage space requirements and increases load times and maintenance operations, so you should weigh all factors before choosing ALL distribution.
To let Amazon Redshift choose the appropriate distribution style, don't specify DISTSTYLE. Take note that one of the best practices in selecting a distribution style is to use the largest dimension based on the size of the filtered dataset, not the smallest.
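To make these guidelines concrete, here is a hedged sketch that issues DDL through the Redshift Data API; the cluster, database, table, and column names are all hypothetical. The fact table and its largest, frequently joined dimension share a high-cardinality DISTKEY, while a small, rarely updated dimension uses DISTSTYLE ALL:

import boto3

rsd = boto3.client("redshift-data")

statements = [
    # Fact table and its most frequently joined dimension share a
    # high-cardinality DISTKEY so joining rows are collocated on a node.
    """CREATE TABLE sales_fact (
           sale_id BIGINT,
           customer_id BIGINT,
           amount DECIMAL(12,2))
       DISTSTYLE KEY DISTKEY (customer_id);""",
    """CREATE TABLE customer_dim (
           customer_id BIGINT,
           segment VARCHAR(64))
       DISTSTYLE KEY DISTKEY (customer_id);""",
    # Small, rarely updated dimension copied in full to every node.
    """CREATE TABLE region_dim (
           region_id INT,
           region_name VARCHAR(64))
       DISTSTYLE ALL;""",
]

rsd.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="analyst",
    Sqls=statements,
)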
Hence, the correct answer is: Select the smallest dimension based on the filtered dataset's size.
The following options are all incorrect because they follow the best practice in choosing the appropriate distribution style:
Designate a common column for the fact table and the dimension table
Use a DISTSTYLE ALL distribution for tables that are not frequently updated
Select a DISTKEY with high cardinality
References: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html https://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html
Question 13 of 65
13. Question
A retail company has an Amazon Redshift data warehouse that contains about 300 TB of consumer data patterns. The Redshift cluster is updated every 6 hours by importing newly gathered consumer activity data. Read-only queries are executed throughout the day to generate product patterns that are used to curate the homepage of the company website. However, there is a long-running query on this cluster that runs for more than two hours every hour that causes other short-running queries to get queued up or take a longer time to execute.
Which of the following is the most cost-effective solution to optimize query execution and avoid any downtime?
Correct
Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries. You can use workload management (WLM) to define multiple query queues and to route queries to the appropriate queues at runtime.
In some cases, you might have multiple sessions or users running queries at the same time. In these cases, some queries might consume cluster resources for long periods of time and affect the performance of other queries. For example, suppose that one group of users submits occasional complex, long-running queries that select and sort rows from several large tables. Another group frequently submits short queries that select only a few rows from one or two tables and run in a few seconds. In this situation, the short-running queries might have to wait in a queue for a long-running query to complete. WLM helps manage this situation.
You can configure Amazon Redshift WLM to run with either automatic WLM or manual WLM. Automatic WLM manages the resources required to run queries. It determines how many queries run concurrently and how much memory is allocated to each dispatched query. Manual WLM allows you to modify your WLM configuration to create separate queues for the long-running queries and the short-running queries.
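As a rough illustration, manual WLM queues are defined through the wlm_json_configuration parameter of a cluster parameter group. The sketch below (Python/boto3; the parameter group name, queue settings, and query groups are assumptions, not values from the scenario) creates one queue for long-running queries and one for short queries:

import json
import boto3

redshift = boto3.client("redshift")

# Hypothetical manual WLM configuration: separate queues for long- and short-running queries.
wlm_config = [
    {"query_group": ["long_running"], "query_concurrency": 2, "memory_percent_to_use": 60},
    {"query_group": ["short"], "query_concurrency": 10, "memory_percent_to_use": 40},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-wlm",   # hypothetical parameter group attached to the cluster
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)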
Therefore, the correct answer is: On the Amazon Redshift console, create a parameter group associated with the cluster and configure scaling for the workload management (WLM) queue.
The option that says: On the Amazon Redshift console, create a schedule to increase the number of nodes every morning to accommodate the long-running query is incorrect. Although this may be possible, adding more nodes will incur costs. Using WLM is more cost-effective as it allows you to route queries on different queues automatically.
The option that says: Manually add nodes to the cluster every morning using elastic resize. Resize the cluster again after the long-running query is finished is incorrect. Elastic resize takes a cluster snapshot and the cluster is temporarily unavailable while elastic resize migrates cluster metadata. This means that the system will experience downtime of a few minutes.
The option that says: Use the snapshot, restore, and resize operations in Redshift to create a new cluster. Use the new cluster to run the long-running query then delete the cluster after the query completes is incorrect as creating a separate Redshift cluster just for the long-running queries is not cost-effective. A better way is to configure scaling using the workload management (WLM) queue.
References: https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html https://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html https://docs.aws.amazon.com/redshift/latest/mgmt/managing-cluster-operations.html
Question 14 of 65
14. Question
A mobile game company is using Amazon Redshift data warehouse with a three-node dense storage cluster to analyze 5TB of user gameplay data. With the recent acquisition of a rising mobile game, the company needs to merge another 3TB of data to the cluster. The data analytics team will run complex queries and heavy analytical workloads on this cluster that will require high I/O performance. The company decided to adjust the cluster to meet these changes.
Which of the following actions should the Data Analyst take to adjust the cluster performance?
Correct
Amazon Redshift offers different node types to accommodate your workloads: dense compute and dense storage.
DC2 (dense compute) nodes allow you to have compute-intensive data warehouses with local SSD storage included. You choose the number of nodes you need based on data size and performance requirements. DC2 nodes store your data locally for high performance, and as the data size grows, you can add more compute nodes to increase the storage capacity of the cluster. DC2 nodes allow only up to 2.56TB storage per node but with a very high I/O performance of 7.50 GB/s.
DS2 (dense storage) nodes enable you to create large data warehouses using hard disk drives (HDDs). This node type is recommended for substantial data storage needs. DS2 nodes allow up to 16TB of HDD storage per node but only at a maximum of 3.30 GB/s of I/O performance.
Amazon Redshift also allows you to resize your cluster as your compute and storage demand changes. You can use one of the following approaches:
Elastic resize: Use elastic resize to change the node type, number of nodes, or both. If you only change the number of nodes, then queries are temporarily paused and connections are held open if possible. During the resize operation, the cluster is read-only. Typically, elastic resize takes 10 to 15 minutes. AWS recommends using elastic resize when possible (a minimal API sketch follows this list).
Classic resize: Use classic resize to change the node type, number of nodes, or both. Choose this option when you are resizing to a configuration that isn't available through elastic resize. An example is to or from a single-node cluster. During the resize operation, the cluster is read-only. Typically, classic resize takes 2 hours to 2 days or longer, depending on your data's size.
Snapshot and restore with classic resize: To keep your cluster available during a classic resize, you can first make a copy of an existing cluster, then resize the new cluster.
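For illustration, an elastic resize to DC2 nodes can be requested with a single API call. This is only a sketch (Python/boto3; the cluster identifier, node type, and node count are assumptions):

import boto3

redshift = boto3.client("redshift")

# Elastic resize: change node type and count without recreating the cluster.
redshift.resize_cluster(
    ClusterIdentifier="gameplay-dw",   # hypothetical
    ClusterType="multi-node",
    NodeType="dc2.8xlarge",
    NumberOfNodes=4,
    Classic=False,   # request an elastic resize rather than a classic resize
)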
Therefore, the correct answer for this scenario is: Modify the cluster to use dense compute nodes and scale it using elastic resize to increase performance.
The option that says: Keep using dense storage nodes on the cluster but scale it using elastic resize to increase performance is incorrect because DS2 nodes do not provide the high I/O performance required for the scenario.
The following options are incorrect because classic resize may take hours, or even days, to perform the resize operation. AWS also recommends using elastic resize over the classic type whenever possible.
Modify the cluster to use dense compute nodes and scale it using classic resize to increase performance.
Keep using dense storage nodes on the cluster but scale it using classic resize to increase performance.
References: https://docs.aws.amazon.com/redshift/latest/mgmt/managing-cluster-operations.html#elastic-resize https://aws.amazon.com/premiumsupport/knowledge-center/resize-redshift-cluster/ https://aws.amazon.com/redshift/pricing/
Question 15 of 65
15. Question
A Data Analyst is using an Amazon DynamoDB table for keeping inventory and order management data. The Data Analyst adds a custom classifier to an AWS Glue crawler to extract data from the database. After running the crawler, AWS Glue returns a classification string of UNKNOWN.
What is the most likely reason for the returned classification string?
Correct
You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets. The ETL job reads from and writes to the data stores that are specified in the source and target Data Catalog tables.
A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema. The classifier also returns a certainty number to indicate how certain the format recognition was.
AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers. AWS Glue invokes custom classifiers first, in the order that you specify in your crawler definition. Depending on the results that are returned from custom classifiers, AWS Glue might also invoke built-in classifiers. If a classifier returns certainty=1.0 during processing, it indicates that it's 100 percent certain that it can create the correct schema. AWS Glue then uses the output of that classifier.
If no classifier returns certainty=1.0, AWS Glue uses the output of the classifier that has the highest certainty. If no classifier returns a certainty greater than 0.0, AWS Glue returns the default classification string of UNKNOWN.
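As a loose sketch of how a custom classifier is attached to a crawler (Python/boto3; the classifier name, Grok pattern, IAM role, database, and target are all hypothetical and not taken from the scenario), the crawler below falls back to UNKNOWN whenever neither this classifier nor the built-ins reach a certainty above 0.0:

import boto3

glue = boto3.client("glue")

# Hypothetical custom Grok classifier.
glue.create_classifier(
    GrokClassifier={
        "Name": "inventory-classifier",
        "Classification": "inventory_records",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{WORD:item} %{NUMBER:qty}",
    }
)

# Crawler that invokes the custom classifier first, then the built-in classifiers.
glue.create_crawler(
    Name="inventory-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # hypothetical
    DatabaseName="inventory_db",
    Targets={"DynamoDBTargets": [{"Path": "InventoryOrders"}]},
    Classifiers=["inventory-classifier"],
)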
Hence, the correct answer is: AWS Glue was unable to find a classifier with certainty greater than 0.0.
The option that says: AWS Glue has invoked a built-in classifier is incorrect because the classifier would return a certainty of 1.0 if a built-in classifier was invoked.
The option that says: AWS Glue has invoked a custom classifier with a certainty of -1 is incorrect because the certainty level has no negative values.
The option that says: AWS Glue has invoked a custom classifier that matches the schema of a built-in classifier is incorrect. AWS Glue invokes the custom classifiers first, before its built-in classifiers. If a custom classifier exactly matches the schema of a built-in classifier, then AWS Glue would return a certainty of 1.0 and use the output of the custom classifier.
References: https://aws.amazon.com/emr/faqs/ https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html
Question 16 of 65
16. Question
A company is using a persistent Amazon EMR cluster to process vast amounts of data and store them as external tables in an S3 bucket. The Data Analyst must launch several transient EMR clusters to access the same tables simultaneously. However, the metadata about the Amazon S3 external tables is stored and defined on the persistent cluster.
Which of the following is the most efficient way to expose the Hive metastore with minimal effort?
Correct
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Amazon EMR is ideal for problems that necessitate the fast and efficient processing of large amounts of data.
In this scenario, there are two types of clusters: persistent and transient. A persistent cluster is terminated manually when you no longer need it, while a transient cluster is automatically terminated after all the steps are completed. Using Amazon EMR version 5.8.0 or later, you can configure Hive to use the AWS Glue Data Catalog as its metastore. This configuration is recommended if the metastore is shared by different clusters.
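A minimal sketch of this configuration when launching a transient cluster is shown below (Python/boto3; the cluster name, release label, instance types, and IAM role names are assumptions). The hive-site classification and the AWSGlueDataCatalogHiveClientFactory value are the documented way to point Hive at the Glue Data Catalog:

import boto3

emr = boto3.client("emr")

# Transient EMR cluster whose Hive metastore is the AWS Glue Data Catalog.
emr.run_job_flow(
    Name="transient-hive-cluster",   # hypothetical
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate automatically after the steps finish
    },
    Configurations=[{
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)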
Hence, the correct answer is: Configure Hive to use the AWS Glue Data Catalog as its metastore.
The option that says: Configure Hive to use an Amazon DynamoDB as its metastore is incorrect because an external metastore only supports the AWS Glue Data Catalog, an external MySQL database, or Amazon Aurora.
The option that says: Configure Hive to use an External MySQL Database as its metastore is incorrect because it entails more effort to provision and manage an external MySQL database than to simply use the AWS Glue Data Catalog.
The option that says: Configure Hive to use Amazon Aurora as its metastore is incorrect. Just like the option above, you must use the AWS Glue Data Catalog for minimal effort. Setting up and managing Hive table metadata with Amazon Aurora entails a lot of configuration and associated costs.
References: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html https://aws.amazon.com/emr/faqs/
Question 17 of 65
17. Question
A company is using Amazon Kinesis Data Streams to capture real-time messages from over 150 websites. To handle the traffic spikes, the data stream has been provisioned with 40 shards to achieve maximum data throughput. An Amazon Kinesis Client Library (KCL) application hosted in an Auto Scaling group of EC2 instances consumes the stream, analyzes the data, and stores the results in a DynamoDB table. The average CPU utilization across all servers is 20%, including peak times. The DynamoDB table has a provisioned write capacity unit set to 5.
The manager received a report that the application increased its latency during peak times. Upon initial investigation, there are no ProvisionedThroughputExceededException errors found on the KCL logs and the CPU utilization of the instances didn't exceed their limit. The Data Analyst is instructed to implement a solution that will resolve the latency problem.
Which of the following is the best approach to solve this issue?
Correct
Amazon Kinesis Client Library (KCL) helps you consume and process data from a Kinesis data stream by taking care of many of the complex tasks associated with distributed computing. These include load balancing across multiple consumer application instances, responding to consumer application instance failures, checkpointing processed records, and reacting to resharding. The KCL takes care of all of these subtasks so that you can focus your efforts on writing your custom record-processing logic.
In this scenario, you need to resolve the latency problem of your KCL application. There were no ProvisionedThroughputExceededException errors found on the logs and the CPU utilization of the instances didn't exceed their limit. This means that there is no issue on your data stream or on your EC2 instances. What's left is the write throughput of your DynamoDB table that could potentially delay the processing. Since the DynamoDB table only has a provisioned write capacity unit of 5, this could cause the processing to slow down.
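For reference, raising the provisioned write capacity is a single UpdateTable call. The sketch below (Python/boto3) uses a hypothetical table name and capacity values; switching the table to on-demand capacity would be another way to absorb the peaks:

import boto3

dynamodb = boto3.client("dynamodb")

# Raise the write throughput so the KCL consumers are no longer throttled by the table.
dynamodb.update_table(
    TableName="StreamResults",   # hypothetical
    ProvisionedThroughput={
        "ReadCapacityUnits": 50,
        "WriteCapacityUnits": 500,   # up from the original 5 WCUs
    },
)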
Hence, the correct answer is: Increase the write throughput of the DynamoDB table.
The option that says: Increase the number of shards in the Kinesis Data Stream is incorrect because there are no ProvisionedThroughputExceededException errors found in the logs; such errors would indicate a lack of shards in the Kinesis stream. In this scenario, the issue lies in the underlying data store being used by the KCL application.
The option that says: Scale up the KCL application by using a higher EC2 instance type to increase network performance is incorrect because increasing the instance size won't help reduce the latency. The issue is not caused by the network performance of the KCL consumers but by their underlying data store.
The option that says: Update the Auto Scaling group to increase the minimum number of running EC2 instances is incorrect because it is already mentioned in the scenario that the CPU utilization of the EC2 instances did not exceed their limit. The average CPU utilization across all servers is 20%, including peak times, which means that the compute capacity is not an issue.
References: https://aws.amazon.com/blogs/big-data/processing-amazon-dynamodb-streams-using-the-amazon-kinesis-client-library/ https://docs.aws.amazon.com/streams/latest/dev/introduction.html
Question 18 of 65
18. Question
A smart home security company wants to add more features to its current system to enhance security and improve customer satisfaction. The company uses sensors that send nested JSON files asynchronously into a Kinesis data stream by utilizing the Kinesis Producer Library (KPL) in Java. Upon inspection, it was found that a faulty sensor tends to push recorded data to the cloud at irregular intervals. The company has to design a near-real-time analytics solution to get data from the most updated and healthy sensors.
What solution will allow the company to meet these requirements?
Correct
There is a maximum amount of time (milliseconds) that a record may spend being buffered before it gets sent. Records may be sent sooner than this depending on the other buffering limits. This setting provides coarse ordering among records: any two records will be reordered by no more than twice this amount (assuming no failures and retries and equal network latency). The library makes a best effort to enforce this time, but cannot guarantee that it will be precisely met. In general, if the CPU is not overloaded, the library will meet this deadline within 10ms. Failures and retries can additionally increase the amount of time records spend in the KPL. If your application cannot tolerate late records, use the record_ttl setting to drop records that do not get transmitted in time. Setting this too low can negatively impact throughput.
The KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly.
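The scenario calls for the AWS SDK for Java, but for brevity the sketch below shows the equivalent PutRecords call with Python/boto3 (the stream name and payload are hypothetical); records are sent immediately instead of being buffered by the KPL:

import json
import boto3

kinesis = boto3.client("kinesis")

# Send records directly with the SDK, bypassing the KPL's RecordMaxBufferedTime delay.
records = [
    {
        "Data": json.dumps({"sensor_id": "door-01", "status": "armed"}).encode("utf-8"),
        "PartitionKey": "door-01",
    },
]

kinesis.put_records(StreamName="home-security-stream", Records=records)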
Hence, the correct answer is:
Modify the sensor's code by utilizing the PutRecord or PutRecords of the Kinesis Data Streams API from the AWS SDK for Java. Create an Amazon Kinesis Data Analytics application for data enrichment using a custom anomaly detection SQL script. Send the enriched data to an Amazon Kinesis Data Firehose delivery stream and enable data transformation by configuring an AWS Lambda Function to flatten the JSON file. Use Amazon Elasticsearch Service as the destination of data coming from the Kinesis Data Firehose delivery stream.
The following options are both incorrect because KPL adds a processing delay that depends on the value of the RecordMaxBufferedTime parameter. Setting the RecordMaxBufferedTime to a low value negatively affects the throughput:
Deactivate the buffering on the sensor side by setting the value of the RecordMaxBufferedTime configuration parameter of the KPL to 0. Create a dedicated Kinesis Data Firehose delivery stream for each data stream and enable data transformation by configuring an AWS Lambda Function to flatten the JSON file. Load the processed data to an Amazon S3 bucket. Use an Amazon Redshift cluster to read the data from Amazon S3.
Deactivate the buffering on the sensor side by setting the value of the RecordMaxBufferedTime configuration parameter of the KPL to -1. Direct the data to Amazon Kinesis Data Analytics for data enrichment using a custom anomaly detection SQL script. Send the enriched data to a fleet of Kinesis data streams with data transformation enabled to flatten the JSON file. Use an Amazon Redshift cluster with dense storage as the destination of data coming from the Kinesis Data Firehose delivery stream.
The option that says: Modify the sensor's code by utilizing the PutRecord or PutRecords of the Kinesis Data Streams API from the AWS SDK for Java. Process data from the stream using a Streaming ETL Job in AWS Glue. Create an AWS Lambda function to push the processed data into an Amazon Elasticsearch Service cluster is incorrect because this is not a near-real-time solution. AWS Glue processes and writes out data in a 100-second window by default.
References: https://docs.amazonaws.cn/en_us/streams/latest/dev/developing-producers-with-kpl.html https://javadoc.io/static/com.amazonaws/amazon-kinesis-producer/0.14.0/com/amazonaws/services/kinesis/producer/KinesisProducerConfiguration.html https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html
Question 19 of 65
19. Question
A Data Analytics team is conducting an in-depth data exploration using Amazon Redshift. The team manages a Redshift cluster that uses a star schema design from which thousands of files are being loaded into the central fact table. A team member suggested optimizing the cluster resource utilization when loading data into the fact table to achieve the highest throughput.
Which solution will meet these requirements?
Correct
The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files in an Amazon S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files and by setting distribution keys on your tables.
Amazon Redshift automatically loads in parallel from multiple data files. If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load. This type of load is much slower and requires a VACUUM process at the end, if the table has a sort column defined.
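A rough sketch of a single COPY over a split data set follows (Python/boto3 with the Redshift Data API; the cluster, table, bucket prefix, and IAM role are hypothetical). One COPY over a common key prefix lets each slice load a subset of the files in parallel:

import boto3

client = boto3.client("redshift-data")

# One COPY command for all files under the prefix; Redshift parallelizes the load across slices.
client.execute_statement(
    ClusterIdentifier="analytics-cluster",   # hypothetical
    Database="dev",
    DbUser="awsuser",
    Sql="""
        COPY sales_fact
        FROM 's3://example-bucket/fact/part-'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """,
)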
Hence, the correct answer is: Load the data into the Redshift cluster using a single COPY command.
The option that says: Load the data into the Redshift cluster using multiple COPY commands is incorrect because using multiple COPY commands will force Redshift to use serialized load which is slower.
The option that says: Ingest data into the Redshift cluster via an Hadoop Distributed File System (HDFS) connector. Load multiple files into the HDFS using the S3DistCp command is incorrect because the S3DistCp command is used in Amazon EMR, and not in Amazon Redshift.
The option that says: Determine the number of Redshift cluster nodes and use LOAD commands equal to that number to parallelize the loading of data into each node is incorrect because Redshift already loads data in parallel automatically when using a single COPY command.
References: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-single-copy-command.html https://docs.aws.amazon.com/us_en/redshift/latest/dg/t_Loading-data-from-S3.html
Question 20 of 65
20. Question
A particle physics laboratory generates up to 1 TB of data per day as physicists run simulations for their experiments. The raw data is converted into large .csv files and stored in an Amazon S3 bucket with folders partitioned by date. At the end of each business day, the data is loaded into an Amazon Redshift data warehouse to run analyses and detect patterns in the experiments. However, loading data from the S3 bucket into Redshift takes a long time.
Which of the following actions will help improve the data loading times?
Correct
The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files in an Amazon S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files and by setting distribution keys on your tables.
When you load all the data from a single large file, Amazon Redshift is forced to perform a serialized load, which is much slower. The COPY command loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression. The number of files should be a multiple of the number of slices in your cluster.
You can follow this general process to load data from Amazon S3:
Split your data into multiple files.
Upload your files to Amazon S3.
Run a COPY command to load the table.
Verify that the data was loaded correctly.
Data from the files is loaded into the target table, one line per row. The fields in the data file are matched to table columns in order, left to right. Fields in the data files can be fixed-width or character delimited; the default delimiter is a pipe (|). By default, all the table columns are loaded, but you can optionally define a comma-separated list of columns. If a table column is not included in the column list specified in the COPY command, it is loaded with a default value.
Therefore, the correct answer is: Store the .csv files in Amazon S3 but split the large .csv files into smaller chunks. Use the COPY command to load the files into Amazon Redshift.
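A minimal sketch of the splitting step is shown below (the file paths, chunk size, and bucket name are hypothetical); each gzip-compressed chunk is uploaded under the same prefix so that a single COPY command can load them all in parallel:

import gzip
import boto3

s3 = boto3.client("s3")
LINES_PER_CHUNK = 1_000_000  # tune so compressed chunks land between 1 MB and 125 MB

def split_and_upload(path, bucket, prefix):
    """Split a large CSV into gzip-compressed chunks and upload them to S3."""
    with open(path, "rb") as source:
        chunk, part = [], 0
        for line in source:
            chunk.append(line)
            if len(chunk) == LINES_PER_CHUNK:
                _upload(chunk, bucket, f"{prefix}/part-{part:05d}.csv.gz")
                chunk, part = [], part + 1
        if chunk:
            _upload(chunk, bucket, f"{prefix}/part-{part:05d}.csv.gz")

def _upload(lines, bucket, key):
    # Each object becomes one unit of parallel work for the COPY command.
    s3.put_object(Bucket=bucket, Key=key, Body=gzip.compress(b"".join(lines)))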
The option that says: Store the .csv files in Amazon S3 in compressed format, then issue the INSERT command to load the files into Amazon Redshift is incorrect because you have to use the COPY command to load data from an S3 bucket to Amazon Redshift. You only use the INSERT command when you need to move data or a subset of data from one table into another.
The option that says: Stream the large .csv files in parallel to Amazon Kinesis Data Firehose and ingest into Amazon Redshift is incorrect. Although this may be possible, creating an Amazon Kinesis Data Firehose stream is unnecessary as it will just incur an additional cost. The most suitable solution for this scenario is to simply split the large files into smaller chunks to improve the load performance.
The option that says: Vacuum the table in Amazon Redshift after loading each .csv file in an unsorted key order to improve the loading time is incorrect as this will not improve the loading performance. The VACUUM command simply re-sorts rows and reclaims space in either a specified table or all tables in the current database.
References: https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-multiple-files.html
Question 21 of 65
21. Question
A weather forecasting company created a mobile app for users who want to see weather updates on their smartphones. It uses historical weather data to generate accurate weather forecasts. The data analysts have been tasked with identifying a high-performing, long-term storage solution for the historical weather data with the following requirements:
Historical weather data is approximately 25 TB uncompressed.
Every 30 minutes, there will be single-row inserts (low volume).
Thousands of aggregation queries each day (high volume).
There is a need to perform multiple complex joins.
A small subset of the columns in a table is usually involved when querying data.
Which storage service will provide the MOST suitable solution?
Correct
Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes. Amazon Redshift uses a columnar architecture, which means the data is organized by columns on disk instead of row-by-row as in the OLTP approach. Columnar architecture offers advantages when querying a subset of the columns in a table by greatly reducing I/O. And because the data is stored by column, it can be highly compressed which further reduces I/O and allows more data to be stored and quickly queried.
Amazon Redshift supports SQL joins. A SQL JOIN is a clause used for combining specific fields from two or more tables based on their common columns. Joins combine rows from multiple tables, and Redshift's JOIN clause is perhaps the second most important clause after the SELECT clause. The ability to run complex JOINs is typically needed to produce more sophisticated results than a single smaller dataset could yield.
Thus, the correct answer is: Load the historical weather data on Amazon Redshift.
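For illustration only (the table and column names are hypothetical), the kind of aggregation query described in the scenario touches just a few columns across joined tables, which is exactly where Redshift's columnar storage reduces I/O:

import boto3

# Hypothetical schema: the aggregation reads only a handful of the tables' columns,
# so columnar storage scans a small fraction of the data on disk.
sql = """
    SELECT s.station_name, AVG(o.temperature_c) AS avg_temp
    FROM observations o
    JOIN stations s ON s.station_id = o.station_id
    WHERE o.observed_at >= DATEADD(day, -30, GETDATE())
    GROUP BY s.station_name;
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="weather-cluster", Database="weather", DbUser="analyst", Sql=sql
)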
The option that says: Import the historical weather data on Amazon Aurora MySQL is incorrect. Although Aurora MySQL can handle 25 TB of uncompressed data, its row-oriented storage is not optimal for queries that typically involve only a small subset of the columns in a table.
The option that says: Load the historical weather data on Amazon Neptune is incorrect because Amazon Neptune is just a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with milliseconds latency.
The option that says: Store the historical weather data on Amazon Elasticsearch is incorrect because Elasticsearch is simply a popular open-source search and analytics engine that is suitable for use cases such as log analytics, real-time application monitoring, and clickstream analysis.
References: https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html https://docs.aws.amazon.com/redshift/latest/dg/r_Join_examples.html
Question 22 of 65
22. Question
A retail company uses Amazon S3 Standard buckets in both the Tokyo and Singapore AWS Regions as its primary data storage. A recent change in regulatory compliance requirements has prompted the company to apply server-side encryption and enable lifecycle rules for all S3 buckets in both Regions, which transition data to S3 Standard-IA and Amazon S3 Glacier. The company queries and analyzes data residing in the Tokyo Region using Amazon Athena. However, even with correct IAM permissions, some of the data cannot be accessed.
Which of the following MOST likely explains why some data are inaccessible?
Correct
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.
Athena supports querying objects that are stored with multiple storage classes in the same bucket specified by the LOCATION clause. For example, you can query data in objects that are stored in different storage classes (Standard, Standard-IA, and Intelligent-Tiering) in Amazon S3.
When data is moved or transitioned to the Amazon S3 GLACIER storage class, it is no longer readable or queryable by Athena. This is true even after storage class objects are restored. To make the restored objects that you want to query readable by Athena, copy the restored objects back into Amazon S3 to change their storage class.
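A minimal boto3 sketch of that restore-and-copy-back workflow is shown below (the bucket, key, and retrieval tier are hypothetical); once the temporary restore completes, copying the object over itself with a readable storage class makes it queryable by Athena again:

import boto3

s3 = boto3.client("s3")
bucket, key = "tokyo-data-bucket", "logs/2023/01/data.csv"  # hypothetical object

# 1. Ask S3 to restore a temporary copy of the archived object.
s3.restore_object(
    Bucket=bucket,
    Key=key,
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
)

# 2. After the restore finishes, copy the object onto itself with a readable
#    storage class so Athena can query it.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    StorageClass="STANDARD",
)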
S3 Standard-IA is an Amazon S3 storage class for data that is accessed less frequently but requires rapid access when needed. Amazon S3 Standard-IA offers the same high durability, high throughput, and low latency as S3 Standard.
Hence, the correct answer is: Amazon Athena is trying to access data stored in Amazon S3 Glacier.
The option that says: Amazon Athena is trying to access data stored in Amazon S3 Standard-IA in Tokyo Region is incorrect because Athena can query data stored in the S3 Standard-IA storage class.
The option that says: Amazon Athena is trying to access data stored in Amazon S3 that do not have public access enabled is incorrect. Amazon Athena can query S3 buckets that do not have public access. Also, it is stated in the scenario that the company is already using the correct IAM permissions. Therefore, it is highly unlikely that there is a permission issue.
The option that says: Amazon Athena is running from a different AWS Region is incorrect because Amazon Athena can run cross-Region queries against S3 buckets in other Regions.
References: https://aws.amazon.com/s3/storage-classes/ https://docs.aws.amazon.com/athena/latest/ug/other-notable-limitations.html https://docs.aws.amazon.com/athena/latest/ug/creating-tables.html
Question 23 of 65
23. Question
A company is undergoing digital transformation and wants to improve its business and data analytics capabilities. The system currently stores 200 TB of data and generates an additional 50 TB every month. A new solution must be implemented that supports the following:
The IT operations team generates an hourly performance report that uses the data for the current month.
The accounting team needs to run a financial health check report daily based on last month's data and a monthly investor report covering the previous 12 months.
The management team wants a visual data dashboard that automatically updates itself based on the last 30 days of data as soon as it gets committed in the backend.
Which of the following is the most cost-effective solution that meets all the above requirements?
Correct
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. It enables you to use your data to acquire new insights for your business and customers. Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remain in Amazon S3. Multiple clusters can concurrently query the same dataset in Amazon S3 without making copies of the data for each cluster.
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
Meanwhile, Amazon QuickSight is a business analytics service you can use to build visualizations, perform ad hoc analysis, and get business insights from your data. It can automatically discover AWS data sources and also works with your data sources. Amazon QuickSight enables organizations to scale to hundreds of thousands of users and deliver responsive performance using a robust in-memory engine (SPICE).
In this scenario, which asks for a proper storage solution, you need to optimize job performance for the run frequencies and data sources specified by the different teams while following the AWS Well-Architected Framework, specifically the cost optimization pillar. Query services, data warehouses, and complex data processing frameworks all have their place, and they are used for different things. AWS recommends using Amazon EMR if you use custom code to process and analyze extremely large datasets with big data processing frameworks such as Apache Spark, Hadoop, Presto, or HBase.
Amazon EMR gives you full control over your clusters' configuration and the software you install on them. Amazon Redshift is the best service to use when performing complex queries on massive collections of structured and semi-structured data with fast performance. While the Redshift Spectrum feature is excellent for running queries against data in Amazon Redshift and S3, it isn't a fit for the types of use cases that enterprises typically ask of processing frameworks like Amazon EMR.
However, with Redshift Spectrum, Amazon Redshift users can take advantage of inexpensive S3 storage and still scale out to pull, filter, aggregate, group, and sort data. Because Spectrum is serverless, there's nothing to provision or manage. You pay only for the queries you run against the data that you scan.
Hence, the correct answer is: Maintain the last two months of data in Amazon Redshift and unload the older data to an Amazon S3 bucket. Use Amazon Redshift Spectrum and set up an external schema for data access. Use Amazon QuickSight with Amazon Redshift Spectrum as the data source.
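As a rough sketch of the external schema setup (all identifiers are hypothetical), the older data unloaded to Amazon S3 is exposed to Redshift Spectrum through a Glue Data Catalog database:

import boto3

# Hypothetical names throughout; the external schema points Redshift Spectrum at a
# Glue Data Catalog database whose tables reference files stored in Amazon S3.
ddl = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_archive
    FROM DATA CATALOG
    DATABASE 'archive_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster", Database="analytics", DbUser="admin", Sql=ddl
)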
The option that says: Leverage on Amazon S3 to store the last 12 months of data. Use Amazon QuickSight with Amazon Redshift Spectrum as the data source is incorrect. Two workloads coming from both the management team and the accounting team require better performance. Storing the last 12 months of data will have a performance impact on your queries. The previous two months of data must be kept in Redshift and not entirely in Amazon S3 for faster retrieval.
The option that says: Maintain the last two months of data in Amazon Redshift then unload the older data to an Amazon S3 bucket. Use a persistent Amazon EMR with Apache Spark cluster for data access. Use Amazon QuickSight with Amazon EMR as the data source is incorrect because a persistent Amazon EMR cluster entails a lot of costs. More importantly, you cannot use Amazon EMR as a data source for Amazon QuickSight.
The option that says: Use an Amazon Redshift cluster to store the last 12 months of data. Use Amazon QuickSight with the cluster as the data source is incorrect. Keeping 12 months of data in a Redshift cluster costs more than just unloading the infrequently used data to Amazon S3. This option does not utilize the Redshift Spectrum feature that can save more recurring costs to the company.
References: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-schemas.html https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
Question 24 of 65
24. Question
A company wants to analyze the data inside a GZIP-compressed comma-separated values (CSV) file that they generate every month. The file is 150 MB in size with 25,000 data records and is currently archived in Amazon S3 Glacier. The data analytics team needs to query a subset of the data and extract the first ten columns for the records that match a specific condition.
Which of the following is the most cost-effective solution to implement?
Correct
Amazon Simple Storage Service (S3) stores data for millions of applications used by market leaders in every industry. Many of these customers also use Amazon Glacier for secure, durable, and extremely low-cost archival storage.
S3 Select enables applications to retrieve only a subset of data from an object using simple SQL expressions. Without S3 Select, you would need to download, decompress, and process the entire CSV to get the data you needed. With S3 Select, you can use a simple SQL expression to return only the data you're interested in, instead of retrieving the entire object. Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects compressed with GZIP or BZIP2 (for CSV and JSON objects only) and server-side encrypted objects. You can specify the format of the results as either CSV or JSON, and you can determine how the records in the output are delimited.
In contrast, cold data stored in Glacier can now be easily queried within minutes. The following are requirements for using S3 Glacier Select:
Archive objects queried by S3 Glacier Select must be formatted as uncompressed comma-separated values (CSV).
You must have an S3 bucket to work with. The AWS account you use to initiate an S3 Glacier Select job must have write permissions for the S3 bucket. The Amazon S3 bucket must be in the same AWS Region as the vault that contains the queried archive object.
You must have permission to call Get Job Output (GET output).
Amazon Athena is integrated out of the box with the AWS Glue Data Catalog; you can also use Glue's fully managed ETL capabilities to transform data or convert it into columnar formats to optimize cost and improve performance. Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.
Hence, the correct answer is: Restore the archive to an Amazon S3 bucket and use Amazon S3 Select to query the data.
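A minimal S3 Select sketch with boto3 is shown below (the bucket, key, and column positions are hypothetical); it extracts the first ten columns of the GZIP-compressed CSV for rows matching a condition, without downloading the whole object:

import boto3

s3 = boto3.client("s3")

# Hypothetical object; s._11 below refers to the 11th CSV column used as the filter.
response = s3.select_object_content(
    Bucket="monthly-archive-bucket",
    Key="reports/2023-01.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT s._1, s._2, s._3, s._4, s._5, s._6, s._7, s._8, s._9, s._10 "
               "FROM s3object s WHERE s._11 = 'MATCH'",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {}},
)

# Stream the matching rows from the event stream.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")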
The option that says: Use Amazon Glacier Select to directly query the records is incorrect because you can't directly query a GZIP-compressed CSV file using Amazon S3 Glacier Select. The archives must be stored as uncompressed comma-separated values (CSV) before you can query them. Alternatively, you can restore the data to Amazon S3 and then use S3 Select.
The option that says: Restore the archive to an Amazon S3 bucket and use Amazon Athena to query the data is incorrect. Although this solution is feasible, the use of Amazon Athena is unwarranted since the scenario already mentioned that it only needs to query a subset of data. Using Amazon S3 Select meets the requirements and costs less than using Athena.
The option that says: Restore the archive to an Amazon S3 bucket and use Amazon Redshift Spectrum to query the data is incorrect. Amazon Redshift Spectrum can query GZIP-compressed CSV files. However, it costs more than S3 Select. Please take note that we only need to query a subset of data and not the entire data.
References: https://docs.aws.amazon.com/amazonglacier/latest/dev/glacier-select.html https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html
Question 25 of 65
25. Question
A company uses an Amazon Redshift cluster for data warehousing and plans to migrate the data from a new application into a cluster table. The Data Analyst used the COPY command to migrate over 20,000 records stored in CSV files from an S3 bucket. However, after the process finished, the analyst discovered that no data was imported. No errors were found upon investigation.
What is most likely causing the issue?
Correct
In Amazon Redshift, the COPY command loads data into a table from data files or an Amazon DynamoDB table. The files can be located in an Amazon Simple Storage Service (Amazon S3) bucket, an Amazon EMR cluster, or a remote host accessed using a Secure Shell (SSH) connection. As it loads the table, COPY attempts to convert the strings in the source data to the target column's data type. If you need to specify a conversion that is different from the default behavior, or if the default conversion results in errors, you can manage data conversions by setting supported parameters.
COPY fails to load data to Amazon Redshift if the CSV file uses carriage returns (\r, ^M, or 0x0D in hexadecimal) as a line terminator. Because Amazon Redshift doesn't recognize carriage returns as line terminators, the file is parsed as one line. When the COPY command has the IGNOREHEADER parameter set to a non-zero number, Amazon Redshift skips the first line, and therefore, the entire file. No load errors are returned because the operation is technically successful.
The critical statement in this scenario is that no errors were returned even after the process finished. Although all of the options can prevent the data import from succeeding, the only cause that would not return an error message yet still allow the process to finish is: The CSV file uses carriage returns as a line terminator and the COPY command has the IGNOREHEADER parameter set.
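One possible way to remediate this (the file and bucket names are hypothetical) is to normalize the carriage returns before uploading, so the file really contains one record per line and IGNOREHEADER only skips the header row:

import boto3

# Hypothetical paths; rewrite Windows/classic-Mac line endings as Unix newlines
# so Redshift sees one record per line instead of a single giant line.
with open("export.csv", "rb") as f:
    data = f.read().replace(b"\r\n", b"\n").replace(b"\r", b"\n")

with open("export-fixed.csv", "wb") as f:
    f.write(data)

boto3.client("s3").upload_file("export-fixed.csv", "staging-bucket", "imports/export-fixed.csv")

# Then, in Redshift (hypothetical table and role):
# COPY target_table FROM 's3://staging-bucket/imports/export-fixed.csv'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' CSV IGNOREHEADER 1;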
The option that says: The COPY command was blocked by other queries running in the Amazon Redshift Cluster is incorrect. If this were the case, Redshift would either return a timeout error or keep running until the blocking process is resolved.
The option that says: The COPY command was trying to import it into a table that has not been created yet is incorrect because the process would fail with an error stating that the table does not exist. Take note that it's stated in the scenario that no errors were found upon investigation.
The option that says: The Amazon Redshift cluster is stopped is incorrect because if this were the case, the COPY command would fail and return an error stating that the cluster is not available or that the connection failed.
References: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html https://aws.amazon.com/premiumsupport/knowledge-center/redshift-copy-nothing-loaded/
Question 26 of 65
26. Question
A smart home automation firm needs to analyze data generated from 100 unique devices. The data is collected by a Kinesis Data Firehose delivery stream and stored in Amazon S3 in JSON format. Every night at 12:00 AM, the data is loaded for processing.
The firm wants to use Amazon Athena to study data changes over time that are stored in those files. Also, the cost of running Athena queries must be optimized.
Which steps will have the MOST impact on lowering the costs? (Select TWO.)
Correct
Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and ORC are columnar data formats that save space and enable faster queries than row-oriented formats like JSON. Snappy compression happens automatically as part of the conversion process. The framing format for Snappy that Kinesis Data Firehose uses, in this case, is compatible with Hadoop. It means that you can use the results of the Snappy compression and run queries on this data in Athena.
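For reference, a sketch of enabling that conversion when creating the delivery stream with boto3; the stream name, bucket, IAM roles, and Glue schema references are hypothetical, and the buffering values are only illustrative:

import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream that converts incoming JSON records to
# Parquet (Snappy-compressed by the conversion process) before writing to S3.
firehose.create_delivery_stream(
    DeliveryStreamName="device-metrics-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/ExampleFirehoseRole",
        "BucketARN": "arn:aws:s3:::example-analytics-bucket",
        "Prefix": "processed/",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/ExampleFirehoseRole",
                "DatabaseName": "iot",
                "TableName": "device_metrics",
                "Region": "us-east-1",
            },
        },
    },
)

Record format conversion requires a schema registered in the AWS Glue Data Catalog (the SchemaConfiguration above) and a buffer size of at least 64 MB.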
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost. Athena leverages Hive for partitioning data. You can partition your data by any key. A common practice is to partition the data based on time, often leading to a multi-level partitioning scheme. For example, a customer who has data coming in every hour might decide to partition by year, month, date, and hour. Another customer, who has data from many different sources but loaded one time per day, may partition by a data source identifier and date.
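As an illustration of that partitioning scheme, a sketch of the corresponding external table definition submitted through boto3; the database, table, bucket, and query-result locations are hypothetical:

import boto3

athena = boto3.client("athena")

# Hypothetical table over the Parquet files written by the delivery stream,
# partitioned by device and date so each query scans only the data it needs.
ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS iot.device_metrics (
        metric_name string,
        metric_value double,
        recorded_at timestamp
    )
    PARTITIONED BY (device string, dt string)
    STORED AS PARQUET
    LOCATION 's3://example-analytics-bucket/processed/';
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "iot"},
    ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
)

New partitions still have to be registered (for example with ALTER TABLE ... ADD PARTITION, MSCK REPAIR TABLE, or partition projection) before Athena can prune them.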
The scenario tests what you know about tuning Athena and file formats. Moreover, the question focuses particularly on reducing storage and query-scan costs. Hence, the correct answers are:
In Athena, create the external table and partition it by the device and date.
Through the delivery stream, convert the data format into Apache Parquet with Snappy compression.
The option that says: Through the delivery stream, convert the data format into Apache Avro with no compression is incorrect. Enabling compression reduces costs because there is less data to store. Furthermore, Kinesis Data Firehose can only convert the format of your input data from JSON to Apache Parquet or Apache ORC.
The option that says: Configure the new delivery stream to use a custom prefix based on year, month, day, and hour is incorrect. In Kinesis Data Firehose-S3 delivery, the default prefix is already based on year, month, day, and hour. From a storage cost-reduction perspective, it does not help in properly partitioning your data. You have to partition your data by a data source identifier (e.g. sensor ID or device ID) and date for better performance.
The option that says: In Athena, create the external table and partition it by year, month, day, and hour is incorrect. Since data gets loaded once a day, it is more suitable to partition it based on device and date instead.
References: https://docs.aws.amazon.com/athena/latest/ug/partitions.html https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html https://docs.aws.amazon.com/athena/latest/ug/partition-projection-kinesis-firehose-example.html
Question 27 of 65
27. Question
A company needs to upgrade an Amazon Redshift cluster to support the new features of its data warehouse application. There will be several changes to the current database such as user permission updates and table schema modifications. Before running the upgrade scripts, the Data Analyst must create point-in-time backups to restore the service to its previous state if problems arise.
Which of the following options could help fulfill this task?
Correct
Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.
To create a point-in-time backup of a cluster, you can use a snapshot. There are two types of snapshots: automated and manual. The backups you create are stored in an Amazon S3 bucket. If automated snapshots are enabled, Amazon Redshift takes a snapshot every eight hours or following every 5 GB per node of data changes, whichever comes first. A manual snapshot, by contrast, can be taken at any time. By default, manual snapshots are retained indefinitely, even after you delete your cluster. Since you need to create a backup before upgrading the DDL, the appropriate snapshot to use is a manual snapshot.
Hence, the correct answer is: Create a manual snapshot of the cluster.
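A minimal sketch of taking that manual snapshot with boto3 before running the upgrade scripts; the cluster and snapshot identifiers are hypothetical:

import boto3

redshift = boto3.client("redshift")

# Hypothetical identifiers; manual snapshots are retained until explicitly deleted.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="pre-upgrade-backup",
    ClusterIdentifier="example-warehouse-cluster",
)

# If the upgrade causes problems, restore into a new cluster from the snapshot:
# redshift.restore_from_cluster_snapshot(
#     ClusterIdentifier="example-warehouse-cluster-restored",
#     SnapshotIdentifier="pre-upgrade-backup",
# )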
The option that says: Unload the results of Amazon Redshift to Amazon S3 is incorrect because the UNLOAD command only exports query results to S3; it does not create a cluster snapshot.
The option that says: Restore the service using the automated snapshot is incorrect. Although you can use this option to recover the resource to its previous state, automated snapshots run on Amazon Redshift's own schedule (roughly every eight hours or every 5 GB per node of data changes), so you cannot guarantee that one is taken immediately before running the upgrade script. Therefore, you must create a manual snapshot.
The option that says: Use AWS Lake Formation to automatically take the snapshot of the Amazon Redshift cluster and store the data in Amazon S3 is incorrect because AWS Lake Formation is just an integrated data lake service that makes it easy for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and machine learning.
References: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html https://docs.aws.amazon.com/redshift/latest/mgmt/managing-snapshots-console.html#snapshot-restore
Question 28 of 65
28. Question
A Data Analyst is investigating an ETL performance issue, which occurred after a huge amount of data was loaded into a table residing in an Amazon Redshift Cluster. The COPY command ran for the expected duration, but the regular VACUUM job took 3 hours longer than usual to complete. The analyst later discovered that a schema change had been made to the table since the last run. Furthermore, no other user was logged in or ran another VACUUM process in the cluster.
Which of the following are most likely causing the latency? (Select TWO)
Correct
Amazon Redshift can automatically sort and perform a VACUUM DELETE operation on tables in the background. To clean up tables after a load or a series of incremental updates, you can also run the VACUUM command, either against the entire database or against individual tables.
Only the table owner or a superuser can effectively vacuum a table. If you don't have owner or superuser privileges for a table, a VACUUM operation that specifies a single table fails. If you run a VACUUM of the entire database without specifying a table name, the operation completes successfully. However, the operation has no effect on tables for which you don't have owner or superuser privileges.
The VACUUM process re-sorts rows and reclaims space in either a specified table or all tables in the current database in Amazon Redshift. By default, VACUUM skips the sort phase for any table where more than 95 percent of the table's rows are already sorted. Skipping the sort phase can significantly improve VACUUM performance. A VACUUM operation can slow down for the following reasons:
A high percentage of unsorted data
Large table with too many columns
Interleaved sort key usage
Irregular or infrequent use of VACUUM
Concurrent write operations from cluster queries, DDL statements, or ETL jobs
Use the svv_vacuum_progress query to check the status and details of your VACUUM operation. It is recommended to use the VACUUM command with the BOOST option. The BOOST option allocates additional resources to VACUUM, such as available memory and disk space. With the BOOST option, VACUUM operates in one window and blocks concurrent deletes and updates for the duration of the VACUUM operation.
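For illustration, a sketch of checking progress and re-running the vacuum with BOOST through the Redshift Data API; the cluster, database, and table names are hypothetical:

import boto3

client = boto3.client("redshift-data")

def run(sql: str) -> str:
    """Submit a statement against a hypothetical cluster and return its statement ID."""
    return client.execute_statement(
        ClusterIdentifier="example-warehouse-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )["Id"]

# Check how far the current (or last) VACUUM has progressed.
progress_id = run("SELECT * FROM svv_vacuum_progress;")

# Re-run the vacuum with additional resources; BOOST blocks concurrent
# deletes and updates on the table for the duration of the operation.
vacuum_id = run("VACUUM FULL sensor_readings BOOST;")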
Hence, the correct answers are:
The table has ten more columns than it did during the previous run.
The source data was not loaded in sort key order.
All other options are incorrect based on the explanation above. If they were true, performance would have been better:
After the load, the table has a very low percentage of unsorted data.
The VACUUM operation is run too frequently.
The VACUUM operation was run with the BOOST option.
References: https://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html https://docs.aws.amazon.com/redshift/latest/dg/vacuum-managing-vacuum-times.html https://aws.amazon.com/premiumsupport/knowledge-center/redshift-vacuum-performance/
Question 29 of 65
29. Question
A Data Analyst wants to review the operational expenses of the company's cloud infrastructure. One suggestion was to manage the costs incurred from Amazon Athena and ensure that processes have a quota on the amount of data scanned in Amazon S3. If the total amount of data scanned for all queries made by the team breaches the threshold, the system should notify the manager via an e-mail alert. What should the manager do to achieve this?
Correct
Amazon Athena allows you to set two types of cost controls: a per-query limit and a per-workgroup limit (also known as the workgroup-wide data usage control limit). Workgroups allow you to set data usage control limits per query or per workgroup, set up alarms when those limits are exceeded, and publish query metrics to CloudWatch. The per-query control limit specifies the total amount of data scanned per query; if any query that runs in the workgroup exceeds the limit, it is canceled.
Use workgroups to separate users, teams, applications, or workloads, to set limits on the amount of data that each query or the entire workgroup can process, and to track costs. Because workgroups act as resources, you can use resource-level identity-based policies to control access to a specific workgroup. You can also view query-related metrics in Amazon CloudWatch, control costs by configuring limits on the amount of data scanned, and create thresholds that trigger actions, such as an Amazon SNS notification, when those thresholds are breached. (A sketch of this setup follows the references below.)
Thus, the correct answer is: For each workgroup used, configure the workgroup data usage control limit to the prescribed threshold and send a notification via an SNS topic.
The option that says: Modify the Amazon S3 bucket policy to send an e-mail whenever the amount of data scanned breaches the threshold is incorrect because you cannot set those thresholds in Athena using S3 bucket policies.
The option that says: Modify the primary workgroup and set the per query data usage control limit to the prescribed threshold is incorrect because the per-query control limit automatically cancels queries, which was not mandated in the scenario. This option doesn't provide a way to send an SNS notification for the breached threshold.
The option that says: Write an AWS Lambda function that triggers an Amazon SNS topic to send an e-mail to the manager whenever the prescribed threshold is breached. Configure Athena to invoke the Lambda function is incorrect because it is unnecessary to write an AWS Lambda function when an Amazon Athena workgroup provides the same capability. You have to set the control limit in the Athena workgroup to achieve this requirement.
References: https://docs.aws.amazon.com/athena/latest/ug/manage-queries-control-costs-with-workgroups.html https://docs.aws.amazon.com/athena/latest/ug/control-limits.html https://docs.aws.amazon.com/athena/latest/ug/workgroups-setting-control-limits-cloudwatch.html
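For concreteness, a sketch of wiring this up with boto3, assuming the workgroup publishes metrics to CloudWatch and the workgroup-wide alarm is built on the AWS/Athena ProcessedBytes metric with a WorkGroup dimension; every name, threshold, and ARN below is hypothetical:

import boto3

athena = boto3.client("athena")
cloudwatch = boto3.client("cloudwatch")

# Workgroup with a hypothetical 1 GB per-query scan cutoff and
# CloudWatch metrics publishing enabled.
athena.create_work_group(
    Name="analytics-team",
    Configuration={
        "BytesScannedCutoffPerQuery": 1_000_000_000,
        "PublishCloudWatchMetricsEnabled": True,
    },
)

# Workgroup-wide control: alarm when the total data scanned by the workgroup
# exceeds a hypothetical 1 TB per day, publishing to an SNS topic that
# e-mails the manager.
cloudwatch.put_metric_alarm(
    AlarmName="analytics-team-daily-scan-limit",
    Namespace="AWS/Athena",
    MetricName="ProcessedBytes",
    Dimensions=[{"Name": "WorkGroup", "Value": "analytics-team"}],
    Statistic="Sum",
    Period=86400,
    EvaluationPeriods=1,
    Threshold=1_000_000_000_000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:athena-cost-alerts"],
)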
Question 30 of 65
30. Question
A research group is building a prototype for a large-scale soil health monitoring system. The project aims to help local farmers collect useful metrics in tracking soil quality to forecast irrigation needs.
The group plans to build an Amazon S3 data lake to store metrics in .csv format and query data through Amazon Athena. As the data lake grows, the group wants to improve query performance by optimizing their storage solution.
Which optimization methods could be done to satisfy these requirements? (Select THREE.)
Correct
Amazon Athena supports a wide variety of data formats such as CSV, TSV, JSON, and plain text files, and also supports open-source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO, and GZIP formats. By compressing, partitioning, and using columnar formats, you can improve performance and reduce your costs.
Parquet and ORC file formats both support predicate pushdown (also called predicate filtering). Parquet and ORC both have blocks of data that represent column values. Each block holds statistics for the block, such as max/min values. When a query is being executed, these statistics determine whether the block should be read or skipped.
To further reduce costs, you can create the S3 bucket in the same Region where you run Athena, since querying data across Regions incurs additional data transfer charges.
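One way to verify these optimizations is to compare the bytes Athena reports scanning before and after converting to Parquet; a sketch with boto3, where the database, tables, query, and output location are hypothetical:

import time
import boto3

athena = boto3.client("athena")

def data_scanned(query: str) -> int:
    """Run a query and return how many bytes Athena scanned for it."""
    qid = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "soil_metrics"},
        ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
    )["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    return execution["Statistics"]["DataScannedInBytes"]

print(data_scanned("SELECT avg(moisture) FROM readings_csv"))
print(data_scanned("SELECT avg(moisture) FROM readings_parquet"))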
Hence, the correct answers are:
Reduce the data transfer I/O by compressing the S3 objects.
Convert the .csv data into an Apache Parquet format to reduce I/O by returning only the relevant data blocks required by predicate pushdowns.
Build the data lake in the same Region as Amazon Athena.
The option that says: Randomize prefix naming for the keys in the Amazon S3 data lake to increase throughput across partitions is incorrect because you no longer have to randomize prefix naming to gain a performance boost; Amazon S3 now automatically scales to high request rates regardless of the key naming scheme.
The option that says: Build the data lake in the same account as Amazon Athena is incorrect. Being in the same account has no bearing on performance; what matters is that the data lake is in the same AWS Region where you run Amazon Athena.
The option that says: Convert the .csv data into a JSON format to reduce I/O by returning only the relevant document keys required by the query is incorrect because JSON is not a columnar format. Therefore, you wont get any performance benefits from using it.
References: https://aws.amazon.com/athena/faqs/ https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
Question 31 of 65
31. Question
A startup is using a data catalog to store, annotate, and share metadata in the cloud. The metadata tables in the catalog determine the structure of data in various data stores including its other attributes. The data stores being used are Amazon S3, Amazon RDS, Amazon Redshift, and Amazon DynamoDB. A Data Analyst needs to create a solution that will populate the data catalog on a scheduled basis.
Which of the following can be done to achieve this requirement with the least amount of effort?
Correct
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to set up or manage. Use the AWS Glue console to discover your data, transform it, and make it available for search and querying.
In this scenario, a startup is using AWS Glue Data Catalog to store, annotate, and share metadata in the cloud. To populate metadata into the AWS Glue Data Catalog, you can use AWS Glue crawlers. The crawlers will scan different data stores and automatically infer schemas and partition structure to populate the catalog with corresponding table definitions and statistics.
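A minimal sketch of defining such a scheduled crawler with boto3; the crawler name, IAM role, catalog database, targets, and cron expression are hypothetical (RDS and Redshift sources would be added as JDBC targets referencing a Glue connection):

import boto3

glue = boto3.client("glue")

# Hypothetical nightly crawler covering the S3 and DynamoDB data stores.
glue.create_crawler(
    Name="nightly-catalog-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    DatabaseName="startup_catalog",
    Targets={
        "S3Targets": [{"Path": "s3://example-data-lake/raw/"}],
        "DynamoDBTargets": [{"Path": "example-table"}],
    },
    Schedule="cron(0 2 * * ? *)",  # every day at 02:00 UTC
)

# glue.start_crawler(Name="nightly-catalog-crawler")  # run on demand if needed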
Hence, the correct answer is: Set up an AWS Glue crawler schedule to populate the data catalog.
The option that says: Create a DynamoDB table and use a Lambda function to process the records in the DynamoDB stream is incorrect because DynamoDB is a data store, not a data catalog.
The option that says: Create a data catalog using Amazon RDS and schedule the AWS Glue crawler to update the tables in the catalog is incorrect because you can't create a data catalog in Amazon RDS; this is only applicable to AWS Glue.
The option that says: Set up an Apache Hive metastore in Amazon EMR and configure the AWS Glue crawler to connect to the data store is incorrect. Although you can integrate the AWS Glue Data Catalog as the metastore for Apache Hive, this option entails a lot of effort to implement since you have to perform several configuration and maintenance tasks for the Amazon EMR cluster. A better solution is to create an AWS Glue crawler schedule to populate the data catalog on a regular basis.
References: https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html https://aws.amazon.com/glue/faqs/
Question 32 of 65
32. Question
A company is working on an analytics platform that ingests a vast amount of CSV-formatted data from multiple sources and stores it in an Amazon S3 Standard bucket. The S3 bucket is expected to store around 25 GB of raw data every day. The company runs Amazon Athena aggregate functions to gain a summarized view of the most recent 6 months of data, after which the data becomes infrequently accessed. The company's policy requires raw data to be archived 2 years after creation. An average query scans around 200 MB of data and returns a response in under a minute. A data engineer is required to optimize the cost of running the platform.
Which set of steps should the data engineer do?
Correct
In Athena, tables and databases are containers for the metadata definitions that define a schema for underlying source data. For each dataset, a table needs to exist in Athena. The metadata in the table tells Athena where the data is located in Amazon S3, and specifies the structure of the data, for example, column names, data types, and the name of the table. Databases are a logical grouping of tables, and also hold only metadata and schema information for a dataset.
Apache Parquet and Apache ORC are popular columnar data stores. They provide features that store data efficiently by employing column-wise compression, different encoding, compression based on data type, and predicate pushdown. They are also splittable. Generally, better compression ratios or skipping blocks of data means reading fewer bytes from Amazon S3, leading to better query performance.
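One common way to produce such columnar files from raw CSV data is an AWS Glue ETL job; a sketch follows, assuming the raw data is already registered in the Data Catalog, with a hypothetical database, table, output path, and partition key:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the cataloged raw CSV table and rewrite it as Parquet
# (Snappy-compressed by default), partitioned by ingest date.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_raw",
    table_name="daily_csv",
)

glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={
        "path": "s3://example-analytics-bucket/processed/",
        "partitionKeys": ["ingest_date"],
    },
    format="parquet",
)

job.commit()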
For the correct answer, we can create an AWS Glue ETL job to transform the raw CSV data into a columnar format. To manage your objects so that they are stored cost-effectively throughout their lifecycle, configure an Amazon S3 Lifecycle. An S3 Lifecycle configuration is a set of rules that define actions that Amazon S3 applies to a group of objects. In the scenario, we can create two lifecycle rules: one for archiving raw data to Amazon S3 Glacier 2 years after object creation, and one for moving processed data to the S3 Standard-IA tier after 6 months.
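And a sketch of those two lifecycle rules with boto3, assuming raw and processed data live under separate, hypothetical prefixes of the same bucket:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                # Processed (columnar) data: move to Standard-IA after ~6 months.
                "ID": "processed-to-standard-ia",
                "Filter": {"Prefix": "processed/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 180, "StorageClass": "STANDARD_IA"}],
            },
            {
                # Raw CSV data: archive to S3 Glacier 2 years after creation.
                "ID": "raw-to-glacier",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 730, "StorageClass": "GLACIER"}],
            },
        ]
    },
)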
Athena can only query the latest version of data on a versioned Amazon S3 bucket and cannot query previous versions of the data. You must have the appropriate permissions to work with data in the Amazon S3 location. Athena does not support querying the data in the GLACIER storage class. It ignores objects transitioned to the GLACIER storage class based on an Amazon S3 lifecycle policy.
Hence, the correct answer is the option that says: Compress, partition, and transform the raw data into a columnar data format using an AWS Glue ETL job. Then, query the processed data using Amazon Athena. Create a lifecycle policy that will transfer the processed data in the S3 Standard-IA storage class 6 months after object creation. Create another lifecycle policy that will archive the raw data into Amazon S3 Glacier after 2 years.
The option that says: Compress, partition, and transform the raw data into a columnar data format using an AWS Glue ETL job. Then, query the processed data using Amazon Athena. Create a lifecycle policy that will transfer the processed data in the S3 Standard-IA storage class 6 months after object creation. Create another lifecycle policy that will archive the raw data into Amazon S3 Glacier based on the last date that the object was accessed is incorrect. An Amazon S3 lifecycle policy transitions objects based on the object's age, not on the date it was last accessed.
The option that says: Compress, partition, and transform the raw data into a row-based data format using an AWS Glue ETL job. Then, query the processed data using Amazon Athena. Create a lifecycle policy that will transfer the processed data in the S3 Glacier storage class 6 months after object creation. Enable expedited retrieval. Create another lifecycle policy that will archive the raw data into Amazon S3 Glacier Deep Archive after 2 years is incorrect. Amazon Athena does not support querying data in the GLACIER storage class; Athena ignores objects transitioned to the GLACIER storage class by an Amazon S3 lifecycle policy. A row-based data format is also not suitable for aggregating large amounts of data.
The option that says: Compress, partition, and transform the raw data into a row-based data format using an AWS Glue ETL job. Then, query the processed data using Amazon Athena. Create a lifecycle policy that will transfer the processed data in the S3 Standard-IA storage class 6 months after object creation. Create another lifecycle policy that will archive the raw data into Amazon S3 Glacier based on the last date that the object was accessed is incorrect. A row-based data format performs worse than a columnar data format, and it is difficult to aggregate data from different sources row by row. A columnar data format is faster at aggregating large volumes of data.
References: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html https://aws.amazon.com/s3/storage-classes/
Question 33 of 65
33. Question
A pharmaceutical company uses Amazon Redshift as a data warehousing solution. The results of simulations for new substances are stored on Amazon S3, and queries are run on the Redshift cluster to analyze these results. With an increasing number of scientists on the team, the compute nodes of the Redshift cluster can't keep up with all the requests they receive. The Data Analyst wants to solve this problem by resizing the cluster to meet the changing demands of the scientists.
Which of the following is the FASTEST way to accomplish this?
Correct
As your data warehousing capacity and performance needs change or grow, you can resize your cluster to make the best use of the computing and storage options that Amazon Redshift provides. You can use elastic resize to scale your cluster by changing the node type and number of nodes. Or, if your new node configuration is not available through elastic resize, you can use classic resize.
To resize your cluster, use one of the following approaches:
Elastic resize: To quickly add or remove nodes from an existing cluster, use elastic resize. You can use it to change the node type, the number of nodes, or both. If you only change the number of nodes, queries are temporarily paused and connections are held open if possible. During the resize operation, the cluster is read-only. Typically, elastic resize takes 10–15 minutes.
Classic resize: Use classic resize to change the node type, number of nodes, or both. Choose this option when you are resizing to a configuration that isn't available through elastic resize, for example, to or from a single-node cluster. A classic resize copies tables to a new cluster, and the source cluster remains read-only until the resize operation finishes. Typically, classic resize takes 2 hours to 2 days or longer, depending on your data's size.
Snapshot and restore with classic resize: To keep your cluster available during a classic resize, you can first make a copy of the existing cluster, then resize the new cluster. Keep in mind that all data written to the source cluster after the snapshot is taken must be manually copied to the target cluster after the migration.
You can resize (both elastic resize and classic resize) your cluster on a schedule.
Elastic resize is the fastest method to resize a cluster. You can use elastic resize to add or remove nodes and change node types for an existing cluster. When a cluster is resized using elastic resize with the same node type, it automatically redistributes the data to the new nodes. Because it doesn't create a new cluster in this scenario, the elastic resize operation completes quickly, usually in a few minutes. You might notice a slight increase in execution time for some queries while the data is redistributed in the background.
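As a rough sketch of how an elastic resize could be requested programmatically, the Redshift API exposes a ResizeCluster operation; the cluster identifier and node count below are hypothetical:

```python
import boto3

redshift = boto3.client("redshift")

# Request an elastic resize (Classic=False); the call fails if the target
# configuration is only reachable through a classic resize.
redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
    NumberOfNodes=8,                        # scale out to absorb the extra query load
    Classic=False,
)
```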
Therefore the correct answer is: Use Elastic Resize to manually add or remove nodes on the cluster to handle the load.
The option that says: Use Classic Resize to manually add or remove nodes on the cluster to handle the load is incorrect because this type of operation takes several hours to complete.
The option that says: Use Auto Resize and have AWS automatically add or remove nodes on the cluster depending on the load is incorrect. You can schedule a cluster to resize at specified times, but there is no native Auto Resize feature on Redshift clusters.
The option that says: Since Amazon Redshift does not support online resize, create a snapshot of the current cluster, then restore the snapshot to a new cluster specifying a new size is incorrect because Redshift supports online resize. This approach takes several hours to complete.
References:
https://aws.amazon.com/premiumsupport/knowledge-center/resize-redshift-cluster/
https://docs.aws.amazon.com/redshift/latest/mgmt/managing-cluster-operations.html#elastic-resize
Question 34 of 65
34. Question
A company runs a Reserved Amazon EC2 instance to process ETL jobs before sending the results into an Amazon Redshift cluster. Because of scaling issues, the company eventually replaced the EC2 instance with AWS Glue and Amazon S3. Since the architecture has changed, the Data Analyst must also make necessary changes in the workflow. Part of the new process is to save the Redshift query results to an external storage for occasional analysis.
Which of the following methods is the most cost-efficient solution for the new process?
Correct
Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.
The UNLOAD command unloads the result of an Amazon Redshift query to your Amazon S3 data lake in Apache Parquet, an efficient open columnar storage format for analytics. Parquet format is up to 2x faster to unload and consumes up to 6x less storage in Amazon S3, compared with text formats. This enables you to save the data transformation and enrichment you have done in Amazon Redshift into your Amazon S3 data lake in an open format. You can then analyze your data with Redshift Spectrum and other AWS services such as Amazon Athena, Amazon EMR, and Amazon SageMaker.
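As a hedged sketch of what this workflow step might look like, the UNLOAD statement below exports query results to S3 in Parquet, here submitted through the Redshift Data API; the table, bucket, IAM role, cluster, database, and user names are all hypothetical:

```python
import boto3

client = boto3.client("redshift-data")

# UNLOAD the query results to S3 as Parquet; every identifier below is hypothetical.
unload_sql = """
UNLOAD ('SELECT * FROM etl_results')
TO 's3://example-results-bucket/unload/etl_results_'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftUnloadRole'
FORMAT AS PARQUET;
"""

client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql=unload_sql,
)
```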
Hence, the correct answer is: Save the Redshift query results to an Amazon S3 bucket using the UNLOAD command.
The option that says: Save the Redshift query results to an external table in Amazon Redshift Spectrum using the COPY command is incorrect because you can't COPY to an Amazon Redshift Spectrum external table. Amazon Redshift Spectrum external tables are read-only.
The option that says: Save the Redshift query results to an Amazon S3 bucket using the COPY command is incorrect because the COPY command is used to load data from a data source to a table and not the other way around.
The option that says: Save the Redshift query results to an external table in Amazon Redshift Spectrum using the UNLOAD command is incorrect because the UNLOAD command does not support Redshift Spectrum.
References:
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
Question 35 of 65
35. Question
A Data Engineer working for an advertising agency has to perform an advertisement split testing (A/B testing) for a customer. She needs to collate data based on user feedback and social media reactions. The collected data will be processed and analyzed to identify which ad is more effective. For future analysis, the Data Engineer must catalog the data on a data storage as key-value pairs that require immediate access. She should also have the ability to read, write, and manage petabytes of data using a SQL-like interface. A solution with low operational overhead is preferred.
Which method meets these requirements?
Correct
Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. With EMR you can run Petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark. For short-running jobs, you can spin up and spin down clusters and pay per second for the instances used. For long-running workloads, you can create highly available clusters that automatically scale to meet demand.
Apache Hive is an open-source, distributed, fault-tolerant system that provides data warehouse-like query capabilities. It enables users to read, write, and manage petabytes of data using a SQL-like interface.
Apache Hadoop and NoSQL databases are complementary technologies that together provide a powerful toolbox for managing, analyzing, and monetizing Big Data. AWS offers out-of-the-box Amazon Elastic MapReduce (Amazon EMR) integration with Amazon DynamoDB, providing customers an integrated solution that eliminates the often prohibitive costs of administration, maintenance, and upfront hardware.
Customers can now move vast amounts of data into and out of DynamoDB, as well as perform sophisticated analytics on that data, using EMR's highly parallelized environment to distribute the work across the number of servers of their choice.
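To give a concrete feel for this integration, the sketch below uploads a HiveQL script that maps a DynamoDB table into Hive via the DynamoDB storage handler and submits it as a step on an existing EMR cluster. The cluster ID, bucket, table, and column names are all hypothetical:

```python
import boto3

# HiveQL that exposes a DynamoDB table to Hive; table and column names are hypothetical.
hive_ddl = """
CREATE EXTERNAL TABLE ab_test_feedback (user_id string, ad_variant string, reaction string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "AdFeedback",
  "dynamodb.column.mapping" = "user_id:UserId,ad_variant:AdVariant,reaction:Reaction"
);
"""

# Stage the script in S3, then run it as a Hive step on an existing cluster.
s3 = boto3.client("s3")
s3.put_object(Bucket="example-scripts-bucket", Key="create_table.hql", Body=hive_ddl)

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical EMR cluster ID
    Steps=[{
        "Name": "create-dynamodb-backed-hive-table",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://example-scripts-bucket/create_table.hql"],
        },
    }],
)
```

Once the external table exists, Hive queries against it read from and write to the DynamoDB table directly, which is what allows the key-value data to be managed with a SQL-like interface.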
Hence, the correct answer is: Analyze data with Apache Hive on Amazon EMR. Save the data to an Amazon DynamoDB table.
The option that says: Automate the data transformation with AWS Data Pipeline and use Amazon Kinesis Data Analytics for analysis. Save the data to an Amazon EBS Cold HDD (sc1) volume is incorrect because this solution requires more overhead. An Amazon EBS volume has to be mounted on an EC2 instance, which you need to manage and monitor regularly. In addition, Cold HDD volumes are not suitable for storing frequently accessed data.
The option that says: Automate the data transformation with AWS Data Pipeline and use Amazon Redshift Spectrum for analysis. Save the data to Amazon S3 Glacier is incorrect because S3 Glacier does not support immediate access.
The option that says: Stream and process data with Amazon Kinesis Data Firehose. Save the data to Amazon S3 Standard-IA and use Amazon Athena for analysis is incorrect. Although this could be possible, Amazon DynamoDB is more suitable for storing key-value pairs.
References:
https://aws.amazon.com/emr/faqs/
https://aws.amazon.com/blogs/aws/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb/
Question 36 of 65
36. Question
A company uses an S3 bucket as a repository for millions of raw data objects generated by an application. The bucket is configured with the S3 Intelligent-Tiering storage class to deliver automatic cost savings. The solution is also required to:
Support Open Database Connectivity (ODBC) connections
Manage metadata that allows federation for access control
Perform ETL batch workloads using PySpark and Scala
As much as possible, the operational management of the solution should be limited.
Which of the following combinations of services can satisfy the requirements? (Select THREE.)
Correct
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. You can create and run an ETL job with a few clicks in the AWS Glue visual editor.
The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. To create your data warehouse or data lake, you must catalog this data. The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics of your data. You use the information in the Data Catalog to create and monitor your ETL jobs. Information in the Data Catalog is stored as metadata tables, where each table specifies a single data store.
AWS Glue supports an extension of the PySpark Scala dialect for scripting extract, transform, and load (ETL) jobs.
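For example, a minimal AWS Glue PySpark job might read the cataloged raw data and write it back out as Parquet; the database, table, and output path names below are hypothetical:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name and set up the contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw data through the Data Catalog table that describes the S3 bucket.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_data_db", table_name="raw_events"
)

# Write the batch output back to S3 in a columnar format.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/processed/"},
    format="parquet",
)

job.commit()
```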
Customers can connect to Amazon Athena using ODBC and JDBC drivers. This allows you to report and visualize all of your data in S3 with the tools of your choice.
Based on the preceding explanation, we can achieve the requirements by using AWS Glue and Amazon Athena since they are both serverless, which eliminates operational management.
Hence, the correct answers are:
Use AWS Glue Data Catalog for managing the metadata.
AWS Glue with PySpark Scala dialect for ETL batch workloads
Use Amazon Athena for querying data in the Amazon S3 bucket using ODBC drivers
The following options are incorrect. Although EMR can meet the rest of the technical requirements, it is not a serverless solution and thus entails significantly more operational management effort:
Amazon Elastic MapReduce (EMR) with Apache Spark for ETL batch workloads
Use Amazon Elastic MapReduce (EMR) with Apache Hive for ODBC clients.
Launch a new Amazon Elastic MapReduce (EMR) cluster with Apache Hive that uses an Amazon RDS with MySQL-compatible backed metastore.
References:
https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html
https://aws.amazon.com/about-aws/whats-new/2017/11/amazon-athena-adds-support-for-querying-data-using-an-odbc-driver/
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-scala.html
Question 37 of 65
37. Question
A game retail company stores user purchase data in a MySQL database hosted on Amazon RDS. The company regularly runs queries and analytical workloads against the most recent three months' worth of data, which is expected to be several terabytes in size. The older historical data needs to be stored outside the database but is still used for quarterly trend reports. To generate these reports, the historical data is joined with the more recent data.
Which of the following options will provide optimal performance and a cost-effective solution based on the requirements?
Correct
With Redshift Spectrum, Amazon Redshift customers can easily query their data in Amazon S3. Like Amazon EMR, you get the benefits of open data formats and inexpensive storage, and you can scale out to thousands of nodes to pull data, filter, project, aggregate, group, and sort. Like Amazon Athena, Redshift Spectrum is serverless and there's nothing to provision or manage. You just pay for the resources you consume for the duration of your Redshift Spectrum query. Like Amazon Redshift itself, you get the benefits of a sophisticated query optimizer, fast access to data on local disks, and standard SQL. And like nothing else, Redshift Spectrum can execute highly sophisticated queries against an exabyte of data or more in just minutes.
In the scenario, data from the MySQL database will be incrementally loaded to an S3 bucket. Only the current data will be stored in an Amazon Redshift table. Redshift is an excellent choice for frequent analytical queries of data that are terabytes in size. Since Redshift Spectrum can work as an intermediate layer between a Redshift table and an S3 bucket, joining the historical data (from S3) and current data (from Redshift table) will both optimize performance and cost. Redshift Spectrum uses the same resources you used on the Redshift database and it only charges for data queried from Amazon S3.
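To make the pattern concrete, here is a hedged sketch of registering the historical S3 data as an external schema and combining it with the current table; the connection details, IAM role, schema, and table names are all hypothetical:

```python
import redshift_connector

# All connection details and object names below are hypothetical.
conn = redshift_connector.connect(
    host="analytics-cluster.example.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="analyst",
    password="example-password",
)
conn.autocommit = True
cursor = conn.cursor()

# Point an external schema at the Glue Data Catalog database that describes the S3 data.
cursor.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_hist
    FROM DATA CATALOG DATABASE 'purchase_history'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';
""")

# Combine historical purchases in S3 with current purchases stored in Redshift.
cursor.execute("""
    SELECT product_id, SUM(amount) AS total_sales
    FROM (
        SELECT product_id, amount FROM spectrum_hist.purchases
        UNION ALL
        SELECT product_id, amount FROM public.purchases_current
    ) AS all_purchases
    GROUP BY product_id;
""")
rows = cursor.fetchall()
```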
Therefore, the correct answer is: Schedule the export of your data from Amazon RDS to Amazon S3 every day. Load a year's worth of data in Amazon Redshift and use it to run the regular queries. Use an Amazon Redshift Spectrum table to join the historical and current data.
The option that says: Use AWS Glue to perform ETL and incrementally load a years worth of data into an Amazon Redshift cluster. Run the regular queries against this cluster. Create an AWS Glue Data Catalog of the data in Amazon S3 and use Amazon Athena to join the historical and current data to generate the reports is incorrect. For this method to work, you'd first have to copy the more recent data from the Redshift cluster to Amazon S3 because Amazon Athena only works with data stored in S3. The per-data-scanned pricing for Redshift Spectrum and Athena is the same. Since there is already a working Redshift cluster, taking advantage of its compute power to run join queries using Redshift Spectrum would be a better option than Athena. Take note that Amazon Athena relies on pooled resources that AWS manages, while Redshift Spectrum runs on the dedicated resources allocated to your Redshift cluster, giving you more consistent and more optimized performance.
The option that says: Set up a multi-AZ RDS database and run automated snapshots on the standby instance. Configure Amazon Athena to run historical queries on the S3 bucket containing the automated snapshots is incorrect because automated RDS snapshots are not viewable on any S3 bucket that you own. You have to export the snapshot first in order to access it. Remember that the underlying S3 bucket that hosts the automated snapshots is owned by AWS and not you. On the other hand, manual snapshots are different since you own the S3 bucket where they are stored. After exporting the snapshot data on S3, you will be able to query using Amazon Athena.
The option that says: Sync a years worth of data on an Amazon RDS read replica. Export the older historical data into an Amazon S3 bucket for long-term storage. From the data in Amazon S3 and Amazon RDS, create an AWS Glue Data Catalog and use Amazon Athena to run historical queries to generate the reports is incorrect because you can't select a subset of data to be synced on an RDS read replica. Additionally, analytical workloads are better optimized on an OLAP database like Amazon Redshift.
References:
https://aws.amazon.com/blogs/big-data/amazon-redshift-spectrum-extends-data-warehousing-out-to-exabytes-no-loading-required/
https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
Question 38 of 65
38. Question
A Data Analyst found an anomaly while processing data from an Amazon Kinesis data stream with a custom consumer built using the Kinesis Java SDK. The records did not arrive in order even though the data was written to the proper Kinesis shard, partitioned by user ID, and in the correct order.
The Data Analyst later observed that every time the stream was resharded, the out-of-order records from the same user ID arrived from different shards.
What is most likely causing the issue and how should the Data Analyst fix this?
Correct
Kinesis Data Streams is a real-time data streaming service, which is to say that your applications should assume that data is flowing continuously through the shards in your stream. Resharding enables you to increase or decrease the number of shards in a stream to adapt to changes in the stream's data flow rate.
When you reshard, data records that were flowing to the parent shards are re-routed to flow to the child shards based on the hash key values that the data-record partition keys map to. However, any data records that were in the parent shards before the reshard remain in those shards. In other words, the parent shards do not disappear when the reshard occurs. They persist along with the data they contained before the reshard. The data records in the parent shards are accessible using the getShardIterator and getRecords operations in the Kinesis Data Streams API, or through the Kinesis Client Library.
After the reshard has occurred and the stream is again in an ACTIVE state, you could immediately begin to read data from the child shards. However, the parent shards that remain after the reshard could still contain data that you haven't yet read, data that was added to the stream before the reshard. If you read data from the child shards before reading all data from the parent shards, you could read data for a particular hash key out of the order given by the data records' sequence numbers. Therefore, assuming that the data order is essential, after a reshard you should always continue to read data from the parent shards until they are exhausted. Only then should you begin reading data from the child shards.
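A simplified consumer sketch that drains parent shards before their children is shown below; the stream name and record handling are hypothetical, and a real application would typically let the Kinesis Client Library manage shard lineage for it:

```python
import boto3

kinesis = boto3.client("kinesis")
STREAM = "example-user-events-stream"  # hypothetical stream name

shards = kinesis.list_shards(StreamName=STREAM)["Shards"]
shard_ids = {s["ShardId"] for s in shards}

def drain_shard(shard_id):
    """Read one shard from its oldest record. A closed (parent) shard returns a
    null NextShardIterator once fully read; an open shard stops when caught up."""
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    while iterator:
        response = kinesis.get_records(ShardIterator=iterator, Limit=1000)
        for record in response["Records"]:
            print(record["PartitionKey"], record["SequenceNumber"])  # hypothetical handler
        iterator = response.get("NextShardIterator")
        if response.get("MillisBehindLatest") == 0:
            break

# In this single-reshard scenario, shards whose parent is no longer listed are the
# (closed) parents; process them fully before the child shards created by the reshard.
parents = [s for s in shards if s.get("ParentShardId") not in shard_ids]
children = [s for s in shards if s.get("ParentShardId") in shard_ids]
for shard in parents + children:
    drain_shard(shard["ShardId"])
```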
Hence, the correct answer is the option that says: The Kinesis consumer is not processing the parent shard completely before processing the child shards after the stream has been resharded. The parent shard must be processed completely first before processing the rest of the child shards.
The option that says: The source of the Kinesis data stream uses a PutRecord API call to write incoming data. Use PutRecords API call with the SequenceNumberforOrdering parameter instead is incorrect because it was already confirmed that the data is written from the producer to the stream in the proper order. Furthermore, PutRecords API does not have a SequenceNumberforOrdering parameter.
The option that says: A stream configured to have multiple shards cannot maintain the order of data. Decrease the shards in the stream to a single shard and stop any future stream reshard is incorrect. Kinesis Data Streams maintains the order of records within each shard (that is, per partition key), so reducing the stream to a single shard is unnecessary and would only limit throughput.
The option that says: The hash key generation process for the records is malfunctioning as it writes to the stream. Rectify the process to generate an explicit hash key on the producer side so the records are accurately directed to the appropriate shard is incorrect. The Data Analyst already confirmed that the data is written from the producer to the stream in the proper order.
References:
https://docs.aws.amazon.com/streams/latest/dev/kinesis-using-sdk-java-after-resharding.html
https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-scaling.html
https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html
Question 39 of 65
39. Question
A large enterprise plans to query data that resides in multiple AWS accounts from a central data lake. Each business unit has a separate account that uses an Amazon S3 bucket to store data unique to its business. Each account also maintains its own data catalog using the AWS Glue Data Catalog.
The administrator was tasked to enforce role-based access controls for the data lake. Junior Data Analysts from each unit should only have read access to their own unit's data. Senior Data Analysts, on the other hand, are allowed access to the data of all business units, but only to specific columns.
Which solution will minimize operational overhead and reduce overall costs while meeting the required access patterns?
Correct
AWS Lake Formation allows cross-account access to Data Catalog metadata and underlying data. Large enterprises typically use multiple AWS accounts, and many of those accounts might need access to a data lake managed by a single AWS account. Users and AWS Glue ETL (extract, transform, and load) jobs can query and join tables across multiple accounts and still take advantage of Lake Formation table-level and column-level data protections.
Sharing Data Catalog databases and tables (Data Catalog resources) with other AWS accounts enables users to run queries and jobs that can join and query tables across multiple accounts. With some restrictions, when you share a Data Catalog resource with another account, principals in that account can operate on that resource as if the resource were in their Data Catalog. To share a Data Catalog resource, you grant one or more Lake Formation permissions with the grant option on the resource to an external account.
You don't share resources with specific principals in external AWS accounts; you share the resources with an AWS account or organization. When you share a resource with an AWS organization, you're sharing the resource with all accounts at all levels in that organization. The data lake administrator in each external account must then grant permissions on the shared resources to principals in their account.
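To illustrate the grant flow described above, here is a minimal boto3 sketch in which the account that owns the catalog grants column-level SELECT on one table to an external account. The account ID, database, table, and column names are hypothetical placeholders.

import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on two specific columns of a shared table to an external account.
# The external account's data lake administrator then re-grants access to its own
# principals (for example, the Senior Data Analyst role).
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "111122223333"},   # hypothetical external account ID
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_bu",                          # hypothetical database
            "Name": "transactions",                              # hypothetical table
            "ColumnNames": ["region", "monthly_revenue"],        # only the columns Senior Analysts may query
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"],
)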
Hence, the correct answer is: Build a data lake storage in individual AWS accounts. Catalog data across multiple accounts to the central data lake account using AWS Lake Formation. Update the S3 bucket policy in each account to grant access to the AWS Lake Formation service-linked role. Use Lake Formation permissions to grant fine-grained access controls for the Senior Data Analysts to query specific tables and columns.
The option that says: Maintain the current account structure, create a secondary central data lake, and catalog data across multiple accounts to the new central data lake using AWS Glue Data Catalog. Grant cross-account access for AWS Glue in the central account to crawl data from the S3 buckets in various accounts to populate the catalog table. Grant fine-grained access controls in the Data Catalog and Amazon S3 to allow the Senior Data Analysts to query specific tables and columns is incorrect. Although this may be possible, it's a complex solution that gets even more complicated as more accounts are involved. AWS Glue uses IAM roles to grant cross-account access, which means the number of IAM roles that need to be configured grows with the number of AWS accounts. You can achieve the same fine-grained control through AWS Lake Formation permissions. Lake Formation uses a simpler GRANT/REVOKE permissions model similar to the GRANT/REVOKE commands in a relational database system.
The option that says: Use AWS Organizations to centrally manage all AWS accounts. Use AWS Glue to migrate all the data from the various S3 buckets in every account to the central data lake account. Grant fine-grained permissions to each user with the corresponding access to specific tables and columns using IAM roles is incorrect. Given the scenario, using AWS Organizations does not add any benefit in solving the problem; in fact, it just adds operational overhead. AWS Organizations is primarily used to define central configurations and security mechanisms and to simplify billing across multiple AWS accounts, not to manage data lakes.
The option that says: Build a data lake storage in individual AWS accounts. Create a central S3 bucket in the data lake account and use an AWS Lake Formation Blueprint to ingest data from the different S3 buckets into the central S3 bucket. Change the S3 bucket policy in each account to grant access to the AWS Lake Formation service-linked role. Use Lake Formation permissions to grant fine-grained access controls for the Junior and Senior Data Analysts to query specific tables and columns is incorrect because you don't have to give the Junior Data Analysts permission to access tables in Lake Formation. It is mentioned in the scenario that they must only read data that belongs to their respective accounts. This permission can easily be granted by modifying the bucket policy of the S3 bucket that their account uses. Also, you don't have to use a Lake Formation blueprint to move data from the different S3 buckets into a central S3 bucket, as you only need to perform queries; doing so would incur additional charges.
References: https://docs.aws.amazon.com/lake-formation/latest/dg/access-control-cross-account.html https://docs.aws.amazon.com/lake-formation/latest/dg/sharing-catalog-resources.html https://aws.amazon.com/blogs/big-data/access-and-manage-data-from-multiple-accounts-from-a-central-aws-lake-formation-account/
Question 40 of 65
40. Question
You work for a financial institution that is building a mobile application to capture images of personal checks. The data from the check must be captured and stored within five minutes of the image capture, and it must be stored in multiple availability zones. Which solution is the most secure and meets these requirements?
Correct
Kinesis Data Firehose persists data across three separate Availability Zones (AZs) to ensure that the data is not lost. Amazon Cognito gives you the ability to authenticate users within your native applications by providing temporary credentials for users so they are able to interact with various AWS resources.
References: Amazon Cognito User Pools, Amazon Kinesis Data Firehose FAQs
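To make the Cognito-to-Firehose flow above concrete, here is a rough Python (boto3) sketch. The identity pool ID, delivery stream name, and payload are hypothetical, and a real mobile app would use the AWS mobile SDKs and Cognito sign-in rather than an unauthenticated identity from a script.

import boto3, json

# 1. Exchange a Cognito identity for temporary AWS credentials.
ci = boto3.client("cognito-identity", region_name="us-east-1")
identity = ci.get_id(IdentityPoolId="us-east-1:11111111-2222-3333-4444-555555555555")  # hypothetical pool
creds = ci.get_credentials_for_identity(IdentityId=identity["IdentityId"])["Credentials"]

# 2. Use the temporary credentials to deliver the check data to Kinesis Data Firehose,
#    which buffers it and persists it durably across three Availability Zones.
firehose = boto3.client(
    "firehose",
    region_name="us-east-1",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretKey"],
    aws_session_token=creds["SessionToken"],
)
firehose.put_record(
    DeliveryStreamName="check-image-metadata",   # hypothetical delivery stream
    Record={"Data": json.dumps({"check_id": "abc-123", "amount": 125.50}).encode() + b"\n"},
)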
Question 41 of 65
41. Question
A media company has several Amazon S3 buckets for storing customer data. The company's security compliance policy requires all buckets to be encrypted with auditable access trails. The company's data engineer plans to use an Amazon EMR cluster with the EMR File System (EMRFS) to process and transform the data.
Which configuration will allow the cluster to access the encrypted data?
Correct
With Amazon EMR versions 4.8.0 and later, you can use a security configuration to specify settings for encrypting data at rest, data in transit, or both. When you enable at-rest data encryption, you can choose to encrypt EMRFS data in Amazon S3, data in local disks, or both. Each security configuration that you create is stored in Amazon EMR rather than in the cluster configuration, so you can easily reuse a configuration to specify data encryption settings whenever you create a cluster.
With S3 encryption on Amazon EMR, all the encryption modes use a single CMK by default to encrypt objects in S3. If you have highly sensitive content in specific S3 buckets, you may want to manage the encryption of these buckets separately by using different CMKs or encryption modes for individual buckets. You can accomplish this using the per bucket encryption overrides option in Amazon EMR.
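As a rough sketch of this, the security configuration below creates a default SSE-KMS setting plus a per-bucket CMK override. The bucket names and key ARNs are placeholders, and the exact override field names are an assumption to verify against the EMR security-configuration documentation.

import boto3, json

emr = boto3.client("emr")

security_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": False,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                # Default mode and key for EMRFS writes...
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/default-key-id",
                # ...overridden per bucket for the more sensitive data.
                "Overrides": [
                    {
                        "BucketName": "customer-pii-bucket",
                        "EncryptionMode": "SSE-KMS",
                        "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/pii-key-id",
                    }
                ],
            }
        },
    }
}

emr.create_security_configuration(
    Name="per-bucket-s3-encryption",
    SecurityConfiguration=json.dumps(security_config),
)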
Hence, the correct answer is: Modify the cluster's security configuration by delegating the appropriate CMKs for each bucket under the per bucket encryption overrides.
The option that says: Set the default encryption mode of the cluster's security configuration to use SSE-S3 is incorrect because SSE-S3 does not provide audit trails for key usage.
The option that says: Create an IAM role for each customer. Add an ALLOW statement to grant permission to the role to use the CMK in the Key Policy is incorrect because this has nothing to do with the EMR cluster's permission to read the encrypted buckets.
The option that says: Export the CMK from the AWS KMS Console. Create a copy of the CMK and store it on the master node. Configure the cluster to use the encryption key is incorrect. You're not allowed to export a CMK that was created using AWS KMS.
References: https://aws.amazon.com/blogs/big-data/secure-your-data-on-amazon-emr-using-native-ebs-and-per-bucket-s3-encryption-options/ https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html
Question 42 of 65
42. Question
A marketing company uses Amazon QuickSight to provide its clients with visual dashboards containing product sales figures and market trends. The dashboards show key metrics for each client's market performance. Every client has a user on Amazon QuickSight, and the dashboards are created from the CSV data files stored in the company's Amazon S3 bucket. To maintain client confidentiality, the data analytics team needs a solution to ensure that each QuickSight user can only view their own dashboards.
Which of the following should be implemented to achieve this requirement?
Correct
In the Enterprise edition of Amazon QuickSight, you can restrict access to a dataset by configuring row-level security (RLS) on it. You can do this before or after you have shared the dataset. Only the people you shared with can see any of the data. By adding row-level security, you can further control their access. To do this, you create a query or file that has one column named UserName, GroupName, or both. You can also think of this as adding a rule for that user or group.
To apply the dataset rules, you add the rules as a permissions dataset to your dataset. Then you choose to explicitly allow or deny access based on the dataset rules. Allowing access is the default.
Row-level security only works for fields containing textual data (string, char, varchar, and so on). It doesn't currently work for dates or numeric fields. Anomaly detection is not supported for datasets that use row-level security (RLS).
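For illustration only, the dataset rules for this scenario could be a small file like the one below, where UserName matches each client's QuickSight user name and client_id is a hypothetical column in the sales dataset. You upload this file as its own dataset and then apply it to the sales dataset as row-level security, so each user only sees the rows whose client_id matches their rule.

UserName,client_id
client-a-user,CLIENT_A
client-b-user,CLIENT_B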
Hence, the correct answer is: On the Amazon QuickSight web console, create dataset rules with row-level security to restrict access on each dashboard.
The option that says: Create dedicated S3 buckets for each client to store their data, and configure S3 bucket policies for authorization is incorrect because maintaining a separate bucket and bucket policy for every client requires a lot of additional work. A better solution is to use the row-level security feature in Amazon QuickSight.
The option that says: Create dedicated S3 buckets for each client to store their data, and create necessary IAM policies for authorization is incorrect because IAM policies are not used to directly authorize individual Amazon QuickSight users for the data they see in a dashboard. You can simply use the row-level security feature instead.
The option that says: On the Amazon QuickSight web console, create a manifest file with row-level security to restrict access on each dashboard is incorrect because row-level security is configured in the dataset, not in a manifest file.
References: https://docs.aws.amazon.com/quicksight/latest/user/restrict-access-to-a-data-set-using-row-level-security.html https://docs.aws.amazon.com/quicksight/latest/user/working-with-data-sets.html
Question 43 of 65
43. Question
A company wants to analyze their clients and the purchases they make through their platform. They created their own payments solution and integrated it with their retail system. The data that passes through is stored in an Amazon DynamoDB table and is encrypted using the DynamoDB Encryption Client with AWS KMS managed keys before it gets written into the table.
An Amazon Redshift cluster is also used as the data warehouse and is currently used by various departments. The data analyst team needs to build a workflow that loads this data into the warehouse and processes it without compromising the sensitive data.
What should they do to achieve this task?
Correct
The Amazon DynamoDB Encryption Client helps you protect your table data before sending it to Amazon DynamoDB. Encrypting your sensitive data in transit and at rest helps ensure that your plaintext data isn't available to any third party, including AWS.
Amazon DynamoDB is integrated with AWS Lambda so that you can create code that automatically responds to events in DynamoDB Streams. If you enable DynamoDB Streams on a table, you can associate the stream with an AWS Lambda function. AWS Lambda polls the stream and invokes your Lambda function synchronously when it detects new stream records, performing any actions you specify, such as copying each stream record to persistent storage like Amazon S3.
The scenario tests the security of the transfer of the sensitive data from the S3 bucket to the Redshift cluster. Amazon S3 supports both server-side encryption and client-side encryption. The COPY command supports the different types of Amazon S3 encryption, including server-side encryption with AWS KMS-managed keys (SSE-KMS) and client-side encryption using a client-side symmetric master key.
Hence, the correct answer is: Enable DynamoDB streams. Write an AWS Lambda function to transfer the sensitive data to a secured S3 bucket. Create a table in the Amazon Redshift Cluster with access granted to users approved to access the purchases data only. Use the COPY command to load the data from Amazon S3 to the Redshift table using the IAM role with access to the same KMS key.
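A rough Python sketch of that workflow, under a few assumptions: the bucket, cluster, role, and table names below are hypothetical, and the stream records are staged to S3 as-is, still client-side encrypted.

import boto3, json, uuid

s3 = boto3.client("s3")
BUCKET = "secured-purchases-bucket"   # hypothetical bucket, secured by its own bucket policy

def handler(event, context):
    # Triggered by the DynamoDB stream; persists new purchase records to S3.
    for rec in event["Records"]:
        if rec["eventName"] in ("INSERT", "MODIFY"):
            item = rec["dynamodb"]["NewImage"]       # attributes remain client-side encrypted
            s3.put_object(
                Bucket=BUCKET,
                Key=f"purchases/{uuid.uuid4()}.json",
                Body=json.dumps(item).encode(),
            )

# Elsewhere (for example, a scheduled job), load the staged objects into Redshift
# using an IAM role that has access to the same KMS key:
redshift_data = boto3.client("redshift-data")
redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",           # hypothetical cluster
    Database="dev",
    DbUser="data_analyst",
    Sql="COPY purchases FROM 's3://secured-purchases-bucket/purchases/' "
        "IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole' "
        "FORMAT AS JSON 'auto';",
)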
The option that says: Enable DynamoDB streams. Write an AWS Lambda function to transfer the sensitive data to a secured S3 bucket while decrypting it using the same KMS key. Create a table in the Amazon Redshift Cluster with access granted to users approved to access the purchases data only. Use the COPY command to load the data from Amazon S3 to the Redshift table is incorrect. Decrypting the data in the Lambda function before writing it to the S3 bucket means the sensitive data would be stored in plaintext. The DynamoDB Encryption Client already protects the data, and the concern here is to keep that same data protected as it moves from S3 to Amazon Redshift. You can use an IAM role with access to the same KMS key to transfer the data from the S3 bucket to the Redshift table.
The following options are both incorrect because using an Amazon EMR cluster is unnecessary and expensive to implement in this scenario, and these solutions do not follow the AWS Well-Architected Framework. Enabling DynamoDB Streams and using an AWS Lambda function would suffice.
Create an Amazon EMR cluster with an EMR_EC2_DefaultRole role that has access to the KMS key. Create Apache Hive tables that reference the data stored in DynamoDB and a corresponding table in an Amazon Redshift cluster with access only to approved users. In Hive, select the data from DynamoDB and then insert the output to the Redshift table
Create an Amazon EMR cluster and Apache Hive tables that reference the data stored in Amazon DynamoDB. Insert the sensitive data to a secured Amazon S3 bucket. Create a table in the Amazon Redshift Cluster with access granted to users approved to access the purchases data only. Use the COPY command with the IAM role that has access to the same KMS key to load the data from the Amazon S3 bucket to the Redshift table
References: https://docs.aws.amazon.com/redshift/latest/dg/c_loading-encrypted-files.html https://aws.amazon.com/blogs/big-data/encrypt-your-amazon-redshift-loads-with-amazon-s3-and-aws-kms/ https://aws.amazon.com/blogs/security/how-to-encrypt-and-sign-dynamodb-data-in-your-application/
Question 44 of 65
44. Question
A company has a Java application hosted on-premises that processes extract, transform, and load (ETL) jobs, which will be migrated to an Amazon EMR cluster. The company requires its Security Operations (SecOps) team to enable root device volume encryption on all nodes in the EMR cluster. The solution must reduce overhead for the system administrators without modifying the application's code. The SecOps team should also use AWS CloudFormation to create the AWS resources in order to comply with the company standards.
Which is the MOST suitable solution for the scenario?
Correct
You can create Amazon EMR clusters with custom Amazon Machine Images (AMI) running Amazon Linux. This enables you to preload additional software on your AMI and use AMIs that you customize and control. You can also encrypt the Amazon EBS root volume of your AMIs with AWS Key Management Service (KMS) keys. Additionally, you can adjust the Amazon EBS root volume size for instances in your Amazon EMR cluster.
To encrypt the Amazon EBS root device volume of an Amazon Linux AMI for Amazon EMR, copy a snapshot image from an unencrypted AMI to an encrypted target. The source AMI for the snapshot can be the base Amazon Linux AMI, or you can copy a snapshot from an AMI derived from the base Amazon Linux AMI that you customized.
Hence, the correct answer is: In the CloudFormation template, define an EMR cluster that uses a custom AMI with an encrypted root device volume under the CustomAmiId property.
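A minimal sketch of the relevant resource, written here as the Python dictionary equivalent of a JSON CloudFormation template. The AMI ID, subnet, and instance sizes are hypothetical placeholders, and the AMI is assumed to be a custom Amazon Linux image whose root EBS snapshot was copied with KMS encryption before it was registered.

import json

emr_cluster_resource = {
    "EtlCluster": {
        "Type": "AWS::EMR::Cluster",
        "Properties": {
            "Name": "etl-cluster",
            "ReleaseLabel": "emr-5.30.0",
            "CustomAmiId": "ami-0abcd1234",                      # hypothetical encrypted-root custom AMI
            "Applications": [{"Name": "Spark"}],
            "JobFlowRole": "EMR_EC2_DefaultRole",
            "ServiceRole": "EMR_DefaultRole",
            "Instances": {
                "MasterInstanceGroup": {"InstanceCount": 1, "InstanceType": "m5.xlarge"},
                "CoreInstanceGroup": {"InstanceCount": 2, "InstanceType": "m5.xlarge"},
                "Ec2SubnetId": "subnet-0123456789abcdef0",       # hypothetical subnet
            },
        },
    }
}

print(json.dumps({"Resources": emr_cluster_resource}, indent=2))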
The option that says: Provision an Amazon EC2 instance with encrypted root device volumes. Connect to the instance and install Apache Hadoop. Specify the instance in the CloudFormation template is incorrect because it is stated in the scenario that the on-premises ETL jobs are being migrated to Amazon EMR, not to self-managed EC2 instances.
The option that says: In the CloudFormation template, define a custom bootstrap action under the BootstrapActionConfig property of the EMR cluster to enable Transport Layer Security (TLS) is incorrect because TLS is primarily used to encrypt data in transit, not for root device volume encryption.
The option that says: In the CloudFormation template, define a custom bootstrap action under the BootstrapActionConfig property of the EMR cluster to encrypt the root device volume of the master node is incorrect because you can't set up a bootstrap action to encrypt the root device volume. You have to create a custom AMI with an encrypted root device volume using a KMS CMK and reference it in the CloudFormation template.
References: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html#CustomBootstrapCopyS3Object https://aws.amazon.com/about-aws/whats-new/2017/07/amazon-emr-now-supports-launching-clusters-with-custom-amazon-linux-amis/
Question 45 of 65
45. Question
A company runs a global web application that generates and loads large volumes of data into an Amazon S3 bucket. The company uses an Amazon EMR cluster for processing ETL jobs. The cluster persists the processed data into an S3 bucket using the EMR File System (EMRFS) with client-side encryption. The company attached Amazon Elastic Block Store (EBS) volumes to increase the instance storage and meet performance requirements. Due to a regulatory policy, the company should also encrypt the data residing on the Amazon EBS volumes attached to the cluster instances.
Which encryption method should the company use?
Correct
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Amazon EMR is ideal for problems that necessitate the fast and efficient processing of large amounts of data. The web service interfaces allow you to build processing workflows and programmatically monitor the progress of running clusters. You can also use the simple web interface of the AWS Management Console to launch your clusters and monitor processing-intensive computation on clusters of Amazon EC2 instances.
To encrypt the local disk volumes of the EMR cluster instances, you need to create a security configuration. A security configuration allows you to specify data encryption and authentication settings when you create an Amazon EMR cluster. If you enable at-rest encryption for local disks, Amazon EC2 instance store volumes and the attached Amazon EBS storage volumes are encrypted using Linux Unified Key Setup (LUKS).
Alternatively, when using AWS KMS as your key provider, you can choose to turn on EBS encryption to encrypt the EBS root device and storage volumes. Take note that AWS KMS customer master keys (CMKs) require additional permissions for EBS encryption.
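A minimal sketch of such a security configuration is shown below. The key ARN is a placeholder, and the field names follow the security-configuration JSON as commonly documented, so verify them against the EMR references for this question.

import boto3, json

emr = boto3.client("emr")

emr.create_security_configuration(
    Name="luks-local-disk-encryption",
    SecurityConfiguration=json.dumps({
        "EncryptionConfiguration": {
            "EnableAtRestEncryption": True,
            "EnableInTransitEncryption": False,
            "AtRestEncryptionConfiguration": {
                # Instance store and attached EBS storage volumes are encrypted with LUKS,
                # using a key provided through AWS KMS.
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/local-disk-key-id",
                }
            },
        }
    }),
)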
Hence, the correct answer is: Use Linux Unified Key Setup (LUKS).
The option that says: Use an open-source Hadoop Distributed File System (HDFS) encryption is incorrect because HDFS encryption protects data stored in HDFS and data exchanged between cluster instances during distributed processing, but it does not encrypt the attached EBS storage volumes. You have to use Linux Unified Key Setup (LUKS) to satisfy the requirements of this scenario.
The option that says: Use Simple Authentication and Security Layer (SASL) for Spark shuffle encryption is incorrect because this method is only used for encrypting data in transit between nodes in a cluster.
The option that says: Use Secure Sockets Layer (SSL) is incorrect because SSL is simply used for in-transit encryption and not for at-rest data encryption.
References: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-specify-security-configuration.html https://aws.amazon.com/blogs/big-data/secure-amazon-emr-with-encryption/
Question 46 of 65
46. Question
An electronics manufacturing company wants to analyze its transaction data from the past fiscal half-year. The results will be used to forecast trends in different sectors of consumer electronics. The transaction data is stored in an existing Amazon Redshift cluster in the N. Virginia Region. The data analytics team will be using Amazon QuickSight to create a visual representation of the trends.
During configuration, one of the team members launched Amazon QuickSight in the Singapore Region. The team is having trouble connecting Amazon QuickSight in the Singapore Region to the cluster in the N. Virginia Region.
Which of the following configurations should be performed to solve the issue?
Correct
For Amazon QuickSight to access your AWS resources, you must create security groups for them that authorize connections from the IP address ranges used by Amazon QuickSight servers. You must have AWS credentials that permit you to access these AWS resources to modify their security groups.
For Amazon QuickSight to connect to an Amazon Redshift instance, you must create a new security group for that instance. This security group contains an inbound rule authorizing access from the appropriate IP address range for the Amazon QuickSight servers in that AWS Region.
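As a rough sketch, assuming the Redshift cluster runs in a VPC (the VPC ID and CIDR below are placeholders; the actual QuickSight IP address range for ap-southeast-1 is published in the QuickSight documentation), the security group and inbound rule could be created with boto3:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # Region of the Redshift cluster

# Placeholder values -- use the VPC of the Redshift cluster and the published
# QuickSight IP address range for the Singapore Region.
REDSHIFT_VPC_ID = "vpc-0123456789abcdef0"
QUICKSIGHT_CIDR = "203.0.113.0/27"  # placeholder, not the real QuickSight range

# Dedicated security group for QuickSight access to the cluster.
sg = ec2.create_security_group(
    GroupName="quicksight-redshift-access",
    Description="Allow Amazon QuickSight (Singapore) to reach the Redshift cluster",
    VpcId=REDSHIFT_VPC_ID,
)

# Inbound rule: Redshift's default port 5439 from the QuickSight IP range.
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5439,
            "ToPort": 5439,
            "IpRanges": [
                {"CidrIp": QUICKSIGHT_CIDR,
                 "Description": "Amazon QuickSight ap-southeast-1"}
            ],
        }
    ],
)

The new security group is then attached to the cluster, for example by adding its ID to the cluster's VpcSecurityGroupIds with modify_cluster. For an older EC2-Classic cluster, the equivalent would be a Redshift cluster security group authorized with authorize_cluster_security_group_ingress.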
Hence, the correct answer is: Configure a new cluster security group that contains an inbound rule authorizing access from the appropriate IP address range for Amazon QuickSight in the Singapore region.
The option that says: Use a cluster connection string to privately access the Redshift cluster from Amazon QuickSight in the Singapore region is incorrect because a cluster connection string is not needed; it is only used to connect to your cluster with a SQL client tool.
The option that says: Provision Amazon QuickSight in a custom VPC. Use a VPC endpoint to connect to the VPC of the Redshift cluster is incorrect because creating a VPC endpoint is not needed.
The option that says: Enable cross-region snapshots and select Singapore as the destination region from the Amazon Redshift console. Create a cluster from the snapshot and associate it to Amazon QuickSight is incorrect. Although this is possible, it doesn't really solve the issue and it increases cost because you would be running two Redshift clusters with the same data.
References: https://docs.aws.amazon.com/quicksight/latest/user/enabling-access-redshift.html#redshift-classic-access https://docs.aws.amazon.com/quicksight/latest/user/working-with-aws-vpc.html
Question 47 of 65
47. Question
A Data Analyst is responsible for the security of the bank's Amazon Redshift cluster. A new 20-column table will include columns that contain confidential data such as the account holders' personal information. This table will be queried by various departments.
How can the Data Analyst secure this table and allow only privileged users to read the columns containing confidential data with the least maintenance overhead?
Correct
Confidential data is commonly found in data warehouse tables in many firms. Previously, views or AWS Lake Formation on Amazon Redshift Spectrum were used to manage such scenarios; however, this adds the extra overhead of creating and maintaining views or Amazon Redshift Spectrum. The view-based approach is also difficult to scale and can lead to a lack of security controls. Amazon Redshift column-level access control is a feature that supports access control at the column level for data in Amazon Redshift. You can use column-level GRANT and REVOKE statements to help meet your security and compliance needs, similar to managing any other database object.
When you create a database object, you are its owner. By default, only a superuser or the owner of an object can query, modify, or grant privileges on the object. For users to use an object, you must grant the necessary privileges to the user or the group that contains the user. Database superusers have the same privileges as database owners. Amazon Redshift supports the following privileges: SELECT, INSERT, UPDATE, DELETE, REFERENCES, CREATE, TEMPORARY, and USAGE. Different privileges are associated with different object types.
For an Amazon Redshift table, you can grant only the SELECT and UPDATE privileges at the column level. For an Amazon Redshift view, you can grant only the SELECT privilege at the column level.
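A minimal sketch of those grants, issued here through the Redshift Data API with boto3 (the cluster, database, table, column, user, and group names are all hypothetical):

import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Placeholder identifiers -- replace with the bank's cluster, database, and user.
CLUSTER_ID = "bank-analytics-cluster"
DATABASE = "transactions"
DB_USER = "admin"

statements = [
    # Privileged group: full SELECT on the table, including confidential columns.
    "GRANT SELECT ON accounts TO GROUP privileged_analysts;",
    # Everyone else: column-level SELECT on the non-confidential columns only.
    "GRANT SELECT (account_id, branch_code, txn_date, txn_amount) "
    "ON accounts TO GROUP general_analysts;",
]

for sql in statements:
    redshift_data.execute_statement(
        ClusterIdentifier=CLUSTER_ID,
        Database=DATABASE,
        DbUser=DB_USER,
        Sql=sql,
    )

The same GRANT statements can be run from any SQL client connected to the cluster; the Data API is used here only to keep the example self-contained.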
Therefore, the correct answer is: Grant the privileged users Select access to the table using the GRANT SQL command. Run Grant Select on columns that do not contain confidential data for the rest of the users with the same command.
The option that says: Grant all users Select access to non-sensitive columns of the table using the GRANT SQL command. Attach an IAM policy to users with explicit ALLOW read access to the columns containing confidential data is incorrect. IAM policies control access to AWS API actions and resources, not to individual database objects; table and column permissions inside Redshift are managed with GRANT and REVOKE.
The option that says: Instead of creating a single table, create two new tables with one of them containing the confidential data. Grant the privileged users Select access to confidential data using the GRANT SQL command. For queries that require both tables, use table joins is incorrect. Although this works, it adds the overhead of maintaining two tables and rewriting queries, which column-level access control avoids.
The option that says: Create a view of the table specifying all the columns that do not contain confidential data. Grant the privileged users Select access to the table using the GRANT SQL command. Grant the other users Select access to the view is incorrect. As explained above, the view-based approach requires more maintenance overhead than column-level access control.
References: https://aws.amazon.com/blogs/big-data/achieve-finer-grained-data-security-with-column-level-access-control-in-amazon-redshift/ https://docs.aws.amazon.com/redshift/latest/dg/r_GRANT.html