AWS Certified Data Analytics Specialty Practice Test 9
Question 1 of 35
1. Question
1. Question
You work as a data engineer for an organization responsible for tracking the spending on company-issued tablets and the apps that are purchased. The tablet owners' spending on app usage is logged to a Kinesis Data Firehose, where the data is then delivered to S3 and copied onto Redshift every 15 minutes. Your job is to set up a billing alert system that notifies tablet owners within 10 minutes when they have spent too much on apps. Currently, a DynamoDB table contains the cumulative app spending total, as well as the threshold amount. If the cumulative total surpasses the threshold amount, a notification must be sent out to the tablet owner. What is a solution that will allow for timely notifications to be sent to tablet owners when the spending threshold is surpassed?
Correct
Because the data is already streaming through Kinesis Data Firehose, Kinesis Data Analytics lets you run SQL aggregations over the stream and calculate the cumulative spending in a timely manner. Reference: Streaming Data Solutions on AWS with Amazon Kinesis
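For illustration, a minimal sketch of how a Lambda function fed by the Kinesis Data Analytics output might compare each owner's aggregated spend against the DynamoDB threshold and publish an SNS alert; the table name, topic ARN, and record field names are hypothetical.

```python
import base64
import json
from decimal import Decimal

import boto3

# Hypothetical resource names and record fields, for illustration only.
TABLE_NAME = "TabletSpending"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:spending-alerts"

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")
table = dynamodb.Table(TABLE_NAME)


def handler(event, context):
    """Receive aggregated spend records (e.g. from a Kinesis Data Analytics
    Lambda output) and alert owners whose cumulative spend exceeds the
    threshold stored in DynamoDB."""
    results = []
    for record in event.get("records", []):
        payload = json.loads(base64.b64decode(record["data"]))
        owner_id = payload["owner_id"]                 # assumed field name
        window_spend = Decimal(str(payload["spend"]))  # assumed field name

        item = table.get_item(Key={"owner_id": owner_id}).get("Item", {})
        cumulative = item.get("cumulative_spend", Decimal(0)) + window_spend
        threshold = item.get("threshold", Decimal("Infinity"))

        # Persist the new running total.
        table.update_item(
            Key={"owner_id": owner_id},
            UpdateExpression="SET cumulative_spend = :c",
            ExpressionAttributeValues={":c": cumulative},
        )

        if cumulative > threshold:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject="App spending threshold exceeded",
                Message=f"Owner {owner_id} has spent {cumulative} on apps.",
            )

        # Acknowledge the record so the analytics application treats delivery as successful.
        results.append({"recordId": record["recordId"], "result": "Ok"})
    return {"records": results}
```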
Question 2 of 35
2. Question
You work for a coffee company that has thousands of branches all over the country. The sales system generates transaction logs, which are aggregated and uploaded to an S3 bucket 'transaction-logs' with a subfolder for each item category, like those shown below: transaction-logs/dt=11-22-2019-0700/Hot-Drinks/ transaction-logs/dt=11-22-2019-0800/Cold-Drinks/ transaction-logs/dt=11-22-2019-0900/Edibles-Sweet/ transaction-logs/dt=11-22-2019-1000/Edibles-Salty/ Some store locations are open from 8 AM to 5 PM, but there are also many 24-hour locations, which means millions of transactions are reported per hour. To parse and analyze the data, an Elastic MapReduce (EMR) cluster processes the logs and uploads the results to a Redshift data warehouse. What changes should you make to the S3 bucket layout to get better read performance without altering the current architecture?
Correct
S3 is a massively distributed and scalable service that allocates read throughput per key prefix, so introducing new, unique prefixes increases aggregate read performance. The keys could be renamed to create a separate prefix for each item category, giving each category its own read throughput. In this scenario, for example, we could aggregate logs by hour and use the date and hour as part of a unique prefix: transaction-logs/dt=2020-11-22-0800/item-drinks/hot/mocha transaction-logs/dt=2020-11-22-0900/item-drinks/cold/iced_coffee transaction-logs/dt=2020-11-22-1000/item-edibles/sweet/donut transaction-logs/dt=2020-11-22-1100/item-edibles/salty/egg_roll
Question 3 of 35
3. Question
You've been contacted by Terry's Temperature Tracking, which is a company that tracks body temperatures through IoT vests that their customers wear. As they have grown, their on-premises storage is getting dangerously close to being full. They need to move 30 TB of data from their data center to AWS as part of their move to the cloud. Additionally, they need to store this data somewhere that will make it easily accessible to as many services as possible. Unfortunately, their data center bandwidth allotment is being consumed by their applications and they are unable to spare any bandwidth for the data migration. Which of the following solutions offers the best way to accomplish this?
Correct
Of the options presented, this one moves the live data to S3 with the least interruption to application functionality. The bulk of the data can then be migrated with Snowball behind the scenes and, after another brief interruption to cut over, the storage layer of the application will be entirely in the cloud.
Question 4 of 35
4. Question
Congratulations, your website for people to comment on your collection of Magic Nose Goblins has gone viral! Unfortunately, the Elasticsearch domain you've set up to make user comments and your written descriptions searchable is running out of storage. Fortunately, search volume is well served by the current number of nodes in the domain; you just need more storage. The domain is configured to use Elastic Block Store (EBS) storage. How can you add additional storage?
Correct
This would be the best way to add additional storage. Because Amazon Elasticsearch Service is a fully managed service, you can increase the EBS volume size in the domain configuration and the service applies the storage change with minimal work on the end user's part.
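A minimal sketch of increasing the EBS volume size through the domain configuration API; the domain name and target volume size are placeholders.

```python
import boto3

es = boto3.client("es")

# Hypothetical domain name and target volume size.
es.update_elasticsearch_domain_config(
    DomainName="goblin-comments",
    EBSOptions={
        "EBSEnabled": True,
        "VolumeType": "gp2",
        "VolumeSize": 512,  # new size in GiB per data node
    },
)
```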
Question 5 of 35
5. Question
You work as a data analyst for a major airline that operates flights scheduled all around the globe. The current ticketing system is going through a technical audit and is required, by air traffic control law, to have all parts of the ticketing system digitized. The volume of ticketing data created on a daily basis is incredibly high. Your team has been tasked with collecting the ticketing data and storing it in S3, which is copied on a nightly basis to a company data lake for retrieval. There is also a requirement that the ticketing data be transformed and grouped into batches according to the flight departure location. The data must be optimized for high-performance retrieval rates, as well as collected and stored with high durability. Which solution would you use to ensure the data is collected and stored in a cost-effective, durable, and high-performing manner?
Correct
This is the best answer because it uses ORC files: Kinesis Data Firehose transformations convert and batch the incoming records into ORC, a columnar format that allows highly optimized SQL queries in the company data lake.
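A rough sketch of a Firehose delivery stream configured to convert incoming JSON records to ORC before batching them to S3; the stream name, bucket, IAM role, and Glue schema references are hypothetical.

```python
import boto3

firehose = boto3.client("firehose")

ROLE_ARN = "arn:aws:iam::123456789012:role/firehose-delivery-role"  # placeholder

firehose.create_delivery_stream(
    DeliveryStreamName="ticketing-orc",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": ROLE_ARN,
        "BucketARN": "arn:aws:s3:::example-ticketing-bucket",
        "Prefix": "tickets/",
        "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 128},
        # Convert incoming JSON records to ORC using a schema registered in Glue.
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"OrcSerDe": {}}},
            "SchemaConfiguration": {
                "DatabaseName": "ticketing",   # hypothetical Glue database
                "TableName": "tickets",        # hypothetical Glue table
                "RoleARN": ROLE_ARN,
            },
        },
    },
)
```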
Question 6 of 35
6. Question
A company has hundreds of web applications hosted in a fleet of EC2 instances. The company requires a cost-effective near real-time server log analysis solution without having to manage any infrastructure. The solution has the following requirements:
Collect and transform log files into JSON format.
Can handle delivery failures.
Can analyze and visualize log data.
Which approach is the most suitable and has the least operational overhead?
Correct
Amazon Elasticsearch Service provides Elasticsearch and Kibana in the AWS Cloud in a way that's easy to set up and operate. Amazon Kinesis Firehose provides reliable, serverless delivery of Apache web logs (or other log data) to Amazon Elasticsearch Service.
With Firehose, you can add an automatic call to an AWS Lambda function to transform records within Firehose. With these two technologies, you have an effective, easy-to-manage replacement for your existing ELK (Elasticsearch, Logstash, and Kibana) stack.
Amazon Elasticsearch Service indexes the data, makes it available for analysis in real-time, and allows you to visualize the performance metrics in real-time using Kibana dashboards.
Hence, the correct answer is: Create a Kinesis Data Firehose to ingest the log data and use a Lambda Function for format conversion. Send the formatted log files into Amazon Elasticsearch Service for log analysis and visualization. Send the failed deliveries to an Amazon S3 bucket.
The option that says: Create a Kinesis Data Streams stream to ingest the log data and use a Lambda Function for format conversion. Deliver the formatted log files into Amazon Elasticsearch Service for log analysis and visualization. Send the failed deliveries to an Amazon S3 bucket is incorrect because Amazon Kinesis Data Streams has no direct integration with Amazon Elasticsearch Service. You have to use Amazon Kinesis Firehose instead.
The option that says: Create a Kinesis Data Firehose to ingest the log data and use a Lambda Function for format conversion. Send the formatted log files into Amazon Kinesis Data Analytics for log analysis and store the results into an S3 bucket. Use Amazon QuickSight to visualize the logs. Send the failed deliveries to an Amazon S3 bucket is incorrect because Amazon Elasticsearch Service can already do the job of Amazon Kinesis Data Analytics and Amazon QuickSight at a much lower cost with lower operational overhead.
The option that says: Create a Kinesis Data Streams stream to ingest the log data and use a Lambda Function for format conversion. Deliver the formatted log files into an S3 bucket. Analyze the log files using Amazon Athena and store the results in a separate bucket. Use Amazon QuickSight to visualize the logs. Send the failed deliveries to an Amazon S3 bucket is incorrect. Technically, this implementation is possible but it is not a near real-time solution. Amazon Elasticsearch Service is the more appropriate AWS service to use for both log analysis and visualization.
References: https://aws.amazon.com/blogs/database/send-apache-web-logs-to-amazon-elasticsearch-service-with-kinesis-firehose/ https://docs.aws.amazon.com/streams/latest/dev/amazon-kinesis-consumers.html
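A minimal sketch of a Firehose transformation Lambda that converts Apache log lines to JSON; the log pattern and field names are assumptions, while the recordId/result/data response shape is the contract Firehose expects from a transformation function.

```python
import base64
import json
import re

# Rough pattern for an Apache common/combined log line; adjust to the actual format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)


def handler(event, context):
    """Firehose data-transformation Lambda: convert raw Apache log lines to JSON."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        match = LOG_PATTERN.match(line)
        if match:
            payload = (json.dumps(match.groupdict()) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload).decode("utf-8"),
            })
        else:
            # Unparseable lines are marked as failed; Firehose writes them to the
            # configured S3 error prefix for later inspection.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```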
Question 7 of 35
7. Question
An electronics company released its first flagship smartphone model last year. The sales figures were profitable, so this year the company plans to release two new models, the Mini and the Pro, alongside the current flagship model. These new models require significant R&D funding, so management wants accurate insights on the trends and forecasted sales figures to see if they can achieve the business goals by the end of the year.
Which of the following options is the easiest way to help the company forecast sales figures?
Correct
Using ML-powered forecasting, you can forecast your key business metrics with point-and-click simplicity. No machine learning expertise is required. The built-in ML algorithm in Amazon QuickSight is designed to handle complex real-world scenarios. Amazon QuickSight uses machine learning to help provide more reliable forecasts than available by traditional means.
You can forecast your business revenue with multiple levels of seasonality (for example, sales with both weekly and quarterly trends). Amazon QuickSight automatically excludes anomalies in the data (for example, a spike in sales due to a price drop or promotion) from influencing the forecast. You also don't have to clean and re-prepare data with missing values because Amazon QuickSight handles that automatically. In addition, with ML-powered forecasting, you can perform interactive what-if analyses to determine the growth trajectory you need to meet business goals.
You can add a forecasting widget to your existing analysis, and publish it as a dashboard. With ML-powered forecasting, Amazon QuickSight enables you to forecast complex, real-world scenarios such as data with multiple seasonality. It automatically excludes outliers that it identifies and imputes missing values.
Therefore, the correct answer is: Use ML-powered forecasting with Amazon QuickSight to forecast sales figures.
The option that says: Create graphical widgets and forecast of the sales figures using Amazon QuickSight Visuals is incorrect because QuickSight Visuals just generates a graph of the datasets. This means that you would still need to set up your custom analysis with ML-powered forecasting. A better solution is to use Amazon QuickSight's built-in ML-powered forecasting feature.
The option that says: Import the previous sales figures on Amazon SageMaker to create a model of the forecasted sales figures is incorrect. Although this is possible, it is not easy to implement because you would need to build, train, and deploy your own machine learning model. It is stated in the scenario that the company is looking for the easiest way to forecast the sales figures. This option entails a lot of unnecessary steps.
The option that says: Create an analysis of previous sales figures on Amazon SageMaker and use Amazon QuickSight to generate the forecasted sales figures is incorrect. This may be possible, but you don't need to use Amazon SageMaker at all to fulfill the requirements in this scenario. Amazon QuickSight already provides the tools you need for analysis and forecasting.
References: https://docs.aws.amazon.com/quicksight/latest/user/forecasts-and-whatifs.html https://docs.aws.amazon.com/quicksight/latest/user/making-data-driven-decisions-with-ml-in-quicksight.html https://docs.aws.amazon.com/quicksight/latest/user/computational-insights.html
Question 8 of 35
8. Question
A company hosts its web application in an Auto Scaling group of Amazon EC2 instances. The data analytics team needs to create a solution that will collect and analyze the logs from all of the EC2 instances running in production. The solution must be highly accessible and allow viewing of new log information in near real-time.
Which of the following is the most suitable solution to meet the requirement?
Correct
The Kinesis Producer Library is an easy-to-use, highly configurable library that helps you write to a Kinesis data stream. It acts as an intermediary between your producer application code and the Kinesis Data Streams API actions. You can collect, monitor, and analyze your Kinesis Data Streams producers using KPL. The KPL emits throughput, error, and other metrics to CloudWatch on your behalf and is configurable to monitor at the stream, shard, or producer level.
Amazon Kinesis Producer Library (KPL) agent continuously monitors a set of files and sends new data to your stream. The agent also handles the file rotation, checkpointing, and retry upon failures. It delivers all of your data in a reliable, timely, and simple manner. Using Amazon Kinesis Firehose, you can deliver the streaming data to Amazon Elasticsearch Service. The Amazon Elasticsearch Service and Kibana can be used together to store and visualize the logs data.
Hence, the correct answer is: Install the Amazon Kinesis Producer Library agent in the EC2 instances. Use the agent to collect and send the logs to a data stream. Use the data stream as a source for Amazon Kinesis Data Firehose, which will deliver the log data to Amazon Elasticsearch Service and Kibana.
The option that says: Enable the detailed monitoring feature on all the EC2 instances. Use CloudWatch to collect metrics and logs. Analyze the data using Amazon Kinesis Data Analytics is incorrect because the detailed monitoring feature only sends metric data of your instance to CloudWatch in 1-minute periods. This option won't help you send the data in real time.
The option that says: Install the Amazon Kinesis Producer Library agent in the EC2 instances. Use the agent to collect and send the data to Amazon Kinesis Data Streams, which will deliver the data to Amazon Elasticsearch Service and Amazon QuickSight is incorrect because you can't directly deliver data from Kinesis Data Streams to Amazon ES. Additionally, Amazon Elasticsearch does not have a direct integration with Amazon QuickSight.
The option that says: Use Amazon CloudWatch Logs subscriptions to process log data in real-time. Send the data to Amazon Kinesis Data Streams, which will deliver the data to Amazon Elasticsearch Service and Amazon QuickSight is incorrect because the subscription filter only applies to data that has already been sent to CloudWatch Logs, and you can't directly integrate Amazon Elasticsearch with Amazon QuickSight. This option also doesn't describe how the data is collected from the EC2 instances in the first place, nor does it mention installing a CloudWatch Logs agent on the instances. Also, just like the other incorrect answer, you can't directly deliver data from Kinesis Data Streams to Amazon ES.
References: https://docs.aws.amazon.com/quicksight/latest/user/supported-data-sources.html https://docs.aws.amazon.com/firehose/latest/dev/writing-with-agents.html
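A rough sketch of what the agent automates: tailing a log file and pushing each new line to a Kinesis data stream. The stream name is a placeholder, and the real agent additionally handles file rotation, checkpointing, and retries.

```python
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "web-app-logs"  # hypothetical stream name


def tail_and_send(path):
    """Follow a log file and push each new line to a Kinesis data stream.
    This is roughly what the agent does on your behalf, minus rotation
    handling, checkpointing, and retry logic."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            kinesis.put_record(
                StreamName=STREAM_NAME,
                Data=line.encode("utf-8"),
                PartitionKey=path,  # spread records across shards by source file
            )
```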
Question 9 of 35
9. Question
A manufacturing company is using IoT sensors to track the temperature and humidity of the environment. A group of analysts is assigned to each department and needs to produce a visual report every morning. The aggregated sensor data must be used in generating the reports.
Which of the following options would be the most cost-effective solution?
Correct
Amazon QuickSight is a fast business analytics service used to build visualizations, perform ad hoc analysis, and quickly get business insights from your data. Amazon QuickSight seamlessly discovers AWS data sources, enables organizations to scale to hundreds of thousands of users, and delivers fast and responsive query performance using a robust in-memory engine (SPICE).
Amazon EMR has two cluster types, transient and persistent. Each can be useful, depending on your task and system configuration.
In this scenario, you must create a cost-effective solution. Amazon EMR can be used to quickly and cost-effectively perform data transformation (ETL) workloads such as sort, aggregate, and join on large datasets. You can use a transient cluster to aggregate the sensor data each night; after the job completes, the transient cluster is automatically terminated. This approach saves costs because the cluster only runs each night, and you can use Amazon QuickSight to get insights instantly and effortlessly.
Hence, the correct answer is: Create a transient cluster in Amazon EMR to collect the sensor data each night and use Amazon QuickSight to generate a report each morning.
The option that says: Collect the sensor data every 15 minutes using Spark Streaming on Amazon EMR and use Amazon QuickSight to generate a report each morning is incorrect because collecting the data every 15 minutes is not cost-effective. The best solution is to collect the data at the end of the day and generate a report each morning.
The option that says: Generate a report by creating a Kinesis Client Library (KCL) Java application in an Amazon EC2 instance and use Amazon QuickSight to publish a report each morning is incorrect because it takes a lot of time to develop a custom application that uses the KCL.
The option that says: Create a long-running cluster in Amazon EMR to collect the sensor data each night and use Zeppelin notebooks to generate a report each morning is incorrect because this solution uses a long-running cluster, which is not cost-effective in this scenario since you only need to process the data at night.
References: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-longrunning-transient.html https://docs.aws.amazon.com/quicksight/latest/user/how-quicksight-works.html
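A minimal sketch of launching a transient EMR cluster that terminates itself when its steps finish; the release label, instance types, script location, and IAM roles are illustrative assumptions.

```python
import boto3

emr = boto3.client("emr")

# Hypothetical nightly aggregation job; names and paths are illustrative.
response = emr.run_job_flow(
    Name="nightly-sensor-aggregation",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient behaviour: terminate the cluster when the steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "aggregate-sensor-data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/aggregate.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```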
Question 10 of 35
10. Question
A company has a clickstream analytics solution using Amazon Elasticsearch Service. The solution ingests 2 TB of data from Amazon Kinesis Data Firehose and stores the latest data collected within 24 hours in an Amazon ES cluster. The cluster is running on a single index that has 12 data nodes and 3 dedicated master nodes. The cluster is configured with 3,000 shards and each node has 3 TB of EBS storage attached. The Data Analyst noticed that the query performance of Elasticsearch is sluggish, and some intermittent errors are produced by the Kinesis Data Firehose when it tries to write to the index. Upon further investigation, there were occasional JVMMemoryPressure errors found in Amazon ES logs.
What should be done to improve the performance of the Amazon Elasticsearch Service cluster?
Correct
Amazon Elasticsearch Service (Amazon ES) is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. With Amazon ES, you get direct access to the Elasticsearch APIs; existing code and applications work seamlessly with the service.
Each Elasticsearch index is split into some number of shards. You should decide the shard count before indexing your first document. The overarching goal when choosing a number of shards is to distribute the index evenly across all data nodes in the cluster. However, these shards shouldn't be too large or too numerous.
A good rule of thumb is to keep shard sizes between 10 and 50 GiB. Large shards can make it difficult for Elasticsearch to recover from failure, but because each shard uses some amount of CPU and memory, having too many small shards can cause performance issues and out-of-memory errors. In other words, shards should be small enough that the underlying Amazon ES instance can handle them, but not so small that they place needless strain on the hardware.
Hence, the correct answer is: Improve the cluster performance by decreasing the number of shards of Amazon Elasticsearch index.
The option that says: Improve the cluster performance by increasing the number of master nodes of Amazon Elasticsearch is incorrect because dedicated master nodes are only used to increase cluster stability. Therefore, this option won't help you improve the performance of the cluster.
The option that says: Improve the cluster performance by decreasing the number of data nodes of Amazon Elasticsearch is incorrect because these nodes carry all the data in your indexes (storage) and do all the processing for your requests (CPU). If you decrease the number of data nodes, the performance of the cluster still won't improve.
The option that says: Improve the cluster performance by increasing the number of shards of Amazon Elasticsearch index is incorrect. The JVMMemoryPressure error signifies that shard allocation is unbalanced across nodes. This means that there are too many shards in the Amazon ES cluster, not too few. To improve the performance of the cluster, you must decrease the number of shards.
References: https://aws.amazon.com/premiumsupport/knowledge-center/elasticsearch-node-crash/ https://aws.amazon.com/premiumsupport/knowledge-center/high-jvm-memory-pressure-elasticsearch/
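Back-of-the-envelope shard sizing under the 10-50 GiB guideline, assuming roughly 2 TB of indexed data as stated in the scenario:

```python
# Approximate shard sizing for ~2 TB of data, using the 10-50 GiB guideline.
data_gib = 2 * 1024          # treat the ~2 TB of ingested data as ~2,048 GiB
target_shard_gib = 30        # middle of the recommended 10-50 GiB range

recommended_shards = round(data_gib / target_shard_gib)
print(recommended_shards)    # ~68 shards

# With the 3,000 shards actually configured, each shard holds well under 1 GiB,
# so per-shard CPU and JVM heap overhead dominates and drives JVMMemoryPressure.
print(round(data_gib / 3000, 2))   # ~0.68 GiB per shard
```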
Question 11 of 35
11. Question
A large financial institution recently launched a new feature for its online customers. The management tasked the Data Analyst to create a dashboard that will visualize customer transactions made through its online platform. The transactional data will be streamed to Amazon Kinesis Data Firehose with a buffer interval of 60 seconds. The dashboard will display the near-real-time status of the transactions so the analysis of the Kinesis Firehose stream is time-sensitive.
Which of the following should the Data Analyst implement to meet the visualization requirements?
Correct
Amazon Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk, enabling near real-time analytics with the existing business intelligence tools and dashboards you're already using today.
AWS offers Amazon Elasticsearch Service, a fully managed service that delivers Elasticsearch with built-in Kibana. Amazon ES service provides support for open-source Elasticsearch APIs, managed Kibana, integration with Logstash and other AWS services, and built-in alerting and SQL querying.
Kibana is an open-source data visualization and exploration tool used for log and time-series analytics, application monitoring, and operational intelligence use cases. It is a data aggregation and visualization tool that enables you to explore, visualize, analyze, and discover data in real-time with Amazon ES.
Therefore, the correct answer is: Deliver the streaming data of Kinesis Data Firehose to Amazon Elasticsearch Service. Use the Elasticsearch data to create a Kibana dashboard that will display the required analyses and visualizations.
The option that says: Deliver the streaming data of Kinesis Data Firehose to an Amazon S3 bucket. Set the S3 bucket as a source for an Amazon SageMaker Jupyter notebook. Run the required analyses and generate visualizations from this notebook is incorrect because an Amazon SageMaker Jupyter notebook cannot directly read data from Amazon Kinesis Data Firehose. Take note that Kinesis Firehose only supports Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk as destinations.
The option that says: Deliver the streaming data of Kinesis Data Firehose to Amazon Redshift. Connect the cluster to Amazon QuickSight with SPICE. Use QuickSight to analyze and generate the required visualizations is incorrect. Although the integration of Redshift, Kinesis Data Firehose, and QuickSight is valid, this visualization solution is not capable of providing near-real-time data.
The option that says: Deliver the streaming data of Kinesis Data Firehose to an Amazon S3 bucket. Create an AWS Glue Catalog from this data and use Amazon Athena to analyze it. Use Amazon Neptune to generate the graphs and visualizations is incorrect. This approach may be suitable for analyzing non-time-critical data such as transaction history, but it will not be adequate for displaying near-real-time transactions as required by the scenario due to the additional processing time caused by AWS Glue, Amazon Athena, and Amazon Neptune.
References: https://aws.amazon.com/kinesis/data-firehose/faqs https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-kibana.html https://aws.amazon.com/blogs/contact-center/use-amazon-connect-data-in-real-time-with-elasticsearch-and-kibana/
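A rough sketch of a Firehose delivery stream sending records to an Amazon ES domain with a 60-second buffer and S3 backup of failed documents; the ARNs, index name, and retry settings are placeholders.

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical ARNs and names; the 60-second buffer matches the scenario.
firehose.create_delivery_stream(
    DeliveryStreamName="transactions-to-es",
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-es-delivery",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/transactions",
        "IndexName": "transactions",
        "IndexRotationPeriod": "OneDay",
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
        "RetryOptions": {"DurationInSeconds": 300},
        # Failed documents are backed up to S3 for later inspection.
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-es-delivery",
            "BucketARN": "arn:aws:s3:::example-transactions-backup",
        },
    },
)
```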
Question 12 of 35
12. Question
An insurance company is using Amazon S3 to store daily financial records. The Data Analyst must create a visualization report on the datasets stored in the S3 bucket using Amazon QuickSight. The report must display all the records, including recent data.
Which of the following is the most suitable solution to meet the requirement?
Correct
Amazon QuickSight is a fast business analytics service to build visualizations, perform ad hoc analysis, and quickly get business insights from your data. Amazon QuickSight seamlessly discovers AWS data sources, enables organizations to scale to hundreds of thousands of users, and delivers fast and responsive query performance by using a robust in-memory engine (SPICE).
When refreshing data, Amazon QuickSight handles datasets differently depending on the connection properties and the storage location of the data. If QuickSight connects to the data store by using a direct query, the data automatically refreshes when you open an associated dataset, analysis, or dashboard. To refresh SPICE datasets, QuickSight must independently authenticate using stored credentials to connect to the data. QuickSight can't refresh manually uploaded data, even data from S3 buckets that is stored in SPICE, because QuickSight doesn't store its connection and location metadata. If you want to automatically refresh data that is stored in an S3 bucket, create a dataset by using the S3 data source card.
You can refresh your SPICE datasets at any time. Refreshing imports the data into SPICE again, so the data includes any changes since the last import.
You can refresh SPICE data by using any of the following approaches:
You can use the options on the Your Data Sets page.
You can refresh a dataset while editing a dataset.
You can schedule refreshes in the dataset settings.
You can use the CreateIngestion API operation to refresh the data (a minimal sketch of this call follows the list).
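For reference, below is a minimal boto3 sketch of the CreateIngestion call from the last item above; the account ID and dataset ID are placeholder assumptions.

```python
import uuid
import boto3

quicksight = boto3.client("quicksight")

# Placeholder identifiers, for illustration only.
response = quicksight.create_ingestion(
    AwsAccountId="123456789012",
    DataSetId="daily-financial-records",
    IngestionId=str(uuid.uuid4()),  # each refresh needs a unique ingestion ID
)
# e.g. INITIALIZED while the SPICE refresh is being queued and run
print(response["IngestionStatus"])
```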
In this scenario, you need to display all the records, including the recent data. When you create a dataset using Amazon S3, the file data is automatically imported into SPICE. You can refresh your SPICE datasets at any time. Since the insurance company is storing daily financial records in an S3 bucket, you can create a scheduled refresh and set the repeats to daily. Take note that you can only create five schedules for each dataset. When you have created five, the create button is disabled.
Hence, the correct answer is: Select the dataset and create a daily schedule refresh in the dataset settings.
The option that says: Create a scheduled Lambda function and tag the dataset is incorrect because setting up a scheduled Lambda function with CloudWatch Events entails a lot of unnecessary effort. You will have to develop a custom function that invokes the CreateIngestion API operation to refresh the SPICE datasets. Amazon QuickSight already has a built-in feature that can refresh the datasets automatically, which is why there's no need to set up a custom Lambda function.
The option that says: Schedule the dataset refresh weekly in the dataset settings is incorrect because the company is storing daily financial records. Therefore, you must create a scheduled refresh daily and not weekly.
The option that says: Refresh the data by opening and editing the dataset is incorrect because this approach is done manually. You must create an automated task by creating a daily schedule refresh on the dataset in Amazon QuickSight.
References: https://docs.aws.amazon.com/quicksight/latest/user/create-a-data-set-s3.html https://docs.aws.amazon.com/quicksight/latest/user/refreshing-imported-data.html#schedule-data-refresh
Question 13 of 35
13. Question
A Data Analyst is running a data profiler using Amazon EMR. The results are stored in AWS Glue Data Catalog and an S3 bucket. The Data Analyst uses Amazon Athena and Amazon QuickSight for analysis and data visualization. The Data Catalog is updated to include a new data profiler which stores metrics to a separate Amazon S3 bucket. A new Amazon Athena table is created to reference the new S3 bucket. The Data Analyst used the Athena table as a new data source in Amazon QuickSight; however, the import into SPICE (Super-fast, Parallel, In-memory Calculation Engine) failed.
How should the Data Analyst resolve the issue?
Correct
Amazon QuickSight is an analytics service that you can use to create datasets, perform one-time analyses, and build visualizations and dashboards. In an enterprise deployment of QuickSight, you can have multiple dashboards, and each dashboard can have multiple visualizations based on multiple datasets. Keeping track of all these datasets' statuses and their latest refresh timestamps can quickly become a management overhead.
Amazon Athena is an interactive query service that makes it easy to analyze data in an S3 bucket using standard SQL. You can visualize the results of your Amazon Athena queries in Amazon QuickSight. Connecting to Athena from QuickSight is a 1-click process. There's no need to provide endpoints, a username, or a password. Simply select Athena as your data source, select the database and tables you want to analyze, and start visualizing in QuickSight.
To successfully connect Amazon QuickSight to the Amazon S3 buckets used by Athena, make sure that you have authorized Amazon QuickSight to access those S3 buckets. It's not enough that you, the user, are authorized; Amazon QuickSight must be authorized separately.
To authorize Amazon QuickSight to access your Amazon S3 bucket, open the QuickSight console and choose Manage QuickSight -> Security & permissions. Then do one of the following:
If the check box next to Amazon S3 is clear, enable it.
If the check box is already enabled, choose Details, and then choose Select S3 buckets.
These actions open the screen where you can choose S3 buckets. Choose the buckets that you want to access from Amazon QuickSight, choose Select, and then choose Update.
Hence, the correct answer is: Configure the permissions for the new S3 bucket from the Amazon QuickSight console.
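For context, granting bucket access under Security & permissions effectively gives the QuickSight service role read access to the bucket. The snippet below is only a hedged sketch of what such bucket-scoped permissions can look like, expressed as a Python dictionary; the bucket name is a made-up placeholder.

```python
import json

# Illustrative only: the kind of S3 read permissions QuickSight needs
# on the new data profiler bucket in order to import data into SPICE.
quicksight_s3_access = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::new-profiler-metrics-bucket"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:GetObjectVersion"],
            "Resource": ["arn:aws:s3:::new-profiler-metrics-bucket/*"],
        },
    ],
}

print(json.dumps(quicksight_s3_access, indent=2))
```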
The option that says: Configure the permissions for the S3 bucket from the Amazon Athena and Amazon QuickSight console is incorrect because the import to SPICE error is specific to Amazon QuickSight only, therefore, you only need to update the permissions from within the Amazon QuickSight console.
The option that says: Configure the permissions for the AWS Glue Data Catalog from the AWS Glue console is incorrect because you have to use the Amazon QuickSight Console to provide permission to your S3 bucket. Provisioning access to AWS Glue Data Catalog is not needed since it only contains the metadata and not the actual data.
The option that says: Configure the permissions for the new S3 bucket from the Amazon S3 console is incorrect because you need to authorize Amazon QuickSight to access your S3 bucket from within the Amazon QuickSight Console.
References: https://docs.aws.amazon.com/quicksight/latest/user/troubleshoot-athena-insufficient-permissions.html https://docs.aws.amazon.com/quicksight/latest/user/troubleshoot-connect-S3.html
Question 14 of 35
14. Question
A group of data scientists is conducting analytical research on current and past criminal activity in a particular city. Thousands of records are collected and dumped into a private Amazon S3 data lake. The group wants to analyze historical logs dating back 10 years to identify activity patterns and find out during which hour of the day the most crimes occur. The logs contain information such as date, district, address, and the NCIC (National Crime Information Center) code which describes the nature of the offense.
Which of the following methods will optimally improve query performance?
Correct
When you partition your data, you can restrict the amount of data that Redshift Spectrum scans by filtering on the partition key. You can partition your data by any key. A common practice is to partition the data based on time.
For example, you might choose to partition by year, month, date, and hour. If you have data coming from multiple sources, you might partition by a data source identifier and date.
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost.
Hence, the correct answer is: Use Apache ORC. Partition by date and sort by NCIC code.
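To make the partitioning idea concrete, the sketch below submits an Athena DDL statement through boto3 that defines an ORC table partitioned by date. The table, column, and bucket names are illustrative assumptions, not details from the scenario.

```python
import boto3

athena = boto3.client("athena")

# Illustrative DDL: an ORC table partitioned by date, so queries that
# filter on dt scan only the matching S3 prefixes.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS crime_logs (
    district   string,
    address    string,
    ncic_code  string,
    event_time timestamp
)
PARTITIONED BY (dt string)
STORED AS ORC
LOCATION 's3://crime-data-lake/logs/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://crime-data-lake/athena-results/"},
)
```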
The option that says: Use Apache Parquet. Partition by NCIC code and sort by date is incorrect because the common practice is to partition the data based on time. The scenario states that the data scientists analyze historical logs dating back 10 years to identify activity patterns and find out during which hour of the day the most crimes occur; it doesn't say that queries will be filtered by NCIC code.
The following options are both incorrect because, unlike ORC and Apache Parquet, these formats are not columnar and do not split as efficiently for parallel scans. Hence, they don't provide optimal performance.
Use compressed nested JSON partitioned by NCIC code and sorted by date
Use compressed .csv partitioned by date and sorted by NCIC code
References: https://docs.aws.amazon.com/athena/latest/ug/partitions.html https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
Question 15 of 35
15. Question
A real estate company wants to find the hottest suburbs for apartment rentals in the city. Currently, the data is stored in Amazon S3 in Apache Parquet format. The data analytics team needs to analyze the real estate data and present it to investors.
Which solution would offer the best visuals with the least amount of operational effort?
Correct
Amazon QuickSight is a business analytics service you can use to build visualizations, perform ad hoc analysis, and get business insights from your data. It can automatically discover AWS data sources and also works with your data sources. Amazon QuickSight enables organizations to scale to hundreds of thousands of users and deliver responsive performance using a robust in-memory engine (SPICE). Furthermore, Amazon QuickSight supports various data sources that you can use to provide data for analyses, such as Amazon Athena, Amazon Aurora, Amazon Redshift, and Amazon S3.
You can use files in Amazon S3 or on your local (on-premises) network as data sources. QuickSight supports files in the following formats:
CSV and TSV: Comma-delimited and tab-delimited text files
ELF and CLF: Extended and common log format files
JSON: Flat or semistructured data files
XLSX: Microsoft Excel files
Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL. It enables you to analyze a wide variety of data. It also includes tabular data in comma-separated value (CSV) or Apache Parquet files, data extracted from log files using regular expressions, and JSON-formatted data. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run.
Data visualization depends on the story you want to tell. Maps are great at visualizing your geographic data by location. The data on a map is often displayed in a colored area map or a bubble map. If the location is not part of the story, a map could be messy. With a table, you can display a large number of precise measures and dimensions. You can quickly look up or compare individual values while also showing grand totals. However, given the amount of data, tables take longer to digest.
Heat maps and pivot tables display data in a similar tabular fashion. Use a heat map if you want to identify trends and outliers because color makes these easier to spot. Use a pivot table if you're going to further analyze data on the visual, for example, by changing column sort order or applying aggregate functions across rows or columns.
Hence, the correct answer is: Use Amazon Athena for the data source. Pair it with Amazon QuickSight and display data with geospatial charts.
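As a hedged sketch of the chosen pairing, an Athena data source could be registered in QuickSight with a call like the one below; the account ID, data source ID, workgroup, and principal ARN are placeholder assumptions.

```python
import boto3

quicksight = boto3.client("quicksight")

# Placeholder identifiers, for illustration only.
quicksight.create_data_source(
    AwsAccountId="123456789012",
    DataSourceId="rental-analytics-athena",
    Name="Rental analytics (Athena)",
    Type="ATHENA",
    DataSourceParameters={"AthenaParameters": {"WorkGroup": "primary"}},
    Permissions=[
        {
            "Principal": "arn:aws:quicksight:us-east-1:123456789012:user/default/analyst",
            "Actions": [
                "quicksight:DescribeDataSource",
                "quicksight:DescribeDataSourcePermissions",
                "quicksight:PassDataSource",
            ],
        }
    ],
)
```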
The option that says: Use Amazon S3 for the data source. Pair it with Amazon QuickSight and display data with heat maps is incorrect. This solution is not possible because Amazon QuickSight does not directly support Parquet files in Amazon S3 as a data source.
The option that says: Use Amazon Redshift Cluster for the data source. Pair it with Amazon QuickSight and display data with heat maps is incorrect. Although this solution works, creating and maintaining a Redshift cluster takes more operational effort than using Amazon Athena.
The option that says: Use Amazon Athena for the data source. Pair it with Amazon QuickSight and display data with pivot tables is incorrect. A more visual chart type, such as a geospatial map or a heat map, would tell the data story better than pivot tables.
References: https://docs.aws.amazon.com/quicksight/latest/user/supported-data-sources.html https://aws.amazon.com/blogs/big-data/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight/ https://docs.aws.amazon.com/quicksight/latest/user/working-with-visual-types.html
Question 16 of 35
16. Question
A Security Analyst uses AWS Web Application Firewall (WAF) to protect a web application hosted on an EC2 instance from common web exploits. The AWS WAF sends web ACL traffic logs to an Amazon Kinesis Data Firehose delivery stream for format conversion and uses an Amazon S3 bucket to store the processed logs.
The analyst is looking for a cost-efficient solution to perform infrequent log analysis and data visualizations with minimal development effort.
Which approach best fits the requirements?
Correct
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Forensic data grows quickly, so using a relational database means you might quickly outgrow your capacity. Instead, take advantage of AWS serverless technologies like AWS Glue, Athena, and Amazon QuickSight. These technologies enable forensic analysis without the operational overhead you would experience with Elasticsearch or a relational database.
With your forensic tools now in place, you can use Athena to query your data and analyze the results. This lets you refine the data for your Kibana visualizations, or directly load it into Amazon QuickSight for additional visualization.
Hence, the correct answer is: Run an AWS Glue crawler that connects to the Amazon S3 bucket and write tables in the AWS Glue Data Catalog. Perform ad-hoc analysis using Amazon Athena and save the query results in a separate S3 bucket. Use Amazon QuickSight to create data visualizations from the S3 bucket.
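A minimal boto3 sketch of the crawler setup described above might look like the following; the bucket, database, role, and schedule are assumptions rather than values from the scenario.

```python
import boto3

glue = boto3.client("glue")

# Placeholder names, for illustration only.
glue.create_crawler(
    Name="waf-logs-crawler",
    Role="arn:aws:iam::123456789012:role/glue-waf-logs-crawler-role",
    DatabaseName="waf_logs",
    Targets={"S3Targets": [{"Path": "s3://waf-processed-logs/"}]},
    # The logs are analyzed infrequently, so a daily crawl is sufficient.
    Schedule="cron(0 1 * * ? *)",
)

glue.start_crawler(Name="waf-logs-crawler")
```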
The option that says: Stream the log data through a separate Kinesis Data Firehose delivery stream and deliver the processed files to Amazon Elasticsearch Service. Execute text-based queries of the logs for ad-hoc analysis. Use Kibana to create data visualizations is incorrect. Although this could be a possible solution, it is not cost-efficient enough for the scenario. An Amazon Elasticsearch Service domain is relatively more expensive because its instances run continuously and are billed per hour, even though the logs are only analyzed infrequently. Amazon Athena is more suitable for ad-hoc analysis.
The option that says: Configure an AWS Lambda Function for the Kinesis Data Firehose delivery stream to transform logs into CSV format. Create an Amazon Redshift cluster to perform ad-hoc analysis using SQL queries and send the results in an Amazon S3 bucket. Use Amazon QuickSight to create data visualizations is incorrect because you need to provision a Redshift cluster to perform analysis, which needs to be running all the time. Therefore, this is not a cost-efficient solution.
The option that says: Use Apache Spark running in an Amazon EMR cluster to perform ad-hoc analysis against the logs stored in the S3 bucket. Use Amazon QuickSight to create data visualizations is incorrect. Just like the previous option that uses Amazon Redshift, provisioning an EMR cluster for ad-hoc analysis is likewise not cost-effective. Amazon Athena should be used for ad-hoc analysis.
References: https://aws.amazon.com/blogs/big-data/analyzing-aws-waf-logs-with-amazon-es-amazon-athena-and-amazon-quicksight/ https://aws.amazon.com/athena/faqs/
Question 17 of 35
17. Question
A digital banking firm with more than a million clients globally uploads platform activity data in compressed files and stores it directly into an Amazon S3 bucket. Every four hours, a cron job running in an Amazon EC2 instance extracts that data from S3 and processes it for the business intelligence reporting dashboard. The product management team heavily relies on the reports to understand its clients further. The team of data analysts wants to upgrade this feature by speeding up the analysis workflow and refreshing the dashboard with aggregated data in minutes instead. Moreover, they want to include a search feature that gets refreshed with more options. Which steps should the team take to achieve this feature in the most efficient and cost-effective way? (Select TWO.)
Correct
Amazon Elasticsearch Service (Amazon ES) is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. With Amazon ES, you get direct access to the Elasticsearch APIs; existing code and applications work seamlessly with the service.
You can use AWS Lambda to send data to your Amazon ES domain from Amazon S3. New data that arrives in an S3 bucket triggers an event notification to Lambda, which then runs your custom code to perform the indexing. Once you create the function, you must add a trigger.
Meanwhile, Kibana is tightly integrated with Amazon Elasticsearch Service (Amazon ES), a search and analytics engine, to simplify the analysis of large volumes of data. With its simple, browser-based interface, Amazon ES enables you to create and share dynamic dashboards quickly. Using Kibana is free; you only have to pay for the infrastructure where the software is installed.
Hence, in this scenario, the best answers are: Write an AWS Lambda function that sends that data to Amazon Elasticsearch Service directly from Amazon S3 at the desired schedule interval. Use Kibana as a visualization tool that processes the data in the Amazon Elasticsearch Service and set the refresh interval at the desired value.
The option that says: Create an Amazon Kinesis Data stream and create a custom-built consumer that sends data to Amazon Elasticsearch Service is incorrect since an Amazon Kinesis data stream cannot send data directly to Amazon Elasticsearch Service. Moreover, the use of a Kinesis data stream is not necessary, given that the team does not need real-time data analysis.
The option that says: Create an Amazon Kinesis Data Firehose delivery stream and embed an AWS Lambda function that uploads the compressed files to Amazon Elasticsearch Service is incorrect since Amazon Kinesis Data Firehose incurs costs, and that is unnecessary given that the team does not require Firehose's near real-time data analysis. An AWS Lambda function that runs every couple of minutes would be enough.
The option that says: Use Amazon QuickSight as a visualization tool that processes the data in the Amazon Elasticsearch Service is incorrect because Kibana is a better choice. It is tightly integrated with Amazon Elasticsearch Service, and the use of the tool is free. Amazon QuickSight is only free if it has a single user.
References: https://aws.amazon.com/blogs/database/configuring-and-authoring-kibana-dashboards/ https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-aws-integrations.html https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-kibana.html
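A simplified sketch of such a Lambda function is shown below. It assumes the domain endpoint is passed through an environment variable, that the function's execution role is allowed to write to the domain, and that each uploaded file contains one JSON document per line; it also uses the third-party requests and requests-aws4auth packages, which would have to be bundled with the function.

```python
import os

import boto3
import requests
from requests_aws4auth import AWS4Auth

# Assumed environment: ES_ENDPOINT is set on the function, e.g.
# https://search-platform-activity-xyz.us-east-1.es.amazonaws.com
ES_ENDPOINT = os.environ["ES_ENDPOINT"]
REGION = os.environ.get("AWS_REGION", "us-east-1")

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   REGION, "es", session_token=credentials.token)
s3 = boto3.client("s3")


def handler(event, context):
    # Triggered by the S3 event notification for each newly uploaded file.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # In the real scenario the files are compressed and would need to be
        # decompressed here before indexing.
        for line in body.decode("utf-8").splitlines():
            if not line.strip():
                continue
            requests.post(
                f"{ES_ENDPOINT}/platform-activity/_doc",
                auth=awsauth,
                headers={"Content-Type": "application/json"},
                data=line,
            )
```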
Question 18 of 35
18. Question
A company is receiving large amounts of files from multiple sources. All files are merged and compressed into a single 150 GB Gzip file and uploaded to an S3 bucket at the end of the business day. The file is then loaded into an Amazon Redshift cluster using the COPY command. What changes should be made to the current configuration to expedite the COPY process?
Correct
A single large compressed file is loaded by a single slice, so splitting it into multiple compressed files lets every slice load in parallel from one COPY command. Hence, the correct answer is: Split the gzip file into smaller files in such a way that the number of files is a multiple of the number of slices in the Amazon Redshift cluster, then upload them to the S3 bucket and load them with a single COPY command.
The option that says: Don't merge the files. Upload them individually to the S3 bucket and run multiple COPY commands is incorrect because this solution will not process the files in parallel. Instead, it will force Redshift to perform a serialized load, which is much slower.
The option that says: Split the gzip file into smaller files in such a way that the number of files is a multiple of the number of the Redshift cluster's compute nodes then upload them to the S3 bucket is incorrect. You must ensure that the number of files is a multiple of the number of slices in the Amazon Redshift cluster, and not the number of its compute nodes.
The option that says: Use the AUTO distribution style to optimize the distribution of data across the cluster nodes is incorrect. The distribution style has nothing to do with how fast you load data from Amazon S3 to Redshift.
References: https://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-source-s3.html
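As a rough illustration of the parallel load described above, the following Python sketch submits a single COPY over a split, gzip-compressed key prefix through the Amazon Redshift Data API. The cluster identifier, database, table, bucket prefix, and IAM role ARN are all placeholder assumptions.

```python
import boto3

# Placeholder identifiers -- substitute your own cluster, database, and role.
CLUSTER_ID = "analytics-cluster"
DATABASE = "dev"
DB_USER = "awsuser"
IAM_ROLE = "arn:aws:iam::123456789012:role/RedshiftCopyRole"

# A single COPY pointing at a key prefix loads all matching split files in
# parallel across the cluster's slices; GZIP tells Redshift that each part
# is individually compressed.
copy_sql = f"""
COPY daily_transactions
FROM 's3://example-transactions-bucket/2024-01-15/part-'
IAM_ROLE '{IAM_ROLE}'
GZIP
DELIMITER '|'
"""

client = boto3.client("redshift-data")
response = client.execute_statement(
    ClusterIdentifier=CLUSTER_ID,
    Database=DATABASE,
    DbUser=DB_USER,
    Sql=copy_sql,
)
print("Statement submitted:", response["Id"])
```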
Question 19 of 35
19. Question
A food delivery service startup has thousands of riders that serve hundreds of thousands of customers every day. The number of users is expected to increase due to the effect of the pandemic. As a response, the company's Data Analyst has decided to move the existing data to Amazon Redshift with the following schema:
A trips fact table that contains details about completed deliveries.
A riders dimension table for rider profiles.
A customer fact table for customer profiles.
The Data Analyst wants to evaluate profitability by analyzing the delivery date and time as well as the destination of each trip. The riders' data almost never changes, while the customers' data changes frequently.
How should the Data Analyst design the tables to achieve optimal query performance?
Correct
DISTSTYLE defines the data distribution style for the whole table. Amazon Redshift distributes the rows of a table to the compute nodes according to the distribution style specified for the table. The distribution style that you select for tables affects the overall performance of your database.
DISTSTYLE EVEN: The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column. EVEN distribution is appropriate when a table does not participate in joins or when there is not a clear choice between KEY distribution and ALL distribution.
DISTSTYLE KEY: The rows are distributed according to the values in one column. The leader node places matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.
DISTSTYLE ALL: A copy of the entire table is distributed to every node. Where EVEN distribution or KEY distribution places only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for every join that the table participates in. ALL distribution is appropriate only for relatively slow-moving tables; that is, tables that are not updated frequently or extensively. Because the cost of redistributing small tables during a query is low, there isn't a significant benefit to defining small dimension tables as DISTSTYLE ALL.
DISTSTYLE ALL is more appropriate for the riders table since it is not updated frequently. You should apply DISTSTYLE EVEN to the customers table because it is more suitable for frequently changing data. Finally, apply the DISTSTYLE KEY to the trips table and use the destination as the DISTKEY.
Hence, the correct answer is: Designate a DISTSTYLE KEY (destination) distribution for the Trips table and sort by delivery time. Use DISTSTYLE ALL for the Riders table. Use DISTSTYLE EVEN for the Customers table.
The option that says: Designate a DISTSTYLE KEY (destination) distribution for the Trips table and sort by delivery time. Use a DISTSTYLE ALL distribution for the Riders and Customers tables is incorrect. Although DISTSTYLE KEY is the suitable type for the Trips table, DISTSTYLE ALL is not the right type to use for data that frequently changes, like the Customers table.
The option that says: Designate a DISTSTYLE EVEN distribution for the Trips table and sort by delivery time. Use DISTSTYLE ALL for the Riders table. Use DISTSTYLE EVEN for the Customers table is incorrect because DISTSTYLE KEY is the suitable type for the Trips table where the rows are distributed according to the values of one column, which should be the destination column.
The option that says: Designate a DISTSTYLE EVEN distribution for the Riders table and sort by delivery time. Use DISTSTYLE ALL for both fact tables is incorrect because the DISTSTYLE ALL type is not suitable for data that frequently changes, like the Customers table. You should also use the DISTSTYLE KEY (destination) for the Trips table and sort by delivery time.
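A minimal DDL sketch of the recommended design is shown below, submitted through the Amazon Redshift Data API. The column names, data types, and cluster identifiers are illustrative assumptions, not details taken from the scenario.

```python
import boto3

# Trips: large fact table, distributed on the join/filter column and
# sorted by delivery time for time-based analysis.
TRIPS_DDL = """
CREATE TABLE trips (
    trip_id       BIGINT,
    rider_id      BIGINT,
    customer_id   BIGINT,
    destination   VARCHAR(128),
    delivery_time TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (destination)
SORTKEY (delivery_time)
"""

# Riders: small, slow-changing table, so replicate a copy to every node.
RIDERS_DDL = """
CREATE TABLE riders (
    rider_id   BIGINT,
    rider_name VARCHAR(128)
)
DISTSTYLE ALL
"""

# Customers: frequently changing table with no dominant join key.
CUSTOMERS_DDL = """
CREATE TABLE customers (
    customer_id   BIGINT,
    customer_name VARCHAR(128)
)
DISTSTYLE EVEN
"""

client = boto3.client("redshift-data")
client.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",   # placeholder
    Database="dev",
    DbUser="awsuser",
    Sqls=[TRIPS_DDL.strip(), RIDERS_DDL.strip(), CUSTOMERS_DDL.strip()],
)
```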
References: https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html https://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html https://www.youtube.com/watch?v=TFLoCLXulU0
Question 20 of 35
20. Question
A company wants to implement a continuous monitoring system for the advertisement videos on its social media application. The system will be designed to detect sentiment changes in user feeds and track all video playback issues. The company will collect and analyze data to react to the user sentiment in less than 30 seconds. The transmitted data is in JSON format with a consistent, well-defined schema.
Which collection and processing methods should the company do to meet these requirements?
Correct
You can use Amazon Kinesis Data Streams to build your own streaming application. This application can process and analyze streaming data by continuously capturing and storing terabytes of data per hour from hundreds of thousands of sources.
Kinesis Data Analytics provides an easy and familiar standard SQL language to analyze streaming data in real time. One of its most powerful features is that there are no new languages, processing frameworks, or complex machine learning algorithms that you need to learn.
Amazon Kinesis Data Analytics uses JSONPath expressions in the application's source schema to identify data elements in a streaming source that contains JSON-format data. Kinesis Data Analytics supports Amazon Kinesis Data Firehose (Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk), AWS Lambda, and Amazon Kinesis Data Streams as destinations.
In the scenario, we can use a Kinesis data stream to ingest data in near-real-time (30 seconds). Then, use Kinesis Data Analytics to consume and process the streaming data in real time. The raw data will be stored in an Amazon S3 bucket.
Hence, the correct answer is: Ingest the incoming data into Amazon Kinesis Data Streams, and choose an Amazon Kinesis Data Analytics (KDA) application as the destination to process, detect, and react to a sentiment change. Configure a Kinesis Data Firehose delivery stream as an output of the KDA application to store the raw JSON data in an Amazon S3 bucket.
The option that says: Ingest the incoming data into Amazon Kinesis Data Firehose, and deliver the data to an Amazon Kinesis Data Analytics (KDA) application to process, detect, and react to a sentiment change. Configure the KDA application to directly send the raw JSON data to an Amazon S3 bucket is incorrect because you must configure Kinesis Data Analytics to use an Amazon Kinesis Data Firehose first, in order to send the output to Amazon S3.
The option that says: Ingest the data to Amazon Managed Streaming for Kafka (Amazon MSK), and choose a Kinesis Data Analytics (KDA) application as the destination to process, detect, and react to a sentiment change. Directly store the raw output data in a DynamoDB table from the KDA application is incorrect. Amazon Kinesis Data Analytics doesn't directly support DynamoDB as a destination. It only supports Amazon Kinesis Data Firehose, AWS Lambda, and Amazon Kinesis Data Streams as outputs. You can use Kinesis Data Firehose to move the output data to Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk.
The option that says: Ingest the incoming data into Amazon Kinesis Data Streams and deliver the data into an S3 bucket. Enable event notifications to trigger a Lambda function that will process, detect, and react to a sentiment change. Store the raw data in a DynamoDB table is incorrect because Amazon Kinesis Data Streams cannot directly deliver data to Amazon S3. It only supports Lambda, Kinesis Data Analytics, and applications hosted in Amazon EC2 that use the Kinesis Client Library (KCL).
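For illustration, a producer for this pipeline could put the well-defined JSON events into the Kinesis data stream as in the following sketch. The stream name, event shape, and choice of partition key are assumptions made for the example.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Placeholder stream name and event shape for illustration only.
STREAM_NAME = "social-feed-events"


def publish_event(event: dict) -> None:
    """Push one well-formed JSON event into the Kinesis data stream.

    Partitioning by user_id keeps each user's events on the same shard,
    so downstream consumers such as the KDA application see them in order.
    """
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),
    )


publish_event({"user_id": 42, "type": "playback_issue", "sentiment": -0.7})
```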
References: https://aws.amazon.com/blogs/big-data/perform-near-real-time-analytics-on-streaming-data-with-amazon-kinesis-and-amazon-elasticsearch-service/ https://aws.amazon.com/kinesis/data-streams/faqs/ https://aws.amazon.com/about-aws/whats-new/2019/11/you-can-now-run-fully-managed-apache-flink-applications-with-apache-kafka/
Question 21 of 35
21. Question
A digital marketing firm has been managing social media activity for a client. The posts are collected in an Amazon Kinesis data stream, and its shards are partitioned based on username. Posts from each user must be validated in the same order they were received before transferring them into an Amazon Elasticsearch cluster.
Lately, the Data Analyst has observed that the posts are slow to appear in the Elasticsearch service, frequently taking more than 30 minutes to show up during peak hours.
What should the Data Analyst do to reduce the latency issues?
Correct
AWS Lambda supports Parallelization Factor, a feature that allows you to process one shard of a Kinesis or DynamoDB data stream with more than one Lambda invocation simultaneously. This feature allows you to build more agile stream processing applications on volatile data traffic. By default, Lambda invokes a function with one batch of data records from one shard at a time. For a single event source mapping, the maximum number of concurrent Lambda invocations is equal to the number of Kinesis or DynamoDB shards.
A consumer is an application that processes all data from a Kinesis data stream. Parallelization Factor can be set to increase concurrent Lambda invocations for each shard, which by default is 1. It allows for faster stream processing without the need to over-scale the number of shards while still guaranteeing the order of records processed.
Conversely, enhanced fan-out allows developers to scale up the number of stream consumers (applications reading data from a stream in real-time) by offering each stream consumer its own read throughput. The HTTP/2 data retrieval API allows data to be delivered from producers to consumers in 70 milliseconds or better (a 65% improvement) in typical scenarios. These new features enable developers to build faster, more reactive, highly parallel, and latency-sensitive applications on top of Kinesis Data Streams.
In this scenario, the goal is to minimize latency during peak events, which implies that the consumer application falls behind whenever there is more data to process. Hence, the correct answer is: Use multiple AWS Lambda functions to process the Kinesis data stream using the Parallelization Factor feature.
The option that says: Reshard the stream to increase the number of shards and change the partition key to social media post views instead is incorrect because this does not directly address the latency issue, which occurs at the consumer application and not at the producer side that writes records into the stream. Changing the partition key to post views would also break the required per-user ordering.
The option that says: Use Amazon Kinesis Data Firehose to read and validate the social media posts before transferring to the Elasticsearch cluster is incorrect. Kinesis Data Firehose is a buffered delivery service, so this change would most likely add latency to the processing rather than reduce it.
The option that says: Instead of a standard data stream iterator, use an HTTP/2 stream consumer instead is incorrect. Although this potentially speeds up the validation process, you have to configure stream HTTP/2 consumers with an enhanced fan-out feature to benefit from the performance advantages.
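As a sketch of the recommended change, the event source mapping between the stream and the validation function can be created with a Parallelization Factor greater than 1. The function name, stream ARN, and batch size below are placeholder assumptions.

```python
import boto3

lambda_client = boto3.client("lambda")

# Placeholder ARN and function name -- the validation function and stream
# names are assumptions for this example.
response = lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/social-posts",
    FunctionName="validate-posts",
    StartingPosition="LATEST",
    BatchSize=500,
    # Up to 10 concurrent Lambda invocations per shard; records that share
    # a partition key (the username) are still processed in order.
    ParallelizationFactor=5,
)
print("Event source mapping UUID:", response["UUID"])
```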
References: https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html https://aws.amazon.com/blogs/compute/new-aws-lambda-scaling-controls-for-kinesis-and-dynamodb-event-sources/ https://aws.amazon.com/blogs/compute/increasing-real-time-stream-processing-performance-with-amazon-kinesis-data-streams-enhanced-fan-out-and-aws-lambda/
Question 22 of 35
22. Question
A company has moved its data transformation job to an Amazon EMR cluster with Apache Pig. The cluster uses On-Demand instances to process large datasets, and the output is critical to operations. It typically takes around 1 hour to complete the job. Even so, the company must ensure that the whole process strictly adheres to the service level agreement (SLA) of 2 hours. The company is looking for a solution that reduces cost with a negligible impact on availability.
Which combination of solutions should be implemented to meet these requirements? (Select TWO.)
Correct
The instance fleets configuration for a cluster offers the widest variety of provisioning options for EC2 instances. For instance fleets, you specify target capacities for On-Demand Instances and Spot Instances within each fleet. When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. You can specify up to five EC2 instance types per fleet for Amazon EMR to use when fulfilling the targets. You can also select multiple subnets for different Availability Zones. When Amazon EMR launches the cluster, it looks across those subnets to find the instances and purchasing options you specify.
This option will allow you to run your EMR Instance Fleet on Spot Blocks, which are uninterrupted Spot Instances, available for 1-6 hours, at a lower discount compared to Spot Instances.
You can specify that, if EMR is unable to provision your selected Spot Instances within a set number of minutes due to lack of capacity, it will either start On-Demand Instances instead or terminate the cluster. Which behavior to choose depends on the business definition of the cluster or application: if it is SLA-bound and should complete even at the On-Demand price, then the Switch to On-Demand option is suitable.
Hence, the correct answers are:
Configure an Amazon EMR cluster that uses instance fleets.
Assign Spot capacity for all node types and enable the Switch to On-Demand instances option.
The option that says: Configure an Amazon EMR cluster that uses uniform instance groups is incorrect because instance fleets provide a greater cost reduction than uniform instance groups.
The option that says: Add a task node that runs on Spot instance is incorrect. A task node is an optional cluster node that you can use to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors. Since the current time for the job to complete is less than the SLA, there is no need to add a task node as it will just incur additional costs.
The option that says: Use Spot instance for all node types is incorrect. Spot Instances alone are prone to interruption and are therefore not suitable for running core nodes. An instance fleet can use Spot Blocks, which are uninterrupted Spot Instances available for 1 to 6 hours.
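The following sketch shows, under assumed names, instance types, and capacities, how an instance fleet cluster could be launched with Spot targets that switch to On-Demand Instances if Spot capacity cannot be provisioned in time, protecting the 2-hour SLA.

```python
import boto3

emr = boto3.client("emr")


def spot_fleet(name, fleet_type, capacity, instance_types):
    """Fleet that targets Spot capacity but switches to On-Demand if Spot
    cannot be provisioned within 20 minutes."""
    return {
        "Name": name,
        "InstanceFleetType": fleet_type,
        "TargetSpotCapacity": capacity,
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 20,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


# Placeholder names, instance types, and counts -- illustrative only.
response = emr.run_job_flow(
    Name="pig-transform",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Pig"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceFleets": [
            spot_fleet("master", "MASTER", 1, ["m5.xlarge"]),
            spot_fleet("core", "CORE", 4, ["m5.xlarge", "m5a.xlarge", "r5.xlarge"]),
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
)
print("Cluster ID:", response["JobFlowId"])
```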
References: https://ec2spotworkshops.com/running_spark_apps_with_emr_on_spot_instances/fleet_config_options.html https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html#emr-plan-task https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html
Question 23 of 35
23. Question
A company is running an iterative data processing workload on an Amazon EMR cluster. Each day, the workflow begins by loading log files into an Amazon S3 bucket. The EMR cluster processes them in 20 batch jobs, with each job taking about 30 minutes to complete. The company wants to further reduce EMR cost.
Which configuration should be done to meet these requirements?
Correct
Transient EMR clusters are just clusters that shut down when the job or the steps (series of jobs) are complete. In contrast, persistent clusters continue to run after data processing is complete. If you determine that your cluster will be idle for the majority of the time, it is best to use transient clusters. For example, if you have a batch-processing job that pulls your weblogs from Amazon S3 and processes the data once a day, it is more cost-effective to use transient clusters to process your data and shut down the nodes when the processing is complete.
As a rule of thumb, use transient clusters if your total number of Amazon EMR processing hours per day is less than 24. Since the total processing time for the scenario is about 600 minutes/10 hours, we can further reduce EMR costs by using a transient cluster.
Hence, the correct answer is: Use transient Amazon EMR clusters. Shut down the cluster when the log processing is done.
The option that says: Use persistent Amazon EMR clusters. Shut down the cluster when the log processing is done is incorrect because using persistent EMR clusters means that you continue running the EMR clusters even after data processing is complete.
The option that says: Trigger a Lambda function to process the batch jobs is incorrect because an AWS Lambda function cannot run for more than 15 minutes.
The option that says: Create a long-running EMR cluster that uses instance fleets is incorrect. Although you can reduce EMR cost this way, it is not the best option for the scenario. The batch process only runs for 10 hours a day, which means running the EMR cluster for the remaining 14 hours is just a waste of money.
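A transient cluster can be launched with the day's batch jobs submitted as steps and automatic shutdown once they finish, roughly as in the sketch below. The release label, instance sizing, script location, and the use of spark-submit are assumptions for illustration.

```python
import boto3

emr = boto3.client("emr")

# Placeholder script location and sizing -- illustrative only.
step = {
    "Name": "daily-log-batch",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://example-bucket/jobs/process_logs.py"],
    },
}

response = emr.run_job_flow(
    Name="transient-log-processing",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        # The cluster shuts itself down once all submitted steps finish,
        # so you only pay for the ~10 hours of actual processing.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[step] * 20,   # the 20 daily batch jobs from the scenario
)
print("Cluster ID:", response["JobFlowId"])
```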
References: https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview-benefits.html
Question 24 of 35
24. Question
A retail company stores inventory and historical transaction records in an Amazon S3 bucket integrated with the AWS Glue Data Catalog. The customer sales report data is sent and stored every evening to an Amazon Redshift cluster. To complete the processing, the historical transactions must be joined with the sales report data. The Data Analyst is looking for a solution to significantly reduce the workload of the cluster, as it is already overutilized. The solution must be serverless and require minimal configuration effort.
Which of the following configurations will be able to meet the above requirements?
Correct
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can use Amazon Redshift to efficiently query and retrieve structured and semi-structured data from files in S3 without having to load the data into Amazon Redshift native tables. You can create Amazon Redshift external tables by defining the structure for files and registering them as tables in the AWS Glue Data Catalog.
You create Redshift Spectrum tables by defining the structure for your files and registering them as tables in an external data catalog. The external data catalog can be AWS Glue, the data catalog that comes with Amazon Athena, or your own Apache Hive metastore. To allow Amazon Redshift to view tables in the AWS Glue Data Catalog, add glue:GetTable to the Amazon Redshift IAM role.
Hence, the correct answer is: Using Amazon Redshift Spectrum, provision an external table for the customer sales report data and join the tables using Redshift SQL queries.
The option that says: Migrate the customer sales report data from the Amazon Redshift cluster to an EMR cluster via Apache Sqoop. Use Apache Hive to join the tables is incorrect because Amazon EMR is not a serverless service. It also entails a lot of effort to launch an Amazon EMR cluster and configure Apache Hive to join the tables.
The option that says: Unload the customer sales report data from the Amazon Redshift cluster to an S3 bucket via a Lambda function. Create an AWS Glue ETL script to join the tables is incorrect. Although this is a possible solution, it will require hours of development activity to create AWS Glue ETL scripts to join the data in Redshift and in Amazon S3. A better solution here is to use Amazon Redshift Spectrum.
The option that says: Move the customer sales report data from the Amazon Redshift cluster using a Python shell job in AWS Glue. Create an AWS Glue ETL script to join the tables is incorrect because using a custom Python shell job entails a lot of time and development effort. Since the S3 bucket is already integrated with the AWS Glue Data Catalog, you can simply register the catalog tables as external tables and join them using Amazon Redshift Spectrum.
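A rough sketch of the Redshift Spectrum approach is shown below: an external schema is created over the existing Glue Data Catalog database and then joined with the local sales report table. The catalog database, table, column, cluster, and role names are placeholder assumptions.

```python
import boto3

client = boto3.client("redshift-data")

# Placeholder cluster, database, role, and table/column names.
sqls = [
    # Expose the existing AWS Glue Data Catalog database as an external schema.
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_history
    FROM DATA CATALOG
    DATABASE 'retail_catalog'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    """,
    # Join the S3-backed historical transactions with the local sales report.
    """
    SELECT s.report_date, SUM(h.amount) AS historical_total
    FROM sales_report s
    JOIN spectrum_history.transactions h
      ON h.customer_id = s.customer_id
    GROUP BY s.report_date
    """,
]

for sql in sqls:
    client.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql.strip(),
    )
```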
References: https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html https://aws.amazon.com/blogs/big-data/analyze-your-amazon-s3-spend-using-aws-glue-and-amazon-redshift/
Question 25 of 35
25. Question
A company is using AWS Glue to perform ETL jobs on a 120 GB dataset. The job is triggered to run with the Standard worker type. A Data Analyst noticed that the job was still running after 2 hours, and there were no errors found in the logs. Three hours later, the ETL job finally completed the processing. The Data Analyst needs to improve the job execution time in AWS Glue.
Which of the following should be implemented to achieve this requirement?
Correct
An AWS Glue job encapsulates a script that connects to your source data, processes it, and then writes it out to your data target. Typically, a job runs extract, transform, and load (ETL) scripts. AWS Glue triggers can start jobs based on a schedule, an event, or on demand. You can monitor job runs to understand runtime metrics such as completion status, duration, and start time.
It's stated in the scenario that the worker type allocated to the ETL job is the Standard worker type, which has a 50 GB disk and 2 executors. A data processing unit (DPU) is a relative measure of processing power that consists of vCPUs and memory. To improve the job execution time, you can enable job metrics in AWS Glue to estimate the number of DPUs needed to scale out the job, and then raise the job's maximum capacity accordingly.
Hence, the correct answer is: Modify the job properties and enable job metrics to evaluate the required number of DPUs. Change the maximum capacity parameter value and set it to a higher number.
The option that says: Modify the job properties and enable job bookmarks to evaluate the required number of DPUs. Set a higher parameter value for the number of memory per executor is incorrect because a job bookmark just maintains state information and prevents AWS Glue from reprocessing old data. Job bookmarks are primarily used for incremental data processing. You can't use job bookmarks to improve the job execution time.
The option that says: Modify the job properties and enable job bookmarks to evaluate the required number of DPUs. Set a higher parameter value for the number of executors per node is incorrect. Just like the option above, you must enable job metrics and increase the maximum capacity value.
The option that says: Modify the job properties and enable job metrics to evaluate the required number of DPUs. Change the spark.yarn.executor.memoryOverhead parameter value and set it to a higher number is incorrect. The spark.yarn.executor.memoryOverhead parameter is just the part of the memory allocation that executors use in addition to the executor memory. This memory accounts for things like VM overheads, interned strings, and other native overheads. If you have insufficient memory overhead, you might encounter an error that says "Container killed by YARN for exceeding memory limits." We can eliminate this option since it was mentioned in the scenario that no errors were found in the logs and the job completed after 3 hours. In this case, it's just a matter of increasing the DPUs to speed up the execution time.
References: https://docs.aws.amazon.com/glue/latest/dg/add-job.html https://docs.aws.amazon.com/glue/latest/dg/monitor-debug-capacity.html https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
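As a hedged sketch of the chosen approach, the following enables job metrics and raises the maximum capacity via boto3. The job name and the 20-DPU value are hypothetical; UpdateJob replaces the whole job definition, so the current definition is fetched first and reused.

```python
import boto3

glue = boto3.client("glue")

job_name = "sales-etl-job"  # hypothetical job name

# UpdateJob resets anything you leave out, so start from the current definition.
current = glue.get_job(JobName=job_name)["Job"]

job_update = {
    "Role": current["Role"],
    "Command": current["Command"],
    # Turn on job metrics so the DPU estimate shows up in CloudWatch.
    "DefaultArguments": {**current.get("DefaultArguments", {}), "--enable-metrics": "true"},
    # Scale out by raising maximum capacity (DPUs). Note that MaxCapacity cannot
    # be combined with WorkerType/NumberOfWorkers in the same job definition.
    "MaxCapacity": 20.0,
}

glue.update_job(JobName=job_name, JobUpdate=job_update)
```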
Question 26 of 35
26. Question
You are working as a data engineer within a financial institution. You‘re required to move a large amount of data gathered across various datasets in S3 to a Redshift cluster. You‘ve attached an IAM role to your cluster and have issued a COPY command to move data from the S3 bucket into your Redshift database. After a while, you check and notice that the data has not been populated in Redshift. Which of the following errors could be causing the issue with your data population?
Correct
When using the COPY command to move data from an S3 bucket, two things are required: an IAM role for accessing S3 resources, and ensuring that data is either auto-committed or there‘s an explicit COMMIT at the end of your COPY command to save the changes uploaded from S3.
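As a minimal illustration of both requirements, here is a psycopg2 sketch; the connection details, table, bucket, and role ARN are hypothetical placeholders.

```python
import psycopg2

# Hypothetical connection details and IAM role ARN.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5432,
    dbname="dev",
    user="awsuser",
    password="********",
)
conn.autocommit = False  # changes must then be committed explicitly

with conn.cursor() as cur:
    cur.execute("""
        COPY sales_staging
        FROM 's3://my-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """)

conn.commit()  # without this (or autocommit), the loaded rows are not persisted
conn.close()
```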
Question 27 of 35
27. Question
You are part of a team of engineers building an attendance tracking system used to keep track of students in a university classroom. The students will be sent a unique QR code to their email address each day before a particular class starts. The QR code will then be scanned as the student enters the university classroom, and they will be marked present for class. It is expected that the creation and scanning of the QR codes will happen at various times throughout the day, and high traffic spikes will happen regularly. It‘s also important that the data is highly durable and can be read and written with low latency. What bundle of AWS services do you suggest using to meet all of the requirements to build the attendance tracking system?
Correct
API Gateway provides the REST API and invokes Lambda to implement the functionality. Using DynamoDB for the storage needs is a great solution, providing high durability and low latency for your application. Query Your AWS Database From Your Serverless Application
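A minimal sketch of the Lambda side of this design, assuming a hypothetical DynamoDB table named Attendance and a non-proxy API Gateway integration that maps the request fields directly into the event:

```python
import boto3
from datetime import datetime, timezone

# Hypothetical table: partition key student_id, sort key class_date.
table = boto3.resource("dynamodb").Table("Attendance")

def lambda_handler(event, context):
    """Invoked via API Gateway when a QR code is scanned at the classroom door."""
    record = {
        "student_id": event["student_id"],
        "class_date": event["class_date"],
        "scanned_at": datetime.now(timezone.utc).isoformat(),
        "status": "PRESENT",
    }
    table.put_item(Item=record)
    return {"statusCode": 200, "body": "attendance recorded"}
```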
Question 28 of 35
28. Question
You are working as a data analyst for a marketing agency. Through a mobile app, the company gathers data from thousands of mobile devices every hour, which is stored in an S3 bucket. The COPY command is used to move data to a Redshift cluster for further analysis. The data reconciliation team is concerned that some of the original data present in the S3 files might be missing from the Redshift cluster. Which of the following actions would you take to mitigate this issue with the least amount of development effort?
Correct
Amazon S3 provides strong read-after-write consistency for COPY, UNLOAD, INSERT (external table), CREATE EXTERNAL TABLE AS, and Amazon Redshift Spectrum operations on Amazon S3 buckets in all AWS Regions. You can use a manifest to ensure that the COPY command loads all of the required files, and only the required files, for a data load. Managing Data Consistency Using a manifest to specify data files
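For illustration, a sketch of a manifest-based load; the bucket, file names, cluster, and role ARN are hypothetical placeholders.

```python
import json
import boto3

s3 = boto3.client("s3")
rsd = boto3.client("redshift-data")

# Hypothetical manifest listing the exact files that the load must include.
manifest = {
    "entries": [
        {"url": "s3://my-bucket/mobile/2023-05-01-00.gz", "mandatory": True},
        {"url": "s3://my-bucket/mobile/2023-05-01-01.gz", "mandatory": True},
    ]
}
s3.put_object(
    Bucket="my-bucket",
    Key="manifests/mobile.manifest",
    Body=json.dumps(manifest),
)

# COPY with the MANIFEST keyword loads all, and only, the listed files.
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        COPY mobile_events
        FROM 's3://my-bucket/manifests/mobile.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        GZIP
        MANIFEST;
    """,
)
```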
Question 29 of 35
29. Question
Your company, Super Data Retention Unlimited, has taken on a new customer that needs to retain log files from their application, which falls under strict regulatory requirements. These log files are not easily reproduced and will not be replicated elsewhere. The files will be frequently accessed for 2 months for application analysis, then infrequently for an additional 10 months for wider analytics purposes, and then must be archived for an additional 6 years. Which option satisfies these requirements with the least amount of development effort?
Correct
This is the best option that meets all the requirements with as little development effort as possible. Because the data is not easily reproduced and is subject to regulatory requirements, there would be significant repercussions should the data be lost, which rules out a storage class with lower availability that keeps data in a single Availability Zone, like One Zone-Infrequent Access.
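One way to express this access pattern is an S3 lifecycle configuration, sketched below. The bucket name, exact day counts, archive storage class, and the final expiration rule are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Transitions mirror the stated pattern: frequent access for ~2 months,
# infrequent access for ~10 more, then archived for ~6 years.
s3.put_bucket_lifecycle_configuration(
    Bucket="customer-app-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "log-retention",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 60, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
                # Only include expiration if deletion after the archive period is permitted.
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```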
Question 30 of 35
30. Question
You work for a movie theater organization that is integrating a new concession system. The movie theaters will be spread across the globe, showing movies in hundreds of different languages. The new concession system needs to be able to handle users at any time of the day or night. The amount of concession purchases spikes during certain times of the day and night, so the collected data volume fluctuates. The data that is stored for concession purchases, items, and prices needs to be delivered at low latency and high throughput no matter the size of the data; however, the data is typically small in size. What storage option is the best solution for the new concession system?
Correct
DynamoDB scales horizontally and allows applications to deliver data at single-digit millisecond latency at large scale. DynamoDB also offers global tables for multi-Region replication that can be used for your global application. Global Tables: Multi-Region Replication with DynamoDB Amazon DynamoDB FAQs
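A minimal sketch of such a table with a replica Region added to form a global table; the table name, key schema, Regions, and on-demand billing choice are assumptions for illustration.

```python
import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")

# Hypothetical concessions table; on-demand capacity absorbs the purchase spikes.
ddb.create_table(
    TableName="ConcessionPurchases",
    AttributeDefinitions=[{"AttributeName": "purchase_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "purchase_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
    StreamSpecification={"StreamEnabled": True, "StreamViewType": "NEW_AND_OLD_IMAGES"},
)
ddb.get_waiter("table_exists").wait(TableName="ConcessionPurchases")

# Adding a replica Region turns the table into a global table (version 2019.11.21).
ddb.update_table(
    TableName="ConcessionPurchases",
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
)
```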
Question 31 of 35
31. Question
You work for an organization that heavily utilizes QuickSight as their Business Intelligence tool. The latest project you have been asked to join is looking for ways to automate the process of building dashboards and updating them based on fresh data and data transformations. The goal of your project is to build an automated BI dashboard that customers can use to gain insights about their data with minimal development involved. The dashboards should be as up-to-date as possible, using the most current and recent data. Which of the following solutions will satisfy the requirements with the least amount of development effort?
Correct
You can use the CreateIngestion API operation to create and start a new SPICE ingestion on the dataset, which will refresh the dashboard with the most up-to-date data. CreateIngestion
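A minimal sketch of triggering a SPICE refresh with boto3; the account ID, dataset ID, and ingestion ID format are hypothetical.

```python
import time
import boto3

quicksight = boto3.client("quicksight")

# Hypothetical account and dataset identifiers.
quicksight.create_ingestion(
    AwsAccountId="123456789012",
    DataSetId="sales-dashboard-dataset",
    IngestionId=f"refresh-{int(time.time())}",  # must be unique per ingestion
)
```

This call could be scheduled (for example, from a Lambda function on an EventBridge rule) so the dashboard always reflects recent data.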
Question 32 of 35
32. Question
Your company has been hired to create a search and analytics system for Percival‘s Peculiar Pickles, a site where people post and discuss pictures of peculiar pickles. The solution should provide a REST API interface, enable deep text search capabilities, and be able to generate visualizations of the data stored in the system. Which solution will meet these requirements with minimal development effort?
Correct
Kinesis Data Firehose is able to deliver records to Amazon Elasticsearch Service with no additional development needed. Elasticsearch provides a REST API, which satisfies the rest of the requirements.
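For illustration, a sketch of creating such a delivery stream with boto3; the stream name, ARNs, index name, and backup settings are hypothetical placeholders.

```python
import boto3

firehose = boto3.client("firehose")

# The delivery role needs write access to the domain and the backup bucket.
firehose.create_delivery_stream(
    DeliveryStreamName="pickle-posts",
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/pickle-search",
        "IndexName": "posts",
        "IndexRotationPeriod": "OneDay",
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
            "BucketARN": "arn:aws:s3:::pickle-firehose-backup",
        },
    },
)
```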
Question 33 of 35
33. Question
You work as a data scientist for a new startup in the rapidly growing field of health and telemedicine. The organization‘s health data needs to be stored in a data warehousing solution with an initial data load of around 1,000 GB. You also expect rapid data growth due to the growing demand for telemedicine services. You‘ve been tasked with coming up with a data warehousing solution to host the health data that also allows for daily and weekly visualizations regarding global and regional health statistics. These visualizations will help determine health funding from the government and private lenders. It‘s important that the data warehousing solution be able to scale and that compute and managed storage are billed independently. Which of the data warehousing solutions would you suggest that allows for a simple and cost-effective approach?
Correct
Amazon Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and Amazon S3 for longer-term durable storage. If the data in a node grows beyond the size of the large local SSDs, Amazon Redshift managed storage automatically offloads that data to Amazon S3. You pay the same low rate for Amazon Redshift managed storage regardless of whether the data sits in high-performance SSDs or Amazon S3. For workloads that require ever-growing storage, managed storage lets you automatically scale your data warehouse storage capacity without adding and paying for additional nodes. Amazon Redshift Clusters
RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. With RA3, you choose the number of nodes based on your performance requirements and only pay for the managed storage that you use. Size your RA3 cluster based on the amount of data you process daily. You launch clusters that use the RA3 node types in a virtual private cloud (VPC). You can‘t launch RA3 clusters in EC2-Classic.
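A minimal sketch of provisioning an RA3-based cluster with boto3; the cluster identifier, node type, node count, and credentials are hypothetical and the password is a placeholder only.

```python
import boto3

redshift = boto3.client("redshift")

# RA3 managed storage grows independently of the compute nodes you pay for,
# so the node count can be sized to performance rather than total data volume.
redshift.create_cluster(
    ClusterIdentifier="health-warehouse",
    NodeType="ra3.4xlarge",
    NumberOfNodes=2,
    ClusterType="multi-node",
    DBName="health",
    MasterUsername="awsuser",
    MasterUserPassword="ExamplePassw0rd",  # placeholder; manage secrets properly in practice
)
```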
Question 34 of 35
34. Question
A global wildlife research group has been collecting a huge amount of data in regionally located Redshift clusters. While planning for the next increase in storage capacity for their cluster, there was significant pushback regarding increased cost. At least 3/4 of the data being stored in Redshift is only accessed 4 times a year to generate reports that are delayed one quarter and do not include the most recent quarter‘s data. The leadership of the research group has requested a solution that will continue generating reports from SQL queries, as well as charts and graphs generated from the data with QuickSight. Which of the following is the lowest cost solution?
Correct
Because the most recent data is not needed for these reports and the cold data is not used for live application operations, it is viable to move the cold data outside of the Redshift ecosystem.
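As one possible illustration of moving cold data out (an assumption here, since the selected answer text isn't shown), older data could be unloaded to S3 as Parquet, where it can still be queried with SQL through Redshift Spectrum or Athena and visualized in QuickSight. Cluster, table, bucket, and role names below are hypothetical.

```python
import boto3

rsd = boto3.client("redshift-data")

# Push data older than the last quarter out to S3 in a columnar format.
rsd.execute_statement(
    ClusterIdentifier="wildlife-cluster",
    Database="research",
    DbUser="awsuser",
    Sql="""
        UNLOAD ('SELECT * FROM observations WHERE observed_at < DATEADD(month, -3, GETDATE())')
        TO 's3://wildlife-cold-data/observations/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        FORMAT AS PARQUET;
    """,
)
```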
Question 35 of 35
35. Question
Your team has several rapidly growing applications that run in AWS. Each of the applications runs in a different AWS account and produces large amounts of data by capturing network traffic with VPC Flow Logs. The networking team has a goal of analyzing all 350 GB of flow log data to produce a report for the security team. The data needs to be stored forever, but only the last 6 months of data needs to be analyzed for a bi-yearly report. Which of these is the most cost-effective solution that meets all of the requirements?
Correct
You can use Lambda to send data to your Amazon ES domain from Amazon S3. New data that arrives in an S3 bucket triggers an event notification to Lambda, which then runs your custom code to perform the indexing. This solution also keeps the data stored in S3 forever, and indexes can be dropped for data older than 6 months. Loading Streaming Data into Amazon Elasticsearch Service
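A minimal sketch of such an indexing Lambda, in the spirit of the "Loading Streaming Data into Amazon Elasticsearch Service" pattern. The domain endpoint, index name, and document shape are hypothetical; requests and requests_aws4auth must be bundled with the deployment package, and decompression of gzipped flow log objects is omitted for brevity.

```python
import boto3
import requests
from requests_aws4auth import AWS4Auth

region = "us-east-1"
creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(creds.access_key, creds.secret_key, region, "es",
                   session_token=creds.token)

# Hypothetical domain endpoint and monthly index name.
es_url = "https://search-flowlogs-xxxx.us-east-1.es.amazonaws.com/flowlogs-2023-05/_doc"

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by S3 event notifications for newly delivered flow log files."""
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for line in body.splitlines():
            requests.post(es_url, auth=awsauth, json={"raw": line},
                          headers={"Content-Type": "application/json"})
```

Dropping indexes older than 6 months (for example, with a scheduled cleanup) keeps the Elasticsearch domain small while S3 retains the full history.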
The End of Exam. SkillCertPro wishes you all the best for your exam.