Analysis and Visualization
QUESTION 1
You work as a data engineer for a large health agency that runs data analytics on world health data.
Currently, there are large datasets of world health data in S3 that are not accessible over the internet. You have
been tasked with setting up a QuickSight account that will enable you to build dashboards from the data in
S3 without moving the data over the public internet. Which of these methods meets these requirements?
ANS: Set up a QuickSight VPC connection and a VPC endpoint for S3 to allow QuickSight private access to S3 world
health data.
QUESTION 2
You work for an organization that uses legacy Microsoft applications to run the day-to-day services, as well
as the authentication mechanisms. Currently, all employees are authenticated into applications using AWS
Managed Microsoft AD in us-west-2. You have recently set up a QuickSight account in us-east-1 that you
need teammates to authenticate into, so they can run data analytics tasks. Your teammates are not able to
authenticate into the QuickSight account. Which of the following is the cause for the issue and what are the
possible solutions?
ANS:
Use the Enterprise edition for the QuickSight account.
Ensure Active Directory is the identity provider for QuickSight and associate your AD groups with Amazon
QuickSight.
QUESTION 3
You work as a data scientist for an organization that builds videos for university students who use them in
place of classroom settings. Each video has a rating system that is positive or negative, which is determined
by the students who view the content. Some of the ratings appear to come from bots that are flooding the
platform with massive amounts of negative feedback. You've been tasked with creating real-time
visualizations for these outliers to bring to the department heads. You have a large dataset of historical data,
as well as the streaming data from current student viewing metrics. Which of the following provides the most
cost-effective way to visualize these outliers?
ANS: Use the anomaly detection feature in QuickSight to detect outliers.
QUESTION 4
A geologic study group has installed thousands of IoT sensors across the globe to measure various soil
attributes. Each sensor delivers its data to a prefix in a single S3 bucket in Parquet-formatted files. The study
group would like to be able to query this data using SQL and avoid any data processing in order to minimize
their costs. Which of the following is the best solution?
ANS: Configure Athena tables to make the data queryable and provide the appropriate access to team
members via IAM policy.
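A minimal sketch of the kind of table definition this answer implies. The bucket, prefix, table name, and columns are hypothetical placeholders, not details from the question. The helper only builds the DDL text, which would then be run through the Athena console or API.

```python
def build_parquet_table_ddl(table, columns, location):
    """Return an Athena CREATE EXTERNAL TABLE statement for Parquet data in S3.

    All arguments are hypothetical placeholders chosen for illustration.
    """
    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_sql}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


ddl = build_parquet_table_ddl(
    "soil_readings",
    [("sensor_id", "string"), ("moisture", "double"), ("reading_time", "timestamp")],
    "s3://example-soil-bucket/readings/",
)
print(ddl)
```

Because the table is external, Athena reads the Parquet files in place, so no load or transformation step is involved.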
QUESTION 5
Goofy Goober Giraffe Rentals has compiled their operational data, against which they would like to perform
simple SQL-based expressions without having to retrieve entire objects. This data has been collected in an
Amazon S3 bucket and is stored in CSV format. Queries against this data will be read-only. What is the
cheapest solution to make this data queryable?
ANS: Use the S3 Select API call to query the data in place.
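As a rough illustration, an S3 Select request through boto3 might look like the sketch below. The bucket, key, and column names are assumptions made up for the example, and the CSV file is assumed to have a header row.

```python
def build_s3_select_request(bucket, key, expression):
    """Build keyword arguments for the S3 SelectObjectContent API (CSV in, CSV out)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Treat the first CSV row as column names so they can be used in SQL.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def run_s3_select(request):
    """Execute the request with boto3 and return the matching rows as text."""
    import boto3  # imported lazily so the builder works without AWS access

    response = boto3.client("s3").select_object_content(**request)
    chunks = []
    for event in response["Payload"]:
        # The response is an event stream; only Records events carry row data.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


request = build_s3_select_request(
    "example-rentals-bucket",   # hypothetical bucket
    "operations/rentals.csv",   # hypothetical key
    "SELECT s.giraffe_id, s.rental_days FROM S3Object s WHERE s.status = 'returned'",
)
```

Only the rows and columns matched by the expression are returned, which is what avoids retrieving the entire object.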
QUESTION 6
Stupendous Fantasy Football League would like to create near real-time scoreboards for all games being
played on any given day. They have an existing Kinesis Data Firehose which ingests all relevant statistics
about each game as it is being played, but they would like to be able to extract just score data on the fly. Data
is currently being delivered to a Redshift cluster, but they would like score data to be stored and updated in a
DynamoDB table. This table will then function as the datastore for the live scoreboard on their web
application. Which of the following is the best way to accomplish this?
ANS: Utilize a Kinesis Data Analytics Application to filter out just the score data from the data stream. Send
the data to a Lambda function, which then inserts/updates the data in the DynamoDB table.
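The Lambda half of this answer could be sketched roughly as follows. It assumes the analytics application delivers base64-encoded JSON records and expects a per-record `Ok` or `DeliveryFailed` result in return. The table name `LiveScores` and the fields `game_id`, `home_score`, and `away_score` are hypothetical.

```python
import base64
import json


def handle_records(event, table):
    """Write each score record to DynamoDB and report per-record delivery status.

    `table` is anything with a put_item(Item=...) method, such as a boto3
    DynamoDB Table resource. The field names are hypothetical.
    """
    results = []
    for record in event["records"]:
        try:
            score = json.loads(base64.b64decode(record["data"]))
            # put_item overwrites any existing item with the same key,
            # which gives insert-or-update behaviour for the scoreboard.
            table.put_item(Item={
                "game_id": score["game_id"],
                "home_score": int(score["home_score"]),
                "away_score": int(score["away_score"]),
            })
            results.append({"recordId": record["recordId"], "result": "Ok"})
        except Exception:
            results.append({"recordId": record["recordId"], "result": "DeliveryFailed"})
    return {"records": results}


def lambda_handler(event, context):
    import boto3  # lazy import; only needed when running inside Lambda

    table = boto3.resource("dynamodb").Table("LiveScores")  # hypothetical table name
    return handle_records(event, table)
```

Keeping the record handling separate from the boto3 setup makes the logic easy to exercise locally with a stand-in table object.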
QUESTION 7
You work for a large computer hardware organization that has many different IT stores across the world. The
computer parts, order details, shipping details, customer, and salesperson information are stored in a data lake
in S3. You have been tasked with developing a visualization to show the amount of hardware that was
shipped out by various stores and the salesperson who sold the hardware. You have a requirement that the
visualization must be able to apply statistical functions, as well as cluster columns and rows to show values
for subcategories grouped by a related dimension. Which type of visualization would meet these requirements?
ANS: Pivot table
QUESTION 8
You work as a data engineer for a large health agency that runs data analytics on water treatment data.
Currently, there are large datasets of water treatment data in an S3 bucket in us-east-1. You have been tasked
with setting up a QuickSight account in us-west-2 that will enable you to build dashboards from the data in
S3. All of the traffic between S3 and QuickSight must happen within a VPC so the security team can monitor
the VPC Flow Logs and CloudTrail logs for security audits. Which of the following actions must be taken in
order for QuickSight to access the data in S3?
ANS:
Set up a QuickSight VPC connection in us-west-2.
Move the data to a bucket in us-west-2 and create a VPC endpoint so the traffic only flows through
the VPC.
QUESTION 9
You work for a large data warehousing company that is constantly running large scale processing jobs for
customers. Every team has the freedom to use whichever EMR cluster configuration they need to accomplish
a particular task, but the solution must be cost optimized. The latest contract requires a very large EMR
cluster to be used throughout the year to process ML data and statistical functions. During a few months out
of the year, the processing will be massive and, during other months, it will be minimal. To contend with
this, your team uses a combination of on-demand and spot instances for the EMR cluster nodes, which is
estimated to be around 40 core and task nodes. The team also varies the instance types to handle different
workload types; for example, GPU-intensive ML processes will use g3 instance types and storage optimized
processes will use i2 instance types. Which type of EMR cluster solution would need to be set up to meet the
requirements for the new contract?
ANS: Utilize instance fleet configurations when creating the EMR cluster.
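For illustration, the fleet definition passed to EMR (for example as `Instances={"InstanceFleets": ...}` in boto3's `run_job_flow`) might be assembled as below. The 10 On-Demand / 30 Spot split of the roughly 40 nodes, the specific instance sizes, and the master instance type are assumptions made for this sketch only.

```python
def build_instance_fleets(core_types, task_types, on_demand_units, spot_units):
    """Build an EMR InstanceFleets list mixing On-Demand core and Spot task capacity."""

    def type_configs(instance_types):
        # Each instance counts as one unit toward the fleet's target capacity.
        return [{"InstanceType": t, "WeightedCapacity": 1} for t in instance_types]

    return [
        {
            "Name": "master",
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": type_configs(["m5.xlarge"]),  # hypothetical choice
        },
        {
            "Name": "core",
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": on_demand_units,
            "InstanceTypeConfigs": type_configs(core_types),
        },
        {
            "Name": "task",
            "InstanceFleetType": "TASK",
            "TargetSpotCapacity": spot_units,
            "InstanceTypeConfigs": type_configs(task_types),
        },
    ]


fleets = build_instance_fleets(
    core_types=["i2.2xlarge", "i2.4xlarge"],   # storage-optimized, sizes assumed
    task_types=["g3.4xlarge", "g3.8xlarge"],   # GPU, sizes assumed
    on_demand_units=10,
    spot_units=30,
)
```

Listing several instance types per fleet is what lets EMR mix purchasing options and instance families within one cluster.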
QUESTION 10
Your manager has come to you and announced that he's tired of justifying server budgets for your analytics
pipelines and has tasked you with converting them to a serverless solution. Your existing pipelines use
ingestion/ETL scripts that send data to a Kafka solution, all running on EC2. The data is then collected by
another set of workers running custom scripts to send the data to Redshift and Elasticsearch to provide an
SQL interface for analytics queries, as well as a Kibana instance for generating visual representations of the
data. What is the most efficient way to rebuild this pipeline with available serverless services?
ANS: Replace the ingestion/ETL scripts running on EC2 with API Gateway and Lambda functions. Replace Kafka on
EC2 with Kinesis Firehose with S3 as the destination. Set up Athena to provide an SQL interface for analytics queries,
and utilize QuickSight to generate visualizations of the data stored in S3.
QUESTION 11
You are designing a Kinesis Client Library (KCL) application that reads data from a Kinesis Data Stream
and immediately writes the data to S3. The data is batched into 15-second intervals and sent off to S3. The
batching interval is a regulatory requirement set by the team who owns the data and cannot be changed. You
are using Athena to query the results in S3, but notice over time that the query results are taking longer and
longer to process. Which of the following is the BEST solution to improve query speeds?
ANS: Create an EMR cluster to run the S3DistCp command to combine smaller files into larger objects.
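To make the small-file problem concrete: one writer flushing every 15 seconds produces 5,760 objects per day. The sketch below computes that figure and builds an EMR step that runs S3DistCp to merge objects. The bucket paths, the key naming implied by the `--groupBy` pattern, and the 128 MiB target size are hypothetical.

```python
# One object every 15 seconds, per writer, over 24 hours.
OBJECTS_PER_DAY = 24 * 60 * 60 // 15


def build_s3distcp_step(src, dest, group_by_regex, target_size_mib):
    """Build an EMR step definition that runs S3DistCp to concatenate small objects."""
    return {
        "Name": "Compact small objects",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                f"--src={src}",
                f"--dest={dest}",
                # Objects whose keys capture the same group are merged together.
                f"--groupBy={group_by_regex}",
                # Approximate size of each merged output file, in MiB.
                f"--targetSize={target_size_mib}",
            ],
        },
    }


step = build_s3distcp_step(
    "s3://example-raw/stream/",            # hypothetical source prefix
    "s3://example-compacted/stream/",      # hypothetical destination prefix
    r".*/(\d{4}-\d{2}-\d{2})T.*",          # assumes keys contain an ISO date
    128,
)
print(OBJECTS_PER_DAY)
```

A step like this could be submitted with the EMR `add_job_flow_steps` API, after which Athena would be pointed at the compacted prefix.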
QUESTION 12
You work for a company that is currently using a private Redshift as their data warehousing solution running
inside a VPC. You have been tasked with producing a dashboard for sales and KPI data that is stored in the
Redshift cluster. You have decided to use QuickSight as the visualization and BI tool to create these
dashboards. Which of the following must be done to enable access and create the dashboards?
ANS: Create a security group that contains an inbound rule authorizing access from the appropriate IP address range
for the region where the QuickSight servers are located.
QUESTION 13
You work for a stock trading company that runs daily ad-hoc queries on data using Athena. There are
multiple silos within the company using Athena to run trading queries specific to their team. The finance
department has a requirement to limit the amount of money being spent by each team for the queries that
they run in Athena. The security department has a requirement to enforce all query results be encrypted.
Which solution could be implemented that would meet both of these requirements?
ANS: Use Athena Workgroups to assign a unique workgroup to each silo, tagging them appropriately. Configure the
workgroup to encrypt the query results. Generate cost reports from the tags, and use resource-based policies that
assign each workgroup to a silo.
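A sketch of what one such workgroup could look like as arguments to boto3's Athena `create_work_group` call. The team name, results bucket, and tag key are hypothetical. The per-query scan limit is an optional extra shown here as an assumption and is not part of the answer above.

```python
def build_workgroup_request(team, results_bucket, bytes_scanned_cutoff):
    """Build create_work_group arguments for a tagged, encrypted, per-team workgroup."""
    return {
        "Name": f"{team}-workgroup",
        "Configuration": {
            "ResultConfiguration": {
                "OutputLocation": f"s3://{results_bucket}/{team}/",
                "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
            },
            # Prevent clients from overriding the encryption and output settings.
            "EnforceWorkGroupConfiguration": True,
            "PublishCloudWatchMetricsEnabled": True,
            # Optional assumption: cancel any query that scans more than this many bytes.
            "BytesScannedCutoffPerQuery": bytes_scanned_cutoff,
        },
        # The tag can be activated as a cost allocation tag for per-team reports.
        "Tags": [{"Key": "team", "Value": team}],
    }


request = build_workgroup_request(
    "equities",                  # hypothetical silo name
    "example-athena-results",    # hypothetical bucket
    10 * 1024 ** 3,              # 10 GiB per query, an arbitrary example limit
)
```

The dictionary would be passed as `athena.create_work_group(**request)`, once per silo.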
QUESTION 14
This Is The Way Mercantile has hired your analytics consulting firm to build a solution for their online store.
They need to be able to produce analytics reports from all of the data in their AWS account. Unfortunately,
during each update of their storefront, they have changed the backend datastores. The data that needs to be
analyzed is stored in MySQL databases spread over EC2 and RDS, DynamoDB, as well as archives in S3.
They do not want to perform a full ETL, but instead would prefer to have a single SQL endpoint that is able
to query all of the existing data, achieved with the least operational overhead. How can they accomplish this?
ANS: Deploy Amazon Athena Data Source connectors for MySQL (EC2 and RDS) and DynamoDB. Use
Athena federated query to query the disparate data from a single endpoint.
QUESTION 15
You've been provided with a list of highly structured, normalized data stored in disparate relational
databases. This data needs to be combined to enable business intelligence tooling and analytics queries. The
solution will be used frequently. Speed and cost are equally important. Which of the following will provide
the best data repository given the access requirements?
ANS: Utilize Glue to catalog and ETL the data into a Redshift data warehouse and provide the appropriate
teams access to the Redshift cluster endpoint.
QUESTION 16
Your company is looking to reduce the cost of their Business Intelligence applications. Currently, all data is
stored in a Redshift cluster, which has grown exponentially with the increase in sales. Additionally, the
bespoke visualizations for quarterly reports are incredibly cumbersome to generate by hand. What steps can
be taken to reduce the cost of business intelligence workflow, while keeping all data available for generating
reports from time to time?
ANS: Store data no longer being actively utilized in an S3 bucket using the Standard Infrequent Access
storage class. Create a Redshift Spectrum table to access this data and join it with warm data in the Redshift
cluster for reporting. Utilize QuickSight to create the appropriate charts and graphs to accompany the BI
reports.
QUESTION 17
As Herbert's Hyper Hot Chillies has expanded their hot pepper and spice sales to the global market, they've
accumulated a significant number of S3-backed data lakes in multiple AWS accounts across multiple
regions. They would like to produce some Business Intelligence visualizations that combine data from all of
these sources. How can they do this with minimal cost and development effort?
ANS: Ensure that all data sources are configured with the appropriate permissions to provide QuickSight
access. Configure QuickSight to access the S3 data in the various regions and accounts.
QUESTION 18
Word Origin Resource Detection Society has scanned thousands of books and converted them into
JSON-formatted files that include the text of the books and all relevant metadata. They need to be able to search
these JSON documents for specific words to perform analysis to aid linguistic research. The solution should
provide an API interface, and perform rich search functions. Which of the following is the best solution?
ANS: Load the JSON documents into an Elasticsearch Index, and provide the team with access to the
Elasticsearch API and Kibana interface.
QUESTION 19
You work for a small car dealership company that has many different locations across the country. The sales
staff alternate between the different dealership locations, since some of the locations are in more densely populated
areas and generate more profits. You have been tasked with developing a visualization to show the amount
of sales for each salesperson at the different dealerships and identify any trends and outliers. Which type of
visualization would meet these requirements?
ANS: Heat map
QUESTION 20
You work for a large organization that uses Redshift as their data warehousing solution. The members of the
HR department run simple ad-hoc queries that take very little time and resources to execute. The members of
the engineering team run complex queries that use multiple joins and usually take a long time to run, as they
are quite memory intensive. The HR department is complaining that their simple queries are getting stuck
in queues behind long-running queries by the engineering team. Which of the following solutions could
resolve this issue in the most cost-effective manner?
ANS:
Utilize workload management (WLM) in Redshift to manage query queues.
Configure 2 query queues, 1 for each department. Set the number of queries that can run in each of
the queues to 10.
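As an illustration, a manual WLM configuration with one queue per department could be generated like this. The user group names `hr` and `engineering` and the default queue's concurrency of 5 are assumptions. The resulting JSON string is the value for the `wlm_json_configuration` parameter of the cluster's parameter group.

```python
import json


def build_wlm_configuration(concurrency=10):
    """Return a wlm_json_configuration value with one queue per department."""
    queues = [
        # Queries from members of the hr user group are routed to this queue.
        {"user_group": ["hr"], "query_concurrency": concurrency},
        # Queries from members of the engineering user group are routed here.
        {"user_group": ["engineering"], "query_concurrency": concurrency},
        # The last entry is the default queue for everything else.
        {"query_concurrency": 5},
    ]
    return json.dumps(queues)


wlm_json = build_wlm_configuration()
print(wlm_json)
```

With separate queues, short HR queries no longer wait behind the engineering team's long-running joins, and no additional cluster capacity is required.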
QUESTION 21
Gustof's Training Emporium is looking to combine the data from all of their testing centers spread around
the world. Each testing center is storing data in their own RDS database and they're planning to utilize Glue
to perform ETL and combine the disparate data. They need to be able to run SQL-based queries against this
data and create visualizations. What is the lowest cost solution to accomplish this?
ANS: Use Glue Crawlers to crawl the data, and utilize Glue Jobs to perform ETL into an S3 bucket in the
Parquet format. Configure Athena as an SQL endpoint for the data, and configure QuickSight to use Athena
as its data source to create visualizations.