Final project on data lakes with AWS
• IoT sensors sending real-time data, represented by the "Data Stream" logo.
• A database with historical records, represented by the "Multimedia" logo.
• Additional data from third-party entities to enrich the internally generated data, represented
by the "Database" logo.
• The data stream is going to be managed by Amazon Kinesis Data Firehose. This service will
continuously prepare and load the data into our storage.
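To make the ingestion step concrete, here is a minimal pure-Python sketch of how incoming sensor readings could be grouped for Firehose's PutRecordBatch API, which accepts at most 500 records per call. The record format and field names are assumptions for illustration; a real producer would send each batch with the boto3 Firehose client.

```python
# Hypothetical sketch: batching sensor readings for Firehose's
# PutRecordBatch API, which accepts at most 500 records per call.
import json

MAX_BATCH_RECORDS = 500  # PutRecordBatch per-call record limit

def to_batches(readings, batch_size=MAX_BATCH_RECORDS):
    """Serialize readings and group them into Firehose-sized batches."""
    records = [{"Data": (json.dumps(r) + "\n").encode()} for r in readings]
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

# Example: 1,200 readings become three batches (500 + 500 + 200).
batches = to_batches([{"sensor": i, "temp": 20.0} for i in range(1200)])
```

Newline-delimited JSON is used here because it keeps the objects Firehose delivers to S3 easy to query later.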
• Kinesis Video Streams is adapted for multimedia. It ingests, durably stores, encrypts,
and indexes video streams for real-time and batch analytics.
• Snowcone is rugged, secure, and designed for use outside of a traditional data center. Its
compact size makes it ideal for confined spaces or when portability is a necessity. You can use
Snowcone in the backpacks of first responders or for IoT, vehicles, and even drones. You can
run edge computing applications, and you can ship the device with data to AWS for offline
data transfer, or you can transfer data online with AWS DataSync from edge locations.
• As suggested in the scenario, the storage that we are going to use is Amazon S3, and
indeed that's a good choice. Amazon Simple Storage Service (Amazon S3) is an object storage service
that offers scalable capacity, data availability, top-tier security, and performance. Customers
of all sizes and across all industries can store and protect any amount of data for nearly all
use cases, such as data lakes and cloud-native and mobile applications. With cost-effective
storage classes and user-friendly management features, you can optimize costs, organize
data, and configure precise access controls to meet specific operational, organizational, and
compliance requirements.
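One way to "organize data and configure precise access controls" in practice is to lay out the bucket with Hive-style partition keys, which Glue and Athena can use to prune partitions. The sketch below builds such a key; the `raw/` prefix and source names are assumptions for illustration.

```python
# Hypothetical sketch: a Hive-style partitioned key layout for the
# data lake's S3 bucket, so Glue and Athena can prune partitions.
from datetime import datetime, timezone

def object_key(source: str, event_time: datetime, filename: str) -> str:
    """Build a key like raw/iot/year=2024/month=05/day=17/part-0001.json."""
    return (
        f"raw/{source}/"
        f"year={event_time.year}/month={event_time.month:02d}/day={event_time.day:02d}/"
        f"{filename}"
    )

key = object_key("iot", datetime(2024, 5, 17, tzinfo=timezone.utc), "part-0001.json")
```

A consistent prefix scheme also makes it easy to scope IAM policies to a single zone or source.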
• In the data lake, we are also using AWS Glue. It helps us extract data from sources,
transform it, and load it into targets by running a script.
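A real Glue job script would use the awsglue library (GlueContext, DynamicFrames); the pure-Python sketch below only illustrates the transform step of such a script, with raw and target field names that are assumptions.

```python
# Hypothetical sketch of the transform step a Glue script performs:
# normalize raw records before loading them into the curated zone.
def transform(raw_records):
    """Keep complete records and map raw field names to the lake schema."""
    out = []
    for rec in raw_records:
        if rec.get("temp") is None:          # drop incomplete readings
            continue
        out.append({
            "sensor_id": rec["id"],          # rename to the target schema
            "temperature_c": float(rec["temp"]),
        })
    return out

rows = transform([{"id": 1, "temp": "21.5"}, {"id": 2, "temp": None}])
```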
For the consumption part, we are going to use the following services:
• The first one is Amazon EMR, which is suggested because the client wants to use
Hadoop. Indeed, Amazon EMR is a managed cluster platform that simplifies running big data
frameworks like Apache Hadoop and Apache Spark on AWS to process and analyze large
amounts of data. By using these frameworks and associated open-source projects like Apache
Hive and Apache Pig, you can process data for analytical and business intelligence workloads.
Additionally, you can use Amazon EMR to transform and move large amounts of data to and
from other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon
S3) and Amazon DynamoDB.
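Since the client wants Hadoop, it is worth recalling the MapReduce model that Hadoop on EMR parallelizes across the cluster. The toy word count below runs the same map → shuffle → reduce phases in a single process, purely to illustrate the programming model.

```python
# Hypothetical sketch of the MapReduce model that Hadoop (and EMR)
# distributes: a map phase emits key/value pairs, a shuffle groups
# them by key, and a reduce phase aggregates each group.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    return key, sum(values)

def mapreduce(lines):
    groups = defaultdict(list)               # shuffle: group pairs by key
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = mapreduce(["big data", "big clusters"])
# counts == {"big": 2, "data": 1, "clusters": 1}
```

On EMR the map and reduce phases run on many nodes against data in HDFS or S3; the structure of the program stays the same.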
• The second one is Amazon Athena. Amazon Athena is an interactive query service that makes
it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard
SQL. In just a few steps in the AWS Management Console, you can point Athena to your data
stored in Amazon S3 and start using standard SQL to run ad-hoc queries and get results in
seconds.
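As a concrete illustration, here are the parameters such an ad-hoc query would take through Athena's StartQueryExecution API. The table, database, and results-bucket names are assumptions; a real call would pass this dict to boto3's Athena client as `client.start_query_execution(**params)`.

```python
# Hypothetical sketch: parameters for an Athena StartQueryExecution
# call. Table, database, and bucket names are made up for illustration.
params = {
    "QueryString": (
        "SELECT sensor_id, avg(temperature_c) AS avg_temp "
        "FROM sensor_readings "
        "WHERE year = '2024' "
        "GROUP BY sensor_id"
    ),
    "QueryExecutionContext": {"Database": "datalake"},
    "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
}
```

Note how the WHERE clause filters on the partition column, so Athena only scans the matching S3 prefixes.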
• The third one is Amazon QuickSight for the visualization of the data. Amazon QuickSight is an
ultra-fast and user-friendly cloud-based business analytics service that enables all employees
in an organization to quickly create visualizations, perform ad-hoc analysis, and gain market
insights from their data, anywhere, on any device. Load CSV and Excel files, connect to SaaS
applications like Salesforce, access on-premises databases such as SQL Server, MySQL, and
PostgreSQL, and seamlessly discover your AWS data sources such as Amazon Redshift,
Amazon RDS, Amazon Aurora, Amazon Athena, and Amazon S3. QuickSight allows
organizations to scale their business analytics capabilities for hundreds of thousands of users
and deliver fast, responsive query performance using a robust in-memory engine (SPICE).
• The last one is Amazon Redshift. Amazon Redshift simplifies and cost-effectively enables high-
performance querying of petabytes of structured data, allowing you to create powerful
reports and dashboards using your existing business intelligence tools.
For the governance and security part, we are going to use these services:
• The first one is AWS Lake Formation. AWS Lake Formation makes it easy to create secure
data lakes, making data available for large-scale analytics.
• The second one is AWS CloudTrail which can monitor and record account activity across your
entire AWS infrastructure, giving you control over storage, analysis, and corrective actions.
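To show what "analysis and corrective actions" can start from, the sketch below pulls the fields an audit check would typically inspect out of a CloudTrail event record. The event itself is minimal and made up; real records carry many more fields.

```python
# Hypothetical sketch: extracting audit-relevant fields from a
# (minimal, made-up) CloudTrail event record.
import json

event_json = """{
  "eventTime": "2024-05-17T12:00:00Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "PutObject",
  "userIdentity": {"type": "IAMUser", "userName": "etl-job"}
}"""

event = json.loads(event_json)
summary = (event["userIdentity"]["userName"], event["eventName"])
# summary == ("etl-job", "PutObject")
```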