Report Mohit
I would like to express my sincere gratitude to these dignitaries, who are with us in
of Computer Science and Engineering for paving a path for me and assisting
I would also like to express my sincere thanks to my friends and family for their
This whitepaper is designed to help architects, data scientists, and developers navigate big data
analytics options in the AWS cloud. It provides an overview of relevant AWS services, covering ideal
usage patterns, cost models, performance, durability, availability, scalability, elasticity,
interfaces, and anti-patterns. Key objectives include data integration and centralization to unify
diverse datasets from educational institutions, including student records, faculty information,
academic performance, and administrative data. The paper also emphasizes real-time data processing
to enable near-instant analysis and support quick decision-making for educational administrators
and policymakers. Security and compliance are highlighted, focusing on protecting sensitive
information in line with AICTE regulatory standards.
technical institutions. Additionally, the whitepaper advocates for an analytics and reporting
framework to generate actionable insights, fostering data-driven decisions at institutional and
regulatory levels. Finally, a collaborative data ecosystem is proposed to encourage secure data
exchange and interoperability between educational institutions within the AWS environment, ensuring
an integrated and cooperative approach to data management. The paper concludes with scenarios
demonstrating AWS analytics in
COURSE MODULES
Course objectives:
Illustrate a data pipeline by using AWS services to meet a generalized use case.
Identify the risks and approaches to secure and govern data at each step
Identify scaling considerations and best practices for building pipelines that
Design and build a data collection process while considering constraints such
Code Generation
CodeWhisperer learns from open-source projects and the code it suggests might occasionally
resemble code samples from the training data. With the reference log, you can view references
to code suggestions that are similar to the training data. When such occurrences happen, Code
CodeWhisperer code generation offers many benefits for software development organizations.
By automating repetitive tasks, it optimizes the use of developer time, so developers can focus on
more critical aspects of the project. Additionally, code generation helps mitigate security
vulnerabilities, safeguarding the integrity of the codebase. CodeWhisperer also protects open source
intellectual property by providing the open source reference tracker. CodeWhisperer enhances
code quality and reliability, leading to robust and efficient applications. And it supports an
efficient response to evolving software threats, keeping the codebase up to date with the latest
security practices. CodeWhisperer has the potential to increase development speed, security,
Data Pipeline
Another key characteristic of deriving insights by using your data pipeline is that the process
will almost always be iterative. You have a hypothesis about what you expect to find in the
data, and you need to experiment and see where it takes you. You might develop your
hypothesis by using BI tools to do initial discovery and analysis of data that has already been
collected. You might iterate within a pipeline segment, or you might iterate across the entire
pipeline. For example, in this illustration, the initial iteration (number 1) yielded a result that
wasn't as defined as was desired. Therefore, the data scientist refined the model and
reprocessed the data to get a better result (number 2). After reviewing those results, they
determined that additional data could improve the detail available in their result, so an
additional data source was tapped and ingested through the pipeline to produce the desired
result (number 3). A pipeline often has iterations of storage and processing. For example, after
the external data is ingested into pipeline storage, iterative processing transforms the data into
Chapter-3:
Data Characteristics
So, which of these data stores or data architectures is the best one for your data pipeline?
The reality is that a modern architecture might include all of these elements. The key to a
modern data architecture is to apply the three-pronged strategy that you learned about earlier.
Modernize the technology that you are using. Unify your data sources to create a single source
of truth that can be accessed and used across the organization. And innovate to get higher
The architecture illustrates the following other AWS purpose-built services that integrate
with Amazon S3 and map to each component that was described on the previous slide:
•Amazon OpenSearch Service is a purpose-built data store and search engine that is
•Amazon EMR provides big data processing and simplifies some of the most complex
high-performance applications.
Storage:
Data being ingested into the Amazon S3 data lake arrives at the landing zone, where it is first
cleaned and stored into the raw zone for permanent storage. Because data that is destined for
the data warehouse needs to be highly trusted and conformed to a schema, the data needs to
be processed further. Additional transformations would include applying the schema and
partitioning (structuring) as well as other transformations that are required to make the data
conform to requirements that are established for the trusted zone. Finally, the processing layer
prepares the data for the curated zone by modeling and augmenting it to be joined with other
datasets (enrichment) and then stores the transformed, validated data in the curated layer.
Datasets from the curated layer are ready to be ingested into the data warehouse to make them
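The zone flow described above (landing, raw, trusted, curated) can be pictured as key prefixes inside one bucket. The following is a minimal sketch under that assumption; the prefix layout and the helper names `zone_key` and `next_zone_key` are hypothetical and are not part of any AWS API.

```python
# Zone order as described in the text: landing -> raw -> trusted -> curated.
# Representing each zone as a key prefix is an assumption made for illustration.
ZONES = ["landing", "raw", "trusted", "curated"]


def zone_key(zone: str, dataset: str, filename: str) -> str:
    """Build an object key such as 'raw/tickets/export.json'."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{dataset}/{filename}"


def next_zone_key(key: str) -> str:
    """Return the key an object would have after promotion to the next zone."""
    zone, rest = key.split("/", 1)
    position = ZONES.index(zone)  # raises ValueError for an unknown zone
    if position == len(ZONES) - 1:
        raise ValueError("curated is the final zone")
    return f"{ZONES[position + 1]}/{rest}"


landing_key = zone_key("landing", "tickets", "export.json")
raw_key = next_zone_key(landing_key)
```

In a real pipeline each promotion would be accompanied by the cleaning, schema, or enrichment step that the text assigns to that zone; the sketch only tracks where the object lives.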
Producers ingest records onto the stream. Producers are integrations that collect data from
a source and load it onto the stream. Consumers process records. Consumers read data from
the stream and perform their own processing on it. The stream itself provides a temporary but
durable storage layer for the streaming solution. In the pipeline that is depicted in this slide,
Amazon CloudWatch Events is the producer that puts CloudWatch Events event data onto the
stream. Kinesis Data Streams provides the storage. The data is then available to multiple
consumers.
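The producer, stream, and consumer roles can be illustrated with a small in-memory model. This is only a sketch: `InMemoryStream` and `Consumer` are hypothetical classes written for this example, not the Kinesis Data Streams API, and a real stream also handles sharding, retention, and durability.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStream:
    """Hypothetical stand-in for a stream: an append-only list of records."""
    records: list = field(default_factory=list)

    def put_record(self, data: dict) -> int:
        """Producer side: append a record and return its sequence number."""
        self.records.append(data)
        return len(self.records) - 1

    def get_records(self, start: int) -> list:
        """Consumer side: read every record from position `start` onward."""
        return self.records[start:]


@dataclass
class Consumer:
    name: str
    position: int = 0  # each consumer tracks its own read position

    def poll(self, stream: InMemoryStream) -> list:
        batch = stream.get_records(self.position)
        self.position += len(batch)
        return batch


stream = InMemoryStream()
# The producer puts event data onto the stream (event names are made up).
stream.put_record({"event": "instance-started"})
stream.put_record({"event": "instance-stopped"})

# Two independent consumers read the same records and process them differently.
archiver = Consumer("archiver")
alerter = Consumer("alerter")
archived = archiver.poll(stream)
alerts = [r for r in alerter.poll(stream) if r["event"] == "instance-stopped"]
```

Because each consumer keeps its own position, reading by one consumer does not remove records for the other, which is what makes the data available to multiple consumers.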
Chapter-4:
to describe and provision all of the infrastructure resources in your cloud environment.
CloudFormation creates, updates, and deletes the resources for your applications in environments
called stacks. A stack is a collection of AWS resources that are managed as a single unit.
It simplifies the task of repeatedly and predictably creating groups of related resources that
power your applications. Resources are written in text files by using JSON or YAML format.
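As a minimal sketch of what such a text file can contain, the snippet below builds a one-resource template as a Python dictionary and serializes it to JSON. The logical name `DataLakeBucket` and the description are assumptions chosen for illustration; the snippet only produces the template text and does not create a stack.

```python
import json

# A template describes resources declaratively; CloudFormation would manage
# everything under "Resources" together as one stack.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative stack with a single S3 bucket",
    "Resources": {
        # "DataLakeBucket" is a hypothetical logical name for this example.
        "DataLakeBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        },
    },
}

# Serialize to the JSON text that would be saved as the template file.
template_body = json.dumps(template, indent=2)
print(template_body)
```

The same structure could be written in YAML; JSON is used here only because it needs nothing beyond the standard library.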
Data wrangling: Transforming large amounts of unstructured or structured raw data from multiple
sources with different schemas into a meaningful set of data that has value for downstream
processes or users.
Data Structuring: For the scenario that was described previously, the structuring step includes
exporting a .json file from the customer support ticket system, loading the .json file into Excel,
and letting Excel parse the file. For the mapping step for the supp2 data, the data engineer would
modify the cust num field to match the customer id field in the data warehouse. For this example,
you would perform additional data wrangling steps before compressing the file for upload to the
S3 bucket.
Data Cleaning: It includes:
• Remove unwanted data.
• Fill in missing data values.
• Validate or modify the data types.
• Fix outliers.
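The mapping and cleaning steps above can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the sample records, the key spellings `cust_num` and `customer_id`, the `minutes_to_resolve` and `internal_note` fields, and the one-day cap used to define an outlier.

```python
from statistics import median

# Hypothetical support-ticket records, as they might look after parsing the export.
raw_tickets = [
    {"cust_num": "101", "minutes_to_resolve": "30", "internal_note": "a"},
    {"cust_num": "102", "minutes_to_resolve": None, "internal_note": "b"},
    {"cust_num": "103", "minutes_to_resolve": "45", "internal_note": "c"},
    {"cust_num": "104", "minutes_to_resolve": "9000", "internal_note": "d"},
    {"cust_num": "105", "minutes_to_resolve": "60", "internal_note": "e"},
]

MAX_MINUTES = 1440  # assumed outlier threshold: one day


def clean(records):
    typed = []
    for record in records:
        minutes = record["minutes_to_resolve"]
        typed.append({
            # Mapping step: rename the field to match the warehouse,
            # and validate/modify the data type (string -> int).
            "customer_id": int(record["cust_num"]),
            "minutes_to_resolve": int(minutes) if minutes is not None else None,
            # Remove unwanted data: internal_note is simply not copied.
        })

    # Fill in missing values with the median of the plausible values.
    plausible = [
        t["minutes_to_resolve"] for t in typed
        if t["minutes_to_resolve"] is not None
        and t["minutes_to_resolve"] <= MAX_MINUTES
    ]
    fill_value = median(plausible)
    for t in typed:
        if t["minutes_to_resolve"] is None:
            t["minutes_to_resolve"] = fill_value
        elif t["minutes_to_resolve"] > MAX_MINUTES:
            # Fix outliers by capping them at the threshold.
            t["minutes_to_resolve"] = MAX_MINUTES
    return typed


cleaned = clean(raw_tickets)
```

Median fill and capping are only one possible policy; whether to fill, cap, or drop a record depends on what the downstream users need.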
To generalize the characteristics of batch processing, batch ingestion involves running batch
jobs that query a source, move the resulting dataset or datasets to durable storage in the
pipeline, and then perform whatever transformations are required for the use case. As noted in
the Ingesting and Preparing Data module, this could be just cleaning and minimally formatting
data to put it into the lake. Or, it could be more complex enrichment, augmentation, and
processing to support complex querying or big data and machine learning (ML) applications.
Traditional extract, transform, and load (ETL) uses batch processing, but extract, load, and
The process of transporting data from one or more sources to a target site for further processing
and analysis. This data can originate from a range of sources, including data lakes, IoT devices,
on-premises databases, and SaaS apps, and end up in different target environments, such as
Data in cloud object storage is handled as objects. Each object is assigned a key, which is a
unique identifier. When the key is paired with metadata that is attached to the objects, other
AWS services can use the information to unlock a multitude of capabilities. Thanks to
economies of scale, cloud object storage comes at a lower cost than traditional storage.
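The key-plus-metadata idea can be shown with a small in-memory model. This is a sketch only: `object_store`, `put_object`, and `keys_with_metadata` are hypothetical names for this example and do not represent the Amazon S3 API.

```python
# Hypothetical in-memory object store: each unique key maps to one object,
# which carries a body plus metadata attached to it.
object_store = {}


def put_object(key: str, body: bytes, metadata: dict) -> None:
    """Store an object under its key; writing the same key again replaces it."""
    object_store[key] = {"body": body, "metadata": dict(metadata)}


def keys_with_metadata(name: str, value: str) -> list:
    """Find objects by metadata, the way other services can act on it."""
    return sorted(
        key for key, obj in object_store.items()
        if obj["metadata"].get(name) == value
    )


put_object("raw/tickets/jan.json", b"{}", {"department": "support"})
put_object("raw/sales/jan.csv", b"", {"department": "sales"})
put_object("raw/tickets/feb.json", b"{}", {"department": "support"})

support_keys = keys_with_metadata("department", "support")
```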
•Might contain multiple databases that are organized into tables and columns
•ETL pipelines transform data in buffered memory prior to loading data into a data lake or
•ELT pipelines extract and load data into a data lake or data warehouse for storage without
transformation. Here are a few key points to summarize this section. Storage plays an integral
part in ELT and ETL pipelines. Data often moves in and out of storage numerous times, based
on pipeline type and workload type.
ETL pipelines transform data in buffered memory prior to loading data into a data lake or
ELT pipelines extract and load data into data lake or data warehouse storage without
transformation. The transformation of the data is part of the target system’s workload.
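The difference between the two pipeline types comes down to where the transformation runs. The sketch below contrasts them with plain Python lists standing in for the source and the target; the sample rows and the `transform` function are assumptions for illustration.

```python
def transform(record: dict) -> dict:
    """Example transformation: fix the type of the id and tidy the name."""
    return {
        "customer_id": int(record["id"]),
        "name": record["name"].strip().title(),
    }


source_rows = [{"id": "1", "name": " ada "}, {"id": "2", "name": "GRACE"}]


def run_etl(rows: list) -> list:
    """ETL: transform in the pipeline's memory, then load the result."""
    transformed = [transform(r) for r in rows]
    target = []
    target.extend(transformed)  # only transformed data reaches the target
    return target


def run_elt(rows: list) -> tuple:
    """ELT: load the data untouched; the target system transforms it later."""
    target_raw = list(rows)  # loaded without transformation
    target_view = [transform(r) for r in target_raw]  # target-side workload
    return target_raw, target_view


etl_target = run_etl(source_rows)
elt_raw, elt_view = run_elt(source_rows)
```

Both paths end with the same transformed rows; ELT additionally keeps the raw copy in the target, which is why storage is touched more than once in that pattern.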
Securing Storage
•Amazon Redshift database security is distinct from the security of the service itself.
•Due to third-party auditing, Amazon Redshift can help to support applications that are
•Big data solution for petabyte-scale data processing, interactive analytics, and machine
learning
For accessibility: Data from multiple sources is put in Amazon S3, where Athena can be used
for one-time queries. Amazon EMR aggregates the data and stores the aggregates in S3.
Athena can be used to query the aggregated datasets. From S3, the data can be used in Amazon
Redshift, where QuickSight can access the data to create visualizations. End of accessibility
description.
When you build infrastructure with code, you gain the benefits of repeatability and reusability
while you build your environments. In the example shown, a single template is used to deploy
Network Load Balancers and Auto Scaling groups that contain Amazon Elastic Compute
Cloud (Amazon EC2) instances. Network Load Balancers distribute traffic evenly across
targets.
CI/CD:
CI/CD can be pictured as a pipeline, where new code is submitted on one end, tested over a
series of stages (source, build, test, staging, and production), and then published as
production-ready code.
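The pipeline picture can be sketched as an ordered list of stages in which a change moves forward only while every stage passes. The stage names come from the text; the `run_pipeline` function and the idea of passing one check function per stage are assumptions for illustration, not the interface of any particular CI/CD service.

```python
STAGES = ["source", "build", "test", "staging", "production"]


def run_pipeline(checks: dict) -> tuple:
    """Run the stages in order and stop at the first one that fails.

    `checks` maps a stage name to a function returning True on success.
    Stages without a check are treated as passing.
    Returns (completed_stages, failed_stage_or_None).
    """
    completed = []
    for stage in STAGES:
        check = checks.get(stage, lambda: True)
        if not check():
            return completed, stage
        completed.append(stage)
    return completed, None


# A change that passes every stage is published.
published, failed = run_pipeline({})

# A change that fails its tests never reaches staging or production.
partial, failed_at = run_pipeline({"test": lambda: False})
```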
CONCLUSION
Data engineering is a critical component in the modern data landscape, playing a crucial role
Data engineering serves as the foundation for extracting, transforming, and loading (ETL) data
from diverse sources into a format suitable for analysis. This process is essential for generating
Maintaining data quality and integrity is paramount in data engineering. Data engineers are
responsible for cleaning, validating, and ensuring the accuracy of data, contributing to the
With the increasing volume, velocity, and variety of data, data engineering solutions must be
scalable and performant. Scalability ensures that systems can handle growing amounts of data,
while performance optimization ensures timely processing and availability of data for analytics.
Data engineering enables the integration of data from various sources, whether structured or